Liza Levina is the Vijay Nair Collegiate Professor of Statistics at the University of Michigan, as well as affiliated faculty at the Michigan Institute for Data Science and the Center for the Study of Complex Systems. She received her PhD in Statistics from UC Berkeley in 2002, and has been at the University of Michigan since. She is well known for her work on high-dimensional statistical inference and statistical network analysis. Her current application interests are focused on neuroimaging. She is a recipient of the ASA Noether Young Scholar Award, a fellow of the ASA and the IMS, a 2016 Web of Science Highly Cited Researcher, and an invited speaker at the 2018 International Congress of Mathematicians. She has served the IMS in multiple capacities and is currently a council member and a co-chair of the IMS Task Force on Data Science. Liza Levina will deliver her Medallion Lecture at JSM Denver, July 27–August 1, 2019.

 

Hierarchical communities in networks

Network data have become increasingly common in many fields, with interesting scientific phenomena discovered through the analysis of biological, social, ecological, and various other networks. Among various network analysis tasks, community detection (the task of clustering network nodes into groups with similar connection patterns) has been one of the most studied, due to the ubiquity of communities in real-world networks and the appealing mathematical formulations that lend themselves to analysis. For the most part, community detection has been formulated as the problem of finding a single partition of the network into some “correct” number of communities. However, it is both well known in practice and supported by theory that nearly all the algorithms and models proposed for this type of community detection do not work well when the number of communities is large. We argue that for large networks, a hierarchy of communities is preferable to such a partition, since multiple partitions at different scales frequently make more sense in real networks, and the hierarchy can be scientifically meaningful, like an evolutionary tree. A hierarchical tree, with larger communities subdivided into smaller ones, offers a natural and very interpretable representation of community structure, and simplifies the problem of estimating the potentially large number of communities from the entire network. In addition, a hierarchy gives us much more information than any “flat” partition, by indicating how close communities are through their tree distance. Finally, recursive splitting is more computationally efficient, and, as we show, in some settings is more accurate. In particular, we show that even when the full community structure corresponding to the leaves of the tree is below the recovery threshold, we can still consistently recover the top levels of the tree as long as they are well separated, giving us partial but accurate information where a flat partition method would fail.

Many existing algorithms for hierarchical clustering can be modified to apply to networks. We adopt a simple top-down recursive partitioning algorithm, once popular in the clustering literature. It requires two tools that, in turn, can be chosen among many existing methods: an algorithm to partition a given network into two, and a stopping rule to decide whether there is more than one community in a given subnetwork. Given these two tools, the recursive (bi-)partitioning algorithm proceeds by starting with all nodes in one community, applying the stopping rule to decide whether a split is needed, applying the splitting algorithm to split into two communities if so, and continuing to apply this to every resulting subnetwork until the stopping rule indicates there are no further splits to make. This class of algorithms can be made model-free and tuning-free, and is computationally efficient, with the computational cost growing logarithmically in the number of communities rather than linearly, which is the case for most flat partition methods. We implement recursive partitioning by using regularized spectral clustering as the splitting rule, and the Bethe-Hessian estimator of the number of communities as the stopping rule, although any other consistent method can be used instead.

We analyze the algorithm’s theoretical performance under a natural framework for this setting, the binary tree stochastic block model. Under this model, we prove that the algorithm correctly recovers the entire community tree under mild growth assumptions on the average degree, allowing for sparse networks. Further, the assumptions to recover each level of the tree, which we make explicit, get strictly stronger as we move down the tree, illuminating the regime where recursive partitioning can correctly recover mega-communities at the higher levels of the hierarchy even when it cannot recover every community at the bottom of the tree. We show that in practice recursive partitioning outperforms “flat” spectral clustering on multiple performance metrics when the number of communities is large, and illustrate the algorithm on a dataset of statistics papers, constructing a highly interpretable tree of statistics research communities.

This is joint work with Tianxi Li (Univ. Virginia), Lihua Lei (UC Berkeley), Sharmodeep Bhattacharyya (Oregon State Univ.), Purnamrita Sarkar (Univ. Texas, Austin), and Peter J. Bickel (UC Berkeley). The manuscript is available at arXiv:1810.01509.