This program uses local search optimization to find an approximately optimal clustering of an undirected graph according to some objective function. Many different objectives are supported. There are bindings for Octave and MATLAB.
Formally, the problem that is solved is to find a clustering C of the nodes of a graph that minimizes the loss

    loss(C) = f(sum_{c ∈ C} g(c))

for some functions f and g that depend on the chosen objective.
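To make this shape concrete, here is a minimal Python sketch (an illustration, not code from this package). It uses the simplest objective listed below, `'num'`, for which g(c) = 1 and f(x) = x, so the loss is just the number of clusters:

```python
# All objectives share the shape loss(C) = f(sum_{c in C} g(c)).
# A trivial instance is the 'num' objective listed below:
# g(c) = 1 and f(x) = x give loss(C) = |C|, the number of clusters.

def loss(clustering, f, g):
    return f(sum(g(c) for c in clustering))

clusters = [{0, 1, 2}, {3, 4}, {5}]  # a clustering with three clusters
num_loss = loss(clusters, lambda x: x, lambda c: 1)
print(num_loss)  # 3
```

The objectives described later (modularity, ncut, ...) differ only in their choice of f and g.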
The optimization method is the one introduced by Blondel et al. [2], and is also used by [1] and [3].
To build the stand-alone version:

    make

To build the Octave interface, start `octave` and run:

    make_octave

To build the MATLAB interface, start `matlab` and run:

    make_matlab
For the Octave and MATLAB bindings, the usage is:

    [c,l,k] = lso_cluster(A, <optional parameters>);

Where:

- `A` is a symmetric weighted adjacency matrix. It can be sparse. The optional parameters are a list of `'name',value` pairs, see below.
- `c` is the labeling that represents the clustering. The first cluster will have label `0`, the next label `1`, etc.
- `l` is the objective (loss) value of the clustering.
- `k` is the number of clusters, `k = numel(unique(c))`.
Optional parameters regarding the clustering:

- `'eval', clustering`: Evaluate the objective function on `clustering`, do not optimize.
- `'init', clustering`: Use the given `clustering` as initial value for the optimization. Default: each node is initially in a separate cluster, i.e. `clustering = 1:length(A)`.
Optional parameters regarding the objective:

- `'loss', loss`: Use the given loss/objective function, specified by a string name. See below for a list of supported loss functions. Default: `loss = 'modularity'`.
- `'total_volume', m`: Replace the total volume (the sum of all edge weights) by `m`. Many objectives use the total volume for normalization, and changing it changes the scale at which clusters are found. Usually increasing the total volume results in larger clusters. Default: `m = sum(sum(A))`.
- `'extra_loss', alpha`: Add a term to the loss function that penalizes the volume of clusters, with weight `alpha`.
- `'num_clusters', n`: Force the solution to have the given number of clusters. The algorithm uses a binary search to alter the objective until it finds a solution with the given number of clusters. The alteration is the same as the one used by `extra_loss`.
- `'min_num_cluster', n`: Force the solution to have at least the given number of clusters.
- `'max_num_cluster', n`: Force the solution to have at most the given number of clusters.
- `'max_cluster_size', n`: Allow clusters to have at most `n` nodes.
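The `'num_clusters'` mechanism can be pictured as a binary search over the `extra_loss`-style penalty weight. The Python sketch below is only an illustration of that idea, not the package's implementation; `clusters_found` is a hypothetical stand-in for one run of the optimizer with a given penalty:

```python
# Sketch: binary search over the penalty weight alpha until the optimizer
# returns the requested number of clusters. A stronger volume penalty is
# assumed to produce more, smaller clusters.

def clusters_found(alpha):
    # Hypothetical stand-in for a clustering run with penalty weight alpha;
    # here we fake a monotone response in alpha for illustration.
    return max(1, min(10, int(1 + 9 * (alpha + 1) / 2)))

def force_num_clusters(target, num_loss_tweaks=32):
    # num_loss_tweaks mirrors the option of the same name (default 32)
    lo, hi = -1.0, 1.0
    best = None
    for _ in range(num_loss_tweaks):
        alpha = (lo + hi) / 2
        k = clusters_found(alpha)
        best = (alpha, k)
        if k == target:
            break
        if k < target:
            lo = alpha   # penalize volume more -> more clusters
        else:
            hi = alpha   # penalize volume less -> fewer clusters
    return best
```

With a monotone response like this, the search converges on a weight giving exactly the requested cluster count.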
Optional parameters about internal algorithm details; you only need these if you know what you are doing:

- `'seed', random_seed`: Use the given random seed. By default a fixed seed is used, so repeated runs with the same input give the same output.
- `'num_repeats', n`: Repeat the search `n` times from scratch with different random seeds and return the best result. Default: 1.
- `'num_partitions', n`: Number of times to try to break apart the clusters and re-cluster them. Default: 0.
- `'optimize_exhaustive', bool`: Use an exhaustive search instead of local search. This is of course very slow. Can be combined with `max_num_cluster`. Default: false.
- `'optimize_higher_level', bool`: Use a hierarchical optimizer, where small clusters are considered as nodes of a higher-level graph. Default: true.
- `'always_consider_empty', bool`: Always consider the move of a node into a new singleton cluster. Default: true.
- `'num_loss_tweaks', n`: Maximum number of iterations in the binary search used to force the specified number of clusters. Default: 32.
- `'check_invariants', bool`: Check invariants of the algorithm (for debugging). Default: false.
- `'trace_file', filename`: Write a trace of the steps performed by the optimization algorithm to a file in JSON format.
- `'verbose', level`: Level of debug output. Levels go up to 7 and are increasingly chatty. Default: `level = 0`, i.e. no output.
The optional parameter `'loss'` specifies the loss function to use. The default is `'modularity'`.

In this section the notation is:

- `v_c`: the volume of cluster c, i.e. the sum of degrees/strengths of the nodes in c. In MATLAB notation: `v_c = sum(sum(A(C==c,:)))`.
- `w_c`: the weight of the edges within cluster c. In MATLAB notation: `w_c = sum(sum(A(C==c,C==c)))`.
- `n_c`: the number of nodes in cluster c.
- `m`: the total volume of the graph, `m = v_V`, where V is the set of all nodes. This is altered by the `'total_volume'` option.
Since the loss is minimized, some objectives are negated compared to their usual definition.
Some of the supported loss functions are:
- `'modularity'`: `loss = -sum_c (w_c/m - v_c^2/m^2)`. This is the negation of the usual definition.
- `'infomap'`: The infomap objective by [3].
- `'ncut'`: Normalized cut, `loss = sum_c (v_c - w_c) / v_c`.
- `'rcut'`: Ratio cut, `loss = sum_c (v_c - w_c) / n_c`.
- `{'pmod',p}`: Modularity with a different power, `loss = -sum_c (w_c/m - (v_c/m)^p / (p-1))`.
- `{'mom',m}`: Monotonic variant of modularity, `loss = -sum_c (w_c/(m + 2v_c) - (v_c/(m + 2v_c))^2)`.
- `'w-log-v'`: `loss = sum_c (w_c/m * log(v_c))`.
- `'num'`: `loss = |C|`. Minimize the number of clusters; this leads to finding the connected components.

See `loss_functions.hpp` for the full list.
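As a sanity check of the notation above, here is the default modularity loss computed on a tiny graph in pure Python (an illustration only; the package itself is implemented in C++ with MATLAB/Octave bindings):

```python
# Modularity loss from the table above: loss = -sum_c (w_c/m - v_c^2/m^2).

def modularity_loss(A, C, m):
    n = len(A)
    loss = 0.0
    for c in set(C):
        # v_c = sum(sum(A(C==c,:))) in MATLAB notation
        v_c = sum(A[i][j] for i in range(n) for j in range(n) if C[i] == c)
        # w_c = sum(sum(A(C==c,C==c))) in MATLAB notation
        w_c = sum(A[i][j] for i in range(n) for j in range(n)
                  if C[i] == c and C[j] == c)
        loss -= w_c / m - (v_c / m) ** 2
    return loss

# A 4-node graph: a triangle {0,1,2} plus a pendant node 3
A = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
C = [0, 0, 0, 1]            # cluster labels per node
m = sum(map(sum, A))        # total volume, m = sum(sum(A)) = 8
print(modularity_loss(A, C, m))  # 0.03125
```

Note the positive value: since this loss is the negation of modularity, lower (more negative) is better.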
Some loss functions have parameters; these are passed as a cell array in which the first element is the name, e.g. `{'pmod', p}`.
Find the clustering of a graph by optimizing modularity:

    % Build a graph that has a cluster structure
    A = (blkdiag(rand(4), rand(5), rand(6)) > 1-0.5) | (rand(15) > 1-0.05);
    A = A | A';
    % Find a clustering
    c = lso_cluster(A);
    % c = [0 0 0 0 1 1 1 1 1 2 2 2 2 2 2]'

Evaluate the modularity of a given clustering:

    c = [0 0 0 0 5 5 5 1 1 2 2 2 2 2 2];
    [ignored, l] = lso_cluster(A, 'eval', c);
    % l = -0.40633

Optimize infomap:

    c = lso_cluster(A, 'loss', 'infomap');

Find a solution with exactly 3 clusters:

    c = lso_cluster(A, 'num_clusters', 3);
Here is a larger example:

    % An LFR graph with mixing 0.6 and 1000 nodes (see [4])
    % The graph is in A, the ground truth is in c
    > load example-LFR.mat
    > [d,l,k] = lso_cluster(A,'loss','modularity');
    > normalized_mutual_information(c,d)
    ans = 0.93011
    % This solution has too few clusters
    > k
    k = 25
    > numel(unique(c))
    ans = 41
    % So, force the number of clusters to be 41
    > [d,l,k] = lso_cluster(A,'loss','modularity','num_clusters',41);
    > normalized_mutual_information(c,d)
    ans = 1
    % Or use a different loss function
    > [d,l,k] = lso_cluster(A,'loss','w-log-v');
    > normalized_mutual_information(c,d)
    ans = 1
Graphs should be given in a text file whose name is passed as the first argument (or `-` for stdin). Each line should look like:

    <from> <to> <weight>

where `<from>` and `<to>` are node identifiers; these are either strings or integers from `0...n`. Only nodes that have edges will be included in the graph. If needed, you can add edges `<n> <n> 0` to force the graph to contain a node `n`.

The graph is made symmetric, i.e. the reversed edges `<to> <from> <weight>` are also added.
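The input handling just described can be sketched in a few lines of Python (an illustrative reader, not the actual C++ parser):

```python
# Read "<from> <to> <weight>" lines into a symmetric adjacency structure.
from collections import defaultdict

def read_graph(lines):
    adj = defaultdict(dict)
    for line in lines:
        if not line.strip():
            continue
        src, dst, w = line.split()
        adj[src][dst] = float(w)
        adj[dst][src] = float(w)  # the graph is made symmetric
    return adj

g = read_graph(["a b 1.0", "b c 2.0", "d d 0"])  # "d d 0" forces node d
print(sorted(g))  # ['a', 'b', 'c', 'd']
```

Note how the zero-weight self-edge makes node `d` appear in the graph even though it has no real edges.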
The output is written as text to stdout; with `-o <filename>` the found clustering is written to a file instead. Each line of the output has the format:

    <node> <clusterid>

Nodes are listed in alphabetical order, or in numerical order if the `--numeric` flag is given.

Other parameters are specified as `--parameter value`, with the same names as in the MATLAB version. For example, `--loss infomap` uses the infomap loss function.

    $ lso_cluster example.in
[1] Graph clustering: does the optimization procedure matter more than the objective function?; Twan van Laarhoven and Elena Marchiori; Physical Review E 87, 012812 (2013)
[2] Fast unfolding of communities in large networks; Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte and Etienne Lefebvre; J. Stat. Mech. Theory Exp. 2008, P10008 (2008)
[3] Maps of random walks on complex networks reveal community structure; M. Rosvall and C. T. Bergstrom; Proc. Natl. Acad. Sci. USA 105, 1118 (2008)
[4] Benchmark graphs for testing community detection algorithms; A. Lancichinetti, S. Fortunato and F. Radicchi; Physical Review E 78, 046110 (2008)