HyGen - Random Graph Generator for Graphs with Hyperbolic Community Structure
-----------------------------------------------------------------------------
This is the code package to accompany our work "Random Graph Generators for
Hyperbolic Community Structures". The code for HyGen is available in Matlab and
is based on our previous work "Modelling Communities That Are Not Cliques". The
code of that can be found under
http://people.mpi-inf.mpg.de/~pmiettin/hybobo/
Before you proceed here, please download that source code, follow its set-up
instructions and make sure it is in your Matlab path.
Choose now, if you want to re-do our experiments, or create own HyGen graphs.
-- In the latter case, you can jump immediatly to the last section.
----------------- Code and Data for our Experiments -----------------
Data
----
Within this package, we provide the parameter distributions we computed on
Amazon, DBLP, Friendster, and YouTube as CSV files. For each of these data
sets, that originate from SNAP (http://snap.stanford.edu/data), we have drawn
a sample of 500 communities of size between 100 and 1000 nodes and computed
their hyperbolic models. The columns in each file are the following:
1. Tail size H
2. Core size gamma
3. Community size
4. Number of edges inside the community
5. Number of edges outside the community
6. Inside-community area, i.e. number of possible edges inside
7. Outside-community area, i.e. number of possible edges outside
(among the nodes that take part in the community)
Also we provide versions of the Amazon, DBLP, Friendster, and YouTube data
where 500 non-overlapping communities of size 100 to 1000 nodes are in each
graph. YouTube has only 129 since this are already all communities of the
respective size in the original data. The graphs are stored by community in
the field called "res" along with their respective hyperbolic model. Each
field "comm" contains the subgraph forming the community; "members" lists the
corresponding node indices within the full graph.
fitDistribution_tabularEval.m
-----------------------------
Tests the fitting quality of different distribution functions. The result
is summarized in boxplots as displayed in Figure 2 in the manuscript. This
script requires the Matlab "allfitdist" by Mike Sheppard (2012), which can
be found under
https://github.com/dcherian/tools/blob/master/misc/allfitdist.m
Notice that the program "pdfcrop" is required to crop the white margins around
the figures after saving.
generateFromFit.m
-----------------
Given empirical distributions of the parameters, this script generates a
number of new random graphs (100 per dataset in our experiments) with the
same number of communities and similar parameter distributions. In addition
it fits new hyperbolic models on this result.
isValidParameterConfiguration.m
-------------------------------
Helper function to check if the sampled parameters are a valid combination.
fit_lfrOnRealData.m
-------------------
With this function, LFR can be tried on the provided data. Hyperbolic models
of the communities LFR provides are stored as the result. The LFR code is
expected to be installed within the same folder as this script and can be
obtained from
http://santo.fortunato.googlepages.com/benchmark.tgz
fit_sbmOnRealData.m
-------------------
With this function, DC-SBM can be tried on the provided data. Hyperbolic models
of the communities SBM provides are stored as the result. To run this function,
the SBM code is expected to be installed within the same folder. It can be
obtained from
https://github.com/ntamas/blockmodel
Notice that this package contains two different executables, "block-gen" and
"block-fit". Since no ground-truth community information can be obtained from
"block-gen", we need to assume that "block-fit" well finds the communities that
"block-gen" generated.
boxplotsOfRealDataResults.m
---------------------------
This script compares the parameter distributions of hyperbolic models from
HyGen, LFR, and DC-SBM to those on the original data. The result is displayed
as Figure 3 in the manuscript. Note that this function acts on the output of
generateFromFit.m, fit_lfrOnRealData.m, and fit_sbmOnRealData.m.
evaluateGeneratedGraphs.m
-------------------------
Computes the relative conditional entropy for each generated graph and outputs
its mean per data set. The results are reported in Section 8 of the manuscript.
conditionalEntropy.m
--------------------
Helper to compute the conditional entropy of graph Y given graph X.
entropy.m
---------
Helper to compute the entropy of graph Y.
----------------- Creating HyGen Graphs -----------------
While our experiments mainly focus on the creation of random graphs based on
observations from real-world data, we also provide an interface to directly
specify arbitrary distribution functions in either parametrization of the
hyperbolic model. Examples for both interfaces are given below.
createGraphFromObservations.m
-----------------------------
This function creates a random graph following observed parameter distributions
for the core size gamma, the tail height H, and the sizes of communities. The
number of communities in the generated graph is a parameter of this function.
Furthermore the densities of the inside-community and outside-community area
need to be provided. For the output, one can choose between a list with all the
communities, each accompanied by its parameters and its global node indices, or
an adjacency matrix with all hyperbolic communities already planted. The latter
will not carry the information about the community parameters anymore, but
might be useful for further processing.
Example of how to get a small random graph derived from the YouTube data:
dataArray = textscan(fopen('Data/paramDist_gammaH_youtube.csv','r'), ...
'%f%f%f%f%f%f%f%[^\n\r]', 'Delimiter', ',',...
'ReturnOnError', false);
Hs = dataArray{:,1};
gammas = dataArray{:,2};
sizes = dataArray{:,3};
averageInsideDensity = mean(dataArray{:,4}./dataArray{:,6});
averageOutsideDensity = mean(dataArray{:,5}./dataArray{:,7});
numCommunities = 10;
adjacencyOnly = true;
G = createGraphFromObservations(gammas,Hs,sizes,averageInsideDensity,...
averageOutsideDensity,numCommunities,...
adjacencyOnly);
createGraph.m
-------------
This function creates a random graph following the specified parameter
distributions. Parameter distributions may be specified using either definition
of the hyperbolic model, i.e. 'H,gamma', 'p,Theta', or 'x,Sigma'. The number
of communities and their minimal and maximal size in the generated graph are a
parameter of this function.
Furthermore the densities of the inside-community and outside-community area
need to be provided. For the output, one can choose between a list with all the
communities, each accompanied by its parameters and its global node indices, or
an adjacency matrix with all hyperbolic communities already planted. The latter
will not carry the information about the community parameters anymore, but
might be useful for further processing.
Example or how to obtian small random graphs using 'p,Theta' or 'x,Sigma':
numCommunities = 10;
adjacencyOnly = true;
G = createGraph('p,Theta',makedist('Normal'), ...
makedist('Normal', 'mu', 75, 'sigma', 10), ...
makedist('Normal', 'mu', 100, 'sigma', 1), ...
[50,150], 0.3, 0.03, numCommunities, adjacencyOnly);
G = createGraph('x,Sigma',makedist('Normal'), ...
makedist('Normal', 'mu', 75, 'sigma', 10), ...
makedist('Normal', 'mu', 100, 'sigma', 1), ...
[50,150], 0.3, 0.03, numCommunities, adjacencyOnly);