HyGen - Random Graph Generator for Graphs with Hyperbolic Community Structure ----------------------------------------------------------------------------- This is the code package to accompany our work "Random Graph Generators for Hyperbolic Community Structures". The code for HyGen is available in Matlab and is based on our previous work "Modelling Communities That Are Not Cliques". The code of that can be found under http://people.mpi-inf.mpg.de/~pmiettin/hybobo/ Before you proceed here, please download that source code, follow its set-up instructions and make sure it is in your Matlab path. Choose now, if you want to re-do our experiments, or create own HyGen graphs. -- In the latter case, you can jump immediatly to the last section. ----------------- Code and Data for our Experiments ----------------- Data ---- Within this package, we provide the parameter distributions we computed on Amazon, DBLP, Friendster, and YouTube as CSV files. For each of these data sets, that originate from SNAP (http://snap.stanford.edu/data), we have drawn a sample of 500 communities of size between 100 and 1000 nodes and computed their hyperbolic models. The columns in each file are the following: 1. Tail size H 2. Core size gamma 3. Community size 4. Number of edges inside the community 5. Number of edges outside the community 6. Inside-community area, i.e. number of possible edges inside 7. Outside-community area, i.e. number of possible edges outside (among the nodes that take part in the community) Also we provide versions of the Amazon, DBLP, Friendster, and YouTube data where 500 non-overlapping communities of size 100 to 1000 nodes are in each graph. YouTube has only 129 since this are already all communities of the respective size in the original data. The graphs are stored by community in the field called "res" along with their respective hyperbolic model. Each field "comm" contains the subgraph forming the community; "members" lists the corresponding node indices within the full graph. fitDistribution_tabularEval.m ----------------------------- Tests the fitting quality of different distribution functions. The result is summarized in boxplots as displayed in Figure 2 in the manuscript. This script requires the Matlab "allfitdist" by Mike Sheppard (2012), which can be found under https://github.com/dcherian/tools/blob/master/misc/allfitdist.m Notice that the program "pdfcrop" is required to crop the white margins around the figures after saving. generateFromFit.m ----------------- Given empirical distributions of the parameters, this script generates a number of new random graphs (100 per dataset in our experiments) with the same number of communities and similar parameter distributions. In addition it fits new hyperbolic models on this result. isValidParameterConfiguration.m ------------------------------- Helper function to check if the sampled parameters are a valid combination. fit_lfrOnRealData.m ------------------- With this function, LFR can be tried on the provided data. Hyperbolic models of the communities LFR provides are stored as the result. The LFR code is expected to be installed within the same folder as this script and can be obtained from http://santo.fortunato.googlepages.com/benchmark.tgz fit_sbmOnRealData.m ------------------- With this function, DC-SBM can be tried on the provided data. Hyperbolic models of the communities SBM provides are stored as the result. To run this function, the SBM code is expected to be installed within the same folder. It can be obtained from https://github.com/ntamas/blockmodel Notice that this package contains two different executables, "block-gen" and "block-fit". Since no ground-truth community information can be obtained from "block-gen", we need to assume that "block-fit" well finds the communities that "block-gen" generated. boxplotsOfRealDataResults.m --------------------------- This script compares the parameter distributions of hyperbolic models from HyGen, LFR, and DC-SBM to those on the original data. The result is displayed as Figure 3 in the manuscript. Note that this function acts on the output of generateFromFit.m, fit_lfrOnRealData.m, and fit_sbmOnRealData.m. evaluateGeneratedGraphs.m ------------------------- Computes the relative conditional entropy for each generated graph and outputs its mean per data set. The results are reported in Section 8 of the manuscript. conditionalEntropy.m -------------------- Helper to compute the conditional entropy of graph Y given graph X. entropy.m --------- Helper to compute the entropy of graph Y. ----------------- Creating HyGen Graphs ----------------- While our experiments mainly focus on the creation of random graphs based on observations from real-world data, we also provide an interface to directly specify arbitrary distribution functions in either parametrization of the hyperbolic model. Examples for both interfaces are given below. createGraphFromObservations.m ----------------------------- This function creates a random graph following observed parameter distributions for the core size gamma, the tail height H, and the sizes of communities. The number of communities in the generated graph is a parameter of this function. Furthermore the densities of the inside-community and outside-community area need to be provided. For the output, one can choose between a list with all the communities, each accompanied by its parameters and its global node indices, or an adjacency matrix with all hyperbolic communities already planted. The latter will not carry the information about the community parameters anymore, but might be useful for further processing. Example of how to get a small random graph derived from the YouTube data: dataArray = textscan(fopen('Data/paramDist_gammaH_youtube.csv','r'), ... '%f%f%f%f%f%f%f%[^\n\r]', 'Delimiter', ',',... 'ReturnOnError', false); Hs = dataArray{:,1}; gammas = dataArray{:,2}; sizes = dataArray{:,3}; averageInsideDensity = mean(dataArray{:,4}./dataArray{:,6}); averageOutsideDensity = mean(dataArray{:,5}./dataArray{:,7}); numCommunities = 10; adjacencyOnly = true; G = createGraphFromObservations(gammas,Hs,sizes,averageInsideDensity,... averageOutsideDensity,numCommunities,... adjacencyOnly); createGraph.m ------------- This function creates a random graph following the specified parameter distributions. Parameter distributions may be specified using either definition of the hyperbolic model, i.e. 'H,gamma', 'p,Theta', or 'x,Sigma'. The number of communities and their minimal and maximal size in the generated graph are a parameter of this function. Furthermore the densities of the inside-community and outside-community area need to be provided. For the output, one can choose between a list with all the communities, each accompanied by its parameters and its global node indices, or an adjacency matrix with all hyperbolic communities already planted. The latter will not carry the information about the community parameters anymore, but might be useful for further processing. Example or how to obtian small random graphs using 'p,Theta' or 'x,Sigma': numCommunities = 10; adjacencyOnly = true; G = createGraph('p,Theta',makedist('Normal'), ... makedist('Normal', 'mu', 75, 'sigma', 10), ... makedist('Normal', 'mu', 100, 'sigma', 1), ... [50,150], 0.3, 0.03, numCommunities, adjacencyOnly); G = createGraph('x,Sigma',makedist('Normal'), ... makedist('Normal', 'mu', 75, 'sigma', 10), ... makedist('Normal', 'mu', 100, 'sigma', 1), ... [50,150], 0.3, 0.03, numCommunities, adjacencyOnly);