SOFA ==== This package contains the source code of the sofa algorithm from the following publication: Biclustering and Boolean Matrix Factorization in Data Streams by Stefan Neumann and Pauli Miettinen PVLDB, 13(10): 1709-1722, 2020 DOI: https://doi.org/10.14778/3401960.3401968 If you use this code in one of your papers, you need to cite the above publication; we would be happy if you also sent us a mail about your paper. If you use this code in your company, please send us a mail telling us what you use the code for. The source code is licensed under an MIT license. See LICESE file for further details. REQUIREMENTS ------------ To run the algorithm, you need to a reasonably new version of python3 (the code was tested with python3.6 and python3.8). Furthermore, the code is using the following packages: numpy, scipy, scikit-learn. If you have pipenv, you can set up a working environment with pipenv install That will install the required packages and the centermodule package bundled with this code (see below). If you do not use pipenv, you first have to compile centermodule. To compile centermodule, you need to run the following command: python3 setup.py build && python3 setup.py install If for whatever reason this does not work for you, centermodule is only used in the files OnlineFL and CoverLeftSide. Both files contain Python code (commented out) which can be used to substitute centermodule, but this will be much slower. RUNNING THE ALGORITHM --------------------- To see an exhaustive list of all parameters offered by sofa, run: python3 main.py -h To run the BMF VERSION of sofa, call the algorithm as follows: python3 main.py -k 20 --maxNumCenters 200 --MGMaxCounter 2000 --s 500 --clustering doNothing --threshold auto --minClusterSize 20 --coverLeftSide FindAllAssoStyleTopK -i INPUT_FILE (--fileIsSparse) -o RIGHT_CLUSTER_OUTPUT --leftClusterOutputFile LEFT_CLUSTER_OUTPUT (--sparseOutput) --verbose 1 To run the BICLUSTERING VERSION of sofa, call the algorithm as follows: python3 main.py -k 20 --maxNumCenters 200 --MGMaxCounter 2000 --s 500 --clustering kmeans --threshold auto --minClusterSize 20 --coverLeftSide FindOne -i INPUT_FILE (--fileIsSparse) -o RIGHT_CLUSTER_OUTPUT --leftClusterOutputFile LEFT_CLUSTER_OUTPUT (--sparseOutput) --verbose 1 USING MULTIPLE ROUNGING THRESHOLDS ---------------------------------- If you wish to use multiple rounding thresholds (Section 5.4) at once, simply pass the --threshold option multiple times. The output of the left and right clusters will then add the suffix '_$threshold' to RIGHT_CLUSTER_OUTPUT and LEFT_CLUSTER_OUTPUT. (For example, if you pass '--threshold auto --threshold 0.5 --threshold 0.8', then sofa will return files RIGHT_CLUSTER_OUTPUT_auto, RIGHT_CLUSTER_OUTPUT_0.5 and RIGHT_CLUSTER_OUTPUT_0.8.) COMMENTS ON I/O --------------- sofa can use dense and sparse inputs. - If the INPUT IS DENSE (default behavior), it should store the biadjacency matrix of the graph in a comma separated value format (i.e., it stores a Boolean m*n matrix in a csv format). - If the INPUT IS SPARSE, you need to pass --fileIsSparse. In this case, we assume that each row of the input file is of the following format: leftVtxID,rightVtxID This indicates that the graph contains an edge (leftVtxID,rightVtxID). Both vertex IDs must be non-negative integers. For each left vertex, all right neighbors must appear consecutively. (This is satisfied, for instance, if the rows are sorted by leftVtxID.) - If you want the OUTPUT TO BE DENSE, you do not have to pass any additional parameter (dense output is the default behavior). In this case, the output is a csv-file which contains one row for each cluster. If the i'th entry of the file is set to 1, then the i'th vertex of the left/right side of the graph is contained in this cluster. - If you want the OUTPUT TO BE SPARSE, you need to pass --sparseOutput. In this case, the output contains one row for each cluster. Each row contains the indices of the vertices which are part of the cluster, separated by commas.