SOFA
====
This package contains the source code of the sofa algorithm from the following
publication:
	Biclustering and Boolean Matrix Factorization in Data Streams
	by Stefan Neumann and Pauli Miettinen
	PVLDB, 13(10): 1709-1722, 2020
	DOI: https://doi.org/10.14778/3401960.3401968
If you use this code in one of your papers, you need to cite the above
publication; we would be happy if you also sent us a mail about your paper.
If you use this code in your company, please send us a mail telling us what
you use the code for.  The source code is licensed under an MIT license.
See LICESE file for further details.


REQUIREMENTS
------------
To run the algorithm, you need to a reasonably new version of python3 (the 
code was tested with python3.6 and python3.8). Furthermore, the code is 
using the following packages: numpy, scipy, scikit-learn.

If you have pipenv, you can set up a working environment with 
    pipenv install
That will install the required packages and the centermodule package bundled 
with this code (see below).

If you do not use pipenv, you first have to compile centermodule. To compile
centermodule, you need to run the following command:
	python3 setup.py build && python3 setup.py install
If for whatever reason this does not work for you, centermodule is only used in
the files OnlineFL and CoverLeftSide. Both files contain Python code (commented
out) which can be used to substitute centermodule, but this will be much slower.


RUNNING THE ALGORITHM
---------------------
To see an exhaustive list of all parameters offered by sofa, run:
	python3 main.py -h

To run the BMF VERSION of sofa, call the algorithm as follows:
	python3 main.py -k 20 --maxNumCenters 200 --MGMaxCounter 2000 --s 500 --clustering doNothing --threshold auto --minClusterSize 20 --coverLeftSide FindAllAssoStyleTopK -i INPUT_FILE (--fileIsSparse) -o RIGHT_CLUSTER_OUTPUT --leftClusterOutputFile LEFT_CLUSTER_OUTPUT (--sparseOutput) --verbose 1

To run the BICLUSTERING VERSION of sofa, call the algorithm as follows:
	python3 main.py -k 20 --maxNumCenters 200 --MGMaxCounter 2000 --s 500 --clustering kmeans --threshold auto --minClusterSize 20 --coverLeftSide FindOne -i INPUT_FILE (--fileIsSparse) -o RIGHT_CLUSTER_OUTPUT --leftClusterOutputFile LEFT_CLUSTER_OUTPUT (--sparseOutput) --verbose 1


USING MULTIPLE ROUNGING THRESHOLDS
----------------------------------
If you wish to use multiple rounding thresholds (Section 5.4) at once, simply
pass the --threshold option multiple times. The output of the left and right
clusters will then add the suffix '_$threshold' to RIGHT_CLUSTER_OUTPUT and
LEFT_CLUSTER_OUTPUT.
(For example, if you pass '--threshold auto --threshold 0.5 --threshold 0.8',
 then sofa will return files RIGHT_CLUSTER_OUTPUT_auto,
 RIGHT_CLUSTER_OUTPUT_0.5 and RIGHT_CLUSTER_OUTPUT_0.8.)


COMMENTS ON I/O
---------------
sofa can use dense and sparse inputs.
- If the INPUT IS DENSE (default behavior), it should store the biadjacency
	matrix of the graph in a comma separated value format (i.e., it stores a
	Boolean m*n matrix in a csv format).
- If the INPUT IS SPARSE, you need to pass --fileIsSparse. In this case, we
	assume that each row of the input file is of the following format:
		leftVtxID,rightVtxID
	This indicates that the graph contains an edge (leftVtxID,rightVtxID). Both
	vertex IDs must be non-negative integers. For each left vertex, all right
	neighbors must appear consecutively. (This is satisfied, for instance, if
	the rows are sorted by leftVtxID.)
- If you want the OUTPUT TO BE DENSE, you do not have to pass any additional
	parameter (dense output is the default behavior). In this case, the output
	is a csv-file which contains one row for each cluster. If the i'th entry of
	the file is set to 1, then the i'th vertex of the left/right side of the
	graph is contained in this cluster.
- If you want the OUTPUT TO BE SPARSE, you need to pass --sparseOutput.
	In this case, the output contains one row for each cluster. Each row
	contains the indices of the vertices which are part of the cluster,
	separated by commas.