Note
For redescription mining, one considers entities described by variables divided into two sets, hereafter arbitrarily called left-hand side and right-hand side. This can be seen as a pair of data matrices, where entities are identified with rows and variables with columns. Both sets of variables describe the same entities; hence, the matrices have the same number of rows.
If you provide the same dataset for the left-hand and right-hand sides, this will be interpreted as a setting with a single dataset, where variables can appear on either side of redescriptions, but not on both sides of the same redescription. Variables can be selectively disabled on either side, to prevent them from being used in the corresponding query.
In Siren, data include:
Obviously, this is required.
Data can be imported to Siren via the interface menu. Below, we present the data formats supported by Siren.
Data can be imported into Siren as CSV files. The program expects a pair of files, one for each side, in character-separated values format, as can be imported into and exported from spreadsheet programs, for instance.
There are two main formats, full and sparse.
The two data files need not be in the same format.
If entity names and/or coordinates are provided, they will be used to match entities across the two sides. Otherwise, rows will be matched in order and an error will occur if the two sides do not contain the same number of rows.
In the full format, the data is stored as a table with one column for each variable and one row for each entity. The first row can contain the names of the variables. Entity names can be included as a column named id. Similarly, the coordinates can be included as a pair of columns named longitude and latitude, respectively.
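To make the layout concrete, here is a minimal sketch that reads a small table in this format and separates the special id, longitude and latitude columns from the ordinary variables. The comma delimiter, the coordinate values and the function name read_full are assumptions chosen for illustration; this is not Siren's own import code.

```python
import csv
import io

# Hypothetical file content: one row per entity, one column per variable.
# The comma delimiter and the coordinate values are illustrative assumptions.
FULL_CSV = """id,longitude,latitude,population
Espoo,24.66,60.21,260981
Helsinki,24.94,60.17,614074
"""

# Column names with a special meaning, as described in the text.
SPECIAL = ("id", "longitude", "latitude")


def read_full(text, delimiter=","):
    """Return a list of (entity name, coordinates, variable values) per row."""
    rows = []
    for row in csv.DictReader(io.StringIO(text), delimiter=delimiter):
        name = row.get("id")
        coords = None
        if "longitude" in row and "latitude" in row:
            coords = (float(row["longitude"]), float(row["latitude"]))
        # Everything that is not a special column is an ordinary variable.
        values = {k: v for k, v in row.items() if k not in SPECIAL}
        rows.append((name, coords, values))
    return rows


entities = read_full(FULL_CSV)
```

Values are kept as strings here, since deciding on the variable type is a separate step.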
The sparse format stores data that contains few non-zero entries more compactly, as in the Matlab sparse format (or like the edge list of a bipartite graph).
Each line contains an entry of the data as a triple (entity, variable, value). This way, the data is stored in three columns and as many rows as there are entries. In this case, the first line of the data file must contain id, cid and value, indicating the three columns containing the entities, variables and corresponding values, respectively. Coordinates can be provided in a similar way under the variable names longitude and latitude.
Variable names can be provided inline, that is, simply by using the name of the variable for each entry involving it. Alternatively, variable names can be specified separately with a special “-1” entity. Similarly, entity names can be provided inline or separately with a special “-1” variable. For example, the following five lines
id; cid; value
Espoo; population; 260981
Helsinki; population; 614074
Tampere; population; 220609
Turku; population; 182281
are equivalent to the following:
id; cid; value
20; -1; Espoo
7; -1; Tampere
2; -1; Turku
13; -1; Helsinki
-1; 3; population
2; 3; 182281
7; 3; 220609
13; 3; 614074
20; 3; 260981
Finally, in the case of fully Boolean data without coordinates, the value can be left out. Each (entity, variable) pair that appears is considered True, the rest False.
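As a rough illustration of this convention, the sketch below reads value-less pairs and treats every listed pair as True and every other pair as False. The entity and variable names, the two-column header and the helper names are hypothetical, not taken from Siren.

```python
def parse_boolean_sparse(lines, sep=";"):
    """Return the set of (entity, variable) pairs listed in the file."""
    true_pairs = set()
    for line in lines[1:]:  # skip the header line
        eid, cid = [part.strip() for part in line.split(sep)][:2]
        true_pairs.add((eid, cid))
    return true_pairs


def holds(true_pairs, entity, variable):
    """A pair is True if it was listed, False otherwise."""
    return (entity, variable) in true_pairs


# Hypothetical Boolean data: which city has which facility.
pairs = parse_boolean_sparse([
    "id; cid",
    "Espoo; metro",
    "Helsinki; metro",
    "Helsinki; tram",
])
```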
For both full and sparse formats, a type mention can be appended to the first row, in which case all variables will be parsed as the given type.
For instance, in the example above the first line would be changed to id; cid; value; type=N
to ensure that all variables, including population, are interpreted as numerical (N) variables. Similarly, B and C can be used to ensure that all variables are Boolean or categorical, respectively.
This can be useful when handling a dataset of numerical variables where some contain only two distinct values and might otherwise be interpreted as Boolean variables. It can also be a handy way to turn a dataset into a fully Boolean one based on zero/non-zero values. However, be warned that this can cause problems…
Note
The product of redescription mining is a list of redescriptions. A redescription consists of a pair of queries over the variables describing the entities, one query for each set. The two sets of variables are arbitrarily called left-hand side and right-hand side, and so are the corresponding queries.
The support of a query is the set of entities for which the query holds. In the absence of missing entries, any given redescription partitions the entities into four sets: those supporting only the left-hand side query, those supporting only the right-hand side query, those supporting both queries, and those supporting neither.
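The partition induced by the two supports can be sketched with plain Python sets. The entity labels and the function name partition are made up for the example, and missing entries are assumed to be absent.

```python
def partition(entities, supp_left, supp_right):
    """Split entities into four disjoint sets from the two query supports."""
    return {
        "left_only": supp_left - supp_right,
        "right_only": supp_right - supp_left,
        "both": supp_left & supp_right,
        "neither": entities - (supp_left | supp_right),
    }


# Hypothetical entities and supports.
entities = {"e1", "e2", "e3", "e4", "e5"}
supp_left = {"e1", "e2", "e3"}   # entities where the left-hand side query holds
supp_right = {"e2", "e3", "e4"}  # entities where the right-hand side query holds

parts = partition(entities, supp_left, supp_right)
```

The four sets are pairwise disjoint and together cover all entities.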
Redescriptions can be imported to Siren via the interface menu. More importantly, they can be exported via the interface menu and the contextual menu for a list of redescriptions. Below, we present the redescription formats supported by Siren.
A query is formed by combining literals using Boolean operators.
While ReReMi only generates linearly parsable queries (see references for more details), Siren can actually evaluate arbitrary queries, as long as they are well formed following the informal grammar below. In particular, parentheses should be used to separate conjunctive and disjunctive blocks, alternating between operators. For example, \((a \land{} b) \lor{} \lnot{} c\) and \((a \land{} b) \lor{} (c \land{} d)\) are both supported, although the latter cannot be generated by ReReMi. \((a \land{} b) \land{} (c \land{} d)\) is not, because the operators do not alternate between parenthesized blocks. It should simply be written as \(a \land{} b \land{} c \land{} d\).
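The alternation rule can be illustrated with a small checker over a nested-tuple representation of queries. This representation (a tuple starting with "and" or "or", with literals as strings and "!" marking negation) is a hypothetical one chosen for the example; it is not Siren's query parser.

```python
def is_well_formed(query):
    """Check that nested blocks alternate between "and" and "or"."""
    if isinstance(query, str):
        return True  # a single literal, possibly negated, e.g. "!c"
    op, *children = query
    if op not in ("and", "or") or len(children) < 2:
        return False
    for child in children:
        # A parenthesized block must use the other operator than its parent.
        if not isinstance(child, str) and child[0] == op:
            return False
        if not is_well_formed(child):
            return False
    return True


ok_negation = is_well_formed(("or", ("and", "a", "b"), "!c"))
ok_two_blocks = is_well_formed(("or", ("and", "a", "b"), ("and", "c", "d")))
bad_nesting = is_well_formed(("and", ("and", "a", "b"), ("and", "c", "d")))
flat = is_well_formed(("and", "a", "b", "c", "d"))
```

The third query is rejected because a conjunctive block contains conjunctive blocks, while its flat rewriting is accepted.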
We consider three types of literals, defined over a Boolean, categorical or numerical variable respectively.
Below is an informal grammar of Siren’s query language. The actual grammar can be found in the redquery.ebnf file in the siren.reremi source directory.
Tip
Naturally, the type of literal and the type of variable should match, i.e., \([4.0 \leq{} Va \leq{} 8.32]\) is a valid numerical literal only if the corresponding variable \(Va\) is a numerical variable. Furthermore, the upper bound of a numerical literal should always be greater than or equal to the lower bound, and at least one of the two bounds should be specified.
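These constraints on numerical literals can be expressed as a short validity check. The function name and the use of None for an unspecified bound are assumptions for this sketch; the type code N follows the type mention used in the data formats.

```python
def valid_numerical_literal(variable_type, lower=None, upper=None):
    """Check the constraints on a numerical literal [lower <= V <= upper]."""
    if variable_type != "N":
        return False  # literal type and variable type must match
    if lower is None and upper is None:
        return False  # at least one bound must be specified
    if lower is not None and upper is not None and upper < lower:
        return False  # upper bound must not be below the lower bound
    return True


two_sided = valid_numerical_literal("N", lower=4.0, upper=8.32)
one_sided = valid_numerical_literal("N", lower=4.0)
wrong_type = valid_numerical_literal("B", lower=4.0, upper=8.32)
inverted = valid_numerical_literal("N", lower=8.32, upper=4.0)
unbounded = valid_numerical_literal("N")
```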
The statistics of a redescription include:
Redescriptions from the Redescriptions tab can be exported to a file, one redescription per line, with both queries and basic statistics tab separated. Three formatting options are available, determined by the provided filename:
*[^a-zA-Z0-9]named[^a-zA-Z0-9]*
.*[^a-zA-Z0-9]support[^a-zA-Z0-9]*
.*[^a-zA-Z0-9]all[^a-zA-Z0-9]*, disabled redescriptions will also be printed.
If the filename has a .tex extension, a TeX file is produced that can be compiled to obtain a table of the redescriptions. Three table layouts are available, where the information for each redescription is listed on one, two or three rows respectively, selected when the filename matches the pattern *[^a-zA-Z0-9][1-3].[a-z]*$. Note that this format cannot be imported back.
Inside a Siren package, the redescriptions are stored in tab-separated format.
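To see how a filename selects the formatting options, the following sketch tests a name against the three option patterns with Python's re module. Two points are assumptions: a leading "." is added to the first pattern so that it is a valid regular expression, and re.match (anchored at the start of the name only) is used. The function name export_options is hypothetical, and Siren's own matching may differ.

```python
import re

# Patterns taken from the text. The leading "." in the first pattern is an
# assumption, since a regular expression cannot start with "*".
OPTION_PATTERNS = {
    "named": r".*[^a-zA-Z0-9]named[^a-zA-Z0-9]*",
    "support": r".*[^a-zA-Z0-9]support[^a-zA-Z0-9]*",
    "all": r".*[^a-zA-Z0-9]all[^a-zA-Z0-9]*",
}


def export_options(filename):
    """Return the names of the formatting options triggered by a filename."""
    return {
        name
        for name, pattern in OPTION_PATTERNS.items()
        if re.match(pattern, filename)
    }


only_all = export_options("reds_all.queries")
two_options = export_options("reds_named_support.txt")
no_option = export_options("reds.txt")
```

Note that each keyword must be preceded by a non-alphanumeric character, so a separator such as an underscore is needed before it.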
The fields included when exporting redescriptions and when displaying them in the interface can be set via the menu entries.
Tab-separated formats can be imported into Siren; TeX cannot.