Introduction

The research of Algorithmic Data Analysis group at UEF School of Computing can be divided into four main topics:

  • Matrix and tensor factorisations over non-standard algebras.
  • Hyberbolic community mining and modelling.
  • Redescription mining.
  • Applications of algorithmic data analysis to other fields.

The topic division is not strict; for example, in community mining we use techniques from matrix decompositions, and applications use all of the methods – and more – whatever is applicable.

Matrix and tensor factorisations over non-standard algebras

Matrix factorisations are used in data science for many different purposes, including dimensionality reduction, latent factor analysis and visualisation, to name but a few. Standard matrix factorisations such as PCA or NMF can only model linear relationships. Our approach to modelling non-linear relationships is to change the algebra under which the decomposition is done.

In Boolean matrix factorisation (BMF), the standard algebra is replaced with the Boolean algebra: Given a Boolean matrix \(A\) and an integer \(k\), the task is to find Boolean matrices \(B\) and \(C\) such that \[ \sum_{i,j}\biggl\lvert A_{ij} - \bigvee_{l=1}^k (B_{il}C_{lj})\biggr\rvert \] is minimized. Here \(\bigl(\bigvee_{l=1}^k B_{il}C_{lj}\bigr)_{ij}\) is the Boolean matrix product of \(B\) and \(C\). Further information about BMF can be found from our tutorial. Other work includes also Boolean shared subspace factorisations, dynamic Boolean matrix factorisations, choosing the rank using MDL principle, MDL-optimising BMF, and ordered BMF.

The Boolean matrix factorisation setup can naturally be extended to multi-way data, that is, tensors. This has applications, for example, in information extraction or RDF databases. Further information about Boolean tensors can be found from dedicated pages.

Boolean algebra is not the only algebra we have studied. Another algebra we have studied extensively is the subtropical algebra. The subtropical algebra is defined over the nonnegative real numbers \(\mathbb{R}_{\geq 0}\) with the multiplication defined as usual and the addition defined as \(x + y = \max\{x,y\}\). Hence, subtropical algebra is the semi-ring \((\mathbb{R_{\geq 0}}, \max, \times, 0, 1)\). Decompositions over subtropical algebra can be used to find ‘winner takes it all’ decompostions: instead of the typical NMF interpretation of each rank-1 component providing an additive part of the whole, subtropical decompositions explain the values in the data by their largest value in any component. This leads to naturally sparser decompositions, and to the ability to observe different types of structures than what is possible with NMF.

Further information about our research on subtropical factorisations can be found from its dedicated pages.

Hyperbolic communities

Community detection needs no introduction. The key problem in social network analysis has mostly concentrated on finding communities that are (quasi-)cliques, that is, communities where everyone knows (almost) everybody else. But communities might not look like cliques; indeed, they probably have a small ‘core’ inside them that does look like a clique, but they can also contain a large ‘tail’, where people only know the people in the core.

We propose to model such communities using our hyperbolic community model: two vertices, \(i\) and \(j\), ordered in descending order by degree, are connected to each other if and only if \((i+p)(j+p)\leq \theta\) for some parameters \(p\) and \(\theta\). Further information about the model, a random graph generator for it, and our results on analysing Q&A networks can be found from dedicated pages.

Redescription mining

Redescription mining aims at finding two ways to explain the same entities. To give a simple example, the redescription \[ \text{Eurasian lynx}\lor\text{Canada lynx} \sim [-24.4 \leq t^+_3 \leq 3.4] \] describes certain areas in the northern hemisphere as those areas that, on one hand, are inhabited by either Eurasian lynx or Canada lynx, and that, on the other hand, have maximum March temperature ranging from -24.4℃ to 3.4℃. To learn more about redescription mining, visit the web page of Siren software or read our book or our overview article at IEEE Intelligent Informatics Bulletin.

Applications

We participate in a wide range of applied projects regarding data science and data analysis at UEF. These include
  • North-Savonian Regional SOTE AI-Hub (ERDF/Pohjois-Savo Regional Council)
  • AI in lawmaking (Ministry of Transport and Communications)
  • Kuopio Biomedical Image Analysis Centre KUBIAC (in steering group)
  • UEF Centre for Bioinformatics (in steering group)
  • Academy of Finland consortium Utilizing Digital Data to Improve Working Conditions in Health Care (in steering group)