Fundamentals

Mining frequent itemsets

A.1
Aggarwal, Charu C (2015). Data Mining: The Textbook. Springer. doi:10.1007/978-3-319-14142-8. PDF. BibTeX.
@book{aggarwal_data_2015,
    author = {Aggarwal, Charu C},
    title = {Data Mining: The Textbook},
    publisher = {Springer},
    year = {2015},
    doi = {10.1007/978-3-319-14142-8}
}
A.2
Agrawal, Rakesh, Tomasz Imieliński, and Arun Swami (1993). Mining association rules between sets of items in large databases. ACM SIGMOD Record, 22(2):207–216. doi:10.1145/170036.170072. PDF. BibTeX.
@article{agrawal_mining_1993,
    author = {Agrawal, Rakesh and Imieliński, Tomasz and Swami, Arun},
    title = {Mining association rules between sets of items in large databases},
    volume = {22},
    doi = {10.1145/170036.170072},
    pages = {207--216},
    number = {2},
    journal = {{ACM} {SIGMOD} Record},
    year = {1993}
}
Abstract.
We are given a large database of customer transactions. Each transaction consists of items purchased by a customer in a visit. We present an efficient algorithm that generates all significant association rules between items in the database. The algorithm incorporates buffer management and novel estimation and pruning techniques. We also present results of applying this algorithm to sales data obtained from a large retailing company, which shows the effectiveness of the algorithm.
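
The support/confidence framework defined in this paper underlies most of the work below. As a minimal illustration (a toy sketch, not the paper's buffer-managed algorithm; data and names are invented), the support and confidence of a rule X -> Y can be computed as follows:

# Toy support/confidence computation for an association rule X -> Y.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"bread", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Frequency of the consequent among transactions containing the antecedent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({"bread", "milk"}, transactions))       # 0.4
print(confidence({"bread"}, {"milk"}, transactions))   # 0.5
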
A.3
Agrawal, Rakesh and Ramakrishnan Srikant (1994). Fast algorithms for mining association rules. In Proceedings of 20th International Conference on Very Large Data Bases, VLDB '94, 487–499. Morgan Kaufmann. PDF. BibTeX.
@inproceedings{agrawal_fast_1994,
    author = {Agrawal, Rakesh and Srikant, Ramakrishnan},
    title = {Fast Algorithms for Mining Association Rules},
    pages = {487--499},
    booktitle = {Proceedings of 20th International Conference on Very Large Data Bases, {VLDB} '94},
    publisher = {Morgan Kaufmann},
    year = {1994}
}
Abstract.
We consider the problem of discovering association rules between items in a large database of sales transactions. We present two new algorithms for solving this problem that are fundamentally different from the known algorithms. Experiments with synthetic as well as real-life data show that these algorithms outperform the known algorithms by factors ranging from three for small problems to more than an order of magnitude for large problems. We also show how the best features of the two proposed algorithms can be combined into a hybrid algorithm, called AprioriHybrid. Scale-up experiments show that AprioriHybrid scales linearly with the number of transactions. AprioriHybrid also has excellent scale-up properties with respect to the transaction size and the number of items in the database.
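
To make the level-wise idea concrete, here is a minimal Apriori-style miner (a sketch of the generate-and-prune scheme only, not the paper's optimized Apriori/AprioriTid/AprioriHybrid implementations; the data and threshold are toy values):

from itertools import combinations

def apriori(transactions, min_support):
    """Minimal level-wise frequent-itemset mining in the spirit of Apriori."""
    n = len(transactions)

    def frequent_of(candidates):
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        return {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}

    # Level 1: frequent single items.
    frequent = frequent_of({frozenset([i]) for t in transactions for i in t})
    result, k = dict(frequent), 2
    while frequent:
        # Join frequent (k-1)-itemsets, then prune candidates having an
        # infrequent (k-1)-subset (anti-monotonicity of support).
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        frequent = frequent_of(candidates)
        result.update(frequent)
        k += 1
    return result

transactions = [frozenset(t) for t in
                [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]]
print(apriori(transactions, min_support=0.6))
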
A.4
Bastide, Yves et al. (2000). Mining minimal non-redundant association rules using frequent closed itemsets. In Proceedings of the International Conference on Computational Logic, CL '00, Lecture Notes in Computer Science, 972–986. Springer. doi:10.1007/3-540-44957-4_65. PDF. BibTeX.
@inproceedings{bastide_mining_2000,
    author = {Bastide, Yves and Pasquier, Nicolas and Taouil, Rafik and Stumme, Gerd and Lakhal, Lotfi},
    title = {Mining Minimal Non-redundant Association Rules Using Frequent Closed Itemsets},
    doi = {10.1007/3-540-44957-4_65},
    series = {Lecture Notes in Computer Science},
    pages = {972--986},
    booktitle = {Proceedings of the International Conference on Computational Logic, {CL} '00},
    publisher = {Springer},
    year = {2000}
}
Abstract.
The problem of the relevance and the usefulness of extracted association rules is of primary importance because, in the majority of cases, real-life databases lead to several thousands of association rules with high confidence and among which are many redundancies. Using the closure of the Galois connection, we define two new bases for association rules whose union is a generating set for all valid association rules with support and confidence. These bases are characterized using frequent closed itemsets and their generators; they consist of the non-redundant exact and approximate association rules having minimal antecedents and maximal consequents, i.e. the most relevant association rules. Algorithms for extracting these bases are presented and results of experiments carried out on real-life databases show that the proposed bases are useful, and that their generation is not time consuming.
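
The closure operator behind these bases is easy to state operationally: the closure of an itemset is the set of items common to all transactions containing it, and an itemset is closed when it equals its own closure (a generator is a minimal itemset with a given closure). A small sketch under those definitions (toy data, invented names):

def closure(itemset, transactions):
    """Items shared by every transaction containing `itemset` (Galois closure)."""
    covering = [set(t) for t in transactions if itemset <= t]
    if not covering:
        return set(itemset)
    return set.intersection(*covering)

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "c", "d"}]
print(closure({"a"}, transactions))       # {'a', 'b'}: {'a'} is a generator, not closed
print(closure({"a", "b"}, transactions))  # {'a', 'b'}: a closed itemset
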
A.5
Bonchi, Francesco and Claudio Lucchese (2004). On closed constrained frequent pattern mining. In Proceedings of the 4th IEEE International Conference on Data Mining, ICDM '04, 35–42. IEEE Computer Society. doi:10.1109/ICDM.2004.10093. PDF. BibTeX.
@inproceedings{bonchi_closed_2004,
    author = {Bonchi, Francesco and Lucchese, Claudio},
    title = {On Closed Constrained Frequent Pattern Mining},
    doi = {10.1109/ICDM.2004.10093},
    pages = {35--42},
    booktitle = {Proceedings of the 4th {IEEE} International Conference on Data Mining, {ICDM} '04},
    publisher = {{IEEE} Computer Society},
    year = {2004}
}
Abstract.
Constrained frequent patterns and closed frequent patterns are two paradigms aimed at reducing the set of extracted patterns to a smaller, more interesting, subset. Although a lot of work has been done with both these paradigms, there is still confusion around the mining problem obtained by joining closed and constrained frequent patterns in a unique framework. In this paper we shed light on this problem by providing a formal definition and a thorough characterization. We also study computational issues and show how to combine the most recent results in both paradigms, providing a very efficient algorithm which exploits the two requirements (satisfying constraints and being closed) together at mining time in order to reduce the computation as much as possible.
A.6
Brin, Sergey et al. (1997). Dynamic itemset counting and implication rules for market basket data. ACM SIGMOD Record, 26(2):255–264. doi:10.1145/253262.253325. PDF. BibTeX.
@article{brin_dynamic_1997,
    author = {Brin, Sergey and Motwani, Rajeev and Ullman, Jeffrey D. and Tsur, Shalom},
    title = {Dynamic itemset counting and implication rules for market basket data},
    volume = {26},
    doi = {10.1145/253262.253325},
    pages = {255--264},
    number = {2},
    journal = {{ACM} {SIGMOD} Record},
    year = {1997}
}
Abstract.
We consider the problem of analyzing market-basket data and present several important contributions. First, we present a new algorithm for finding large itemsets which uses fewer passes over the data than classic algorithms, and yet uses fewer candidate itemsets than methods based on sampling. We investigate the idea of item reordering, which can improve the low-level efficiency of the algorithm. Second, we present a new way of generating “implication rules,” which are normalized based on both the antecedent and the consequent and are truly implications (not simply a measure of co-occurrence), and we show how they produce more intuitive results than other methods. Finally, we show how different characteristics of real data, as opposed to synthetic data, can dramatically affect the performance of the system and the form of the results.
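
The "implication rules" here are scored by a measure the authors call conviction, which (as far as I recall) is P(A)P(not B) / P(A and not B), equivalently (1 - supp(B)) / (1 - conf(A -> B)); it equals 1 under independence and grows as violations of A -> B become rarer than chance would predict. A toy sketch (data and names invented):

def conviction(antecedent, consequent, transactions):
    """Conviction of the rule A -> B: P(A)P(not B) / P(A and not B)."""
    n = len(transactions)
    p_a = sum(antecedent <= t for t in transactions) / n
    p_not_b = sum(not consequent <= t for t in transactions) / n
    p_a_not_b = sum(antecedent <= t and not consequent <= t for t in transactions) / n
    if p_a_not_b == 0:
        return float("inf")  # the rule is never violated
    return p_a * p_not_b / p_a_not_b

transactions = [{"a", "b"}, {"a", "b"}, {"a"}, {"b"}, {"c"}]
print(conviction({"a"}, {"b"}, transactions))  # 0.6 * 0.4 / 0.2 = 1.2
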
A.7
Mannila, Heikki, Hannu Toivonen, and A Inkeri Verkamo (1994). Efficient algorithms for discovering association rules. In Proceedings of the KDD Workshop, 181–192. AAAI Press. URL: https://www.aaai.org/Papers/Workshops/1994/WS-94-03/WS94-03-016.pdf. PDF. BibTeX.
@inproceedings{mannila_efficient_1994,
    author = {Mannila, Heikki and Toivonen, Hannu and Verkamo, A Inkeri},
    title = {Efficient Algorithms for Discovering Association Rules},
    url = {https://www.aaai.org/Papers/Workshops/1994/WS-94-03/WS94-03-016.pdf},
    pages = {181--192},
    booktitle = {Proceedings of the {KDD} Workshop},
    publisher = {{AAAI} Press},
    year = {1994}
}
Abstract.
Association rules are statements of the form "for 90 % of the rows of the relation, if the row has value 1 in the columns in set W, then it has 1 also in column B". Agrawal, Imielinski, and Swami introduced the problem of mining association rules from large collections of data, and gave a method based on successive passes over the database. We give an improved algorithm for the problem. The method is based on careful combinatorial analysis of the information obtained in previous passes; this makes it possible to eliminate unnecessary candidate rules. Experiments on a university course enrollment database indicate that the method outperforms the previous one by a factor of 5. We also show that sampling is in general a very efficient way of finding such rules.

Selecting itemsets

0.1
Gallo, Arianna, Tijl De Bie, and Nello Cristianini (2007). MINI: mining informative non-redundant itemsets. In Proceedings of the 11th European Conference on Principles and Practice of Knowledge Discovery in Databases, PKDD '07, Lecture Notes in Computer Science, 438–445. Springer, Berlin, Heidelberg. doi:10.1007/978-3-540-74976-9_44. PDF. BibTeX.
@inproceedings{gallo_mini_2007,
    author = {Gallo, Arianna and Bie, Tijl De and Cristianini, Nello},
    title = {{MINI}: Mining Informative Non-redundant Itemsets},
    doi = {10.1007/978-3-540-74976-9_44},
    series = {Lecture Notes in Computer Science},
    pages = {438--445},
    booktitle = {Proceedings of the 11th European Conference on Principles and Practice of Knowledge Discovery in Databases, {PKDD} '07},
    publisher = {Springer, Berlin, Heidelberg},
    year = {2007}
}
Abstract.
Frequent itemset mining assists the data mining practitioner in searching for strongly associated items (and transactions) in large transaction databases. Since the number of frequent itemsets is usually extremely large and unmanageable for a human user, recent works have sought to define condensed representations of them, e.g. closed or maximal frequent itemsets. We argue that not only do these methods often still fall short in sufficiently reducing the output size, but they also output many redundant itemsets. In this paper we propose a philosophically new approach that resolves both these issues in a computationally tractable way. We present and empirically validate a statistically founded approach called MINI, to compress the set of frequent itemsets down to a list of informative and non-redundant itemsets.
0.2
Hämäläinen, Wilhelmiina and Matti Nykänen (2008). Efficient discovery of statistically significant association rules. In Proceedings of the 8th IEEE International Conference on Data Mining, ICDM '08, 203–212. IEEE Computer Society. doi:10.1109/ICDM.2008.144. PDF. BibTeX.
@inproceedings{hamalainen_efficient_2008,
    author = {Hämäläinen, Wilhelmiina and Nykänen, Matti},
    title = {Efficient Discovery of Statistically Significant Association Rules},
    doi = {10.1109/ICDM.2008.144},
    pages = {203--212},
    booktitle = {Proceedings of the 8th {IEEE} International Conference on Data Mining, {ICDM} '08},
    publisher = {{IEEE} Computer Society},
    year = {2008}
}
Abstract.
Searching statistically significant association rules is an important but neglected problem. Traditional association rules do not capture the idea of statistical dependence and the resulting rules can be spurious, while the most significant rules may be missing. This leads to erroneous models and predictions which often become expensive. The problem is computationally very difficult, because the significance is not a monotonic property. However, in this paper we prove several other properties, which can be used for pruning the search space. The properties are implemented in the StatApriori algorithm, which searches statistically significant, non-redundant association rules. Based on both theoretical and empirical observations, the resulting rules are very accurate compared to traditional association rules. In addition, StatApriori can work with extremely low frequencies, thus finding new interesting rules.
0.3
Webb, Geoffrey I. (2006). Discovering significant rules. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '06, 434–443. ACM Press. doi:10.1145/1150402.1150451. PDF. BibTeX.
@inproceedings{webb_discovering_2006,
    author = {Webb, Geoffrey I.},
    title = {Discovering significant rules},
    doi = {10.1145/1150402.1150451},
    pages = {434--443},
    booktitle = {Proceedings of the 12th {ACM} {SIGKDD} international conference on Knowledge discovery and data mining, {KDD} '06},
    publisher = {{ACM} Press},
    year = {2006}
}
Abstract.
In many applications, association rules will only be interesting if they represent non-trivial correlations between all constituent items. Numerous techniques have been developed that seek to avoid false discoveries. However, while all provide useful solutions to aspects of this problem, none provides a generic solution that is both flexible enough to accommodate varying definitions of true and false discoveries and powerful enough to provide strict control over the risk of false discoveries. This paper presents generic techniques that allow definitions of true and false discoveries to be specified in terms of arbitrary statistical hypothesis tests and which provide strict control over the experimentwise risk of false discoveries.

I) Related formalisms

a) Skypatterns

Ia.1
Börzsönyi, Stephan, Donald Kossmann, and Konrad Stocker (2001). The skyline operator. In Proceedings of the 17th International Conference on Data Engineering, ICDE '01, 421–430. IEEE Computer Society. doi:10.1109/ICDE.2001.914855. PDF. BibTeX.
@inproceedings{borzsonyi_skyline_2001,
    author = {Börzsönyi, Stephan and Kossmann, Donald and Stocker, Konrad},
    title = {The Skyline operator},
    doi = {10.1109/ICDE.2001.914855},
    pages = {421--430},
    booktitle = {Proceedings of the 17th International Conference on Data Engineering, {ICDE} '01},
    publisher = {{IEEE} Computer Society},
    year = {2001}
}
Abstract.
We propose to extend database systems by a Skyline operation. This operation filters out a set of interesting points from a potentially large set of data points. A point is interesting if it is not dominated by any other point. For example, a hotel might be interesting for somebody traveling to Nassau if no other hotel is both cheaper and closer to the beach. We show how SQL can be extended to pose Skyline queries, present and evaluate alternative algorithms to implement the Skyline operation, and show how this operation can be combined with other database operations, e.g., join.
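
The dominance test is the whole definition: a point survives if no other point is at least as good in every dimension and strictly better in at least one. A naive quadratic-time sketch (the paper is about far more efficient evaluation strategies); smaller-is-better is assumed in both dimensions, as in the hotel example:

def dominates(p, q):
    """p dominates q: no worse in every dimension, strictly better in at least one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    """Naive O(n^2) skyline: keep points dominated by no other point."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

hotels = [(50, 800), (60, 300), (80, 100), (90, 120), (120, 50)]  # (price, metres to beach)
print(skyline(hotels))  # (90, 120) is dropped: (80, 100) is cheaper and closer
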
Ia.2
Soulet, Arnaud et al. (2011). Mining dominant patterns in the sky. In Proceedings of the 11th IEEE International Conference on Data Mining, ICDM '11, 655–664. IEEE Computer Society. doi:10.1109/ICDM.2011.100. PDF. BibTeX.
@inproceedings{soulet_mining_2011,
    author = {Soulet, Arnaud and Raïssi, Chedy and Plantevit, Marc and Crémilleux, Bruno},
    title = {Mining Dominant Patterns in the Sky},
    doi = {10.1109/ICDM.2011.100},
    pages = {655--664},
    booktitle = {Proceedings of the 11th {IEEE} International Conference on Data Mining, {ICDM} '11},
    publisher = {{IEEE} Computer Society},
    year = {2011}
}
Abstract.
Pattern discovery is at the core of numerous data mining tasks. Although many methods focus on efficiency in pattern mining, they still suffer from the problem of choosing a threshold that influences the final extraction result. The goal of our study is to make the results of pattern mining useful from a user-preference point of view. To this end, we integrate into the pattern discovery process the idea of skyline queries in order to mine skyline patterns in a threshold-free manner. Because the skyline patterns satisfy a formal property of dominations, they not only have a global interest but also have semantics that are easily understood by the user. In this work, we first establish theoretical relationships between pattern condensed representations and skyline pattern mining. We also show that it is possible to compute automatically a subset of measures involved in the user query which allows the patterns to be condensed and thus facilitates the computation of the skyline patterns. This forms the basis for a novel approach to mining skyline patterns. We illustrate the efficiency of our approach over several data sets including a use case from chemoinformatics and show that small sets of dominant patterns are produced under various measures.
Ia.3
Ugarte, Willy et al. (2014). Computing skypattern cubes. In Proceedings of the 21st European Conference on Artificial Intelligence, ECAI '14, volume 263 of Frontiers in Artificial Intelligence and Applications, 903–908. IOS Press. doi:10.3233/978-1-61499-419-0-903. PDF. BibTeX.
@inproceedings{ugarte_computing_2014,
    author = {Ugarte, Willy and Boizumault, Patrice and Loudni, Samir and Crémilleux, Bruno},
    title = {Computing Skypattern Cubes},
    volume = {263},
    doi = {10.3233/978-1-61499-419-0-903},
    series = {Frontiers in Artificial Intelligence and Applications},
    pages = {903--908},
    booktitle = {Proceedings of the 21st European Conference on Artificial Intelligence, {ECAI} '14},
    publisher = {{IOS} Press},
    year = {2014}
}
Abstract.
We introduce skypattern cubes and propose an efficient bottom-up approach to compute them. Our approach relies on derivation rules collecting skypatterns of a parent node from its child nodes without any dominance test. Non-derivable skypatterns are computed on the fly thanks to Dynamic CSP. The bottom-up principle enables to provide a concise representation of the cube based on skypattern equivalence classes without any supplementary effort. Experiments show the effectiveness of our proposal.

b) Constraint programming (CP)

Ib.1
Crémilleux, Bruno and Arnaud Soulet (2008). Discovering knowledge from local patterns with global constraints. In Proceedings of the International Conference on Computational Science and Its Applications, ICCSA '08, Lecture Notes in Computer Science, 1242–1257. Springer. doi:10.1007/978-3-540-69848-7_99. PDF. BibTeX.
@inproceedings{cremilleux_discovering_2008,
    author = {Crémilleux, Bruno and Soulet, Arnaud},
    title = {Discovering Knowledge from Local Patterns with Global Constraints},
    doi = {10.1007/978-3-540-69848-7_99},
    series = {Lecture Notes in Computer Science},
    pages = {1242--1257},
    booktitle = {Proceedings of the International Conference on Computational Science and Its Applications, {ICCSA} '08},
    publisher = {Springer},
    year = {2008}
}
Abstract.
It is well known that local patterns are at the core of a lot of knowledge which may be discovered from data. Nevertheless, use of local patterns is limited by their huge number and computational costs. Several approaches (e.g., condensed representations, pattern set discovery) aim at selecting or grouping local patterns to provide a global view of the data. In this paper, we propose the idea of global constraints to write queries addressing global patterns as sets of local patterns. Usefulness of global constraints is to take into account relationships between local patterns, such relations expressing a user bias according to its expectation (e.g., search of exceptions, top-k patterns). We think that global constraints are a powerful way to get meaningful patterns. We propose the generic Approximate-and-Push approach to mine patterns under global constraints and we give a method for the case of the top-k patterns w.r.t. any measure. Experiments show its efficiency since it was not feasible to mine such patterns beforehand.
Ib.2
Guns, Tias, Siegfried Nijssen, and Luc De Raedt (2013). K-pattern set mining under constraints. IEEE Transactions on Knowledge and Data Engineering (TKDE), 25(2):402–418. doi:10.1109/TKDE.2011.204. PDF. BibTeX.
@article{guns_k-pattern_2013,
    author = {Guns, Tias and Nijssen, Siegfried and De Raedt, Luc},
    title = {k-Pattern Set Mining under Constraints},
    volume = {25},
    doi = {10.1109/TKDE.2011.204},
    pages = {402--418},
    number = {2},
    journal = {{IEEE} Transactions on Knowledge and Data Engineering ({TKDE})},
    year = {2013}
}
Abstract.
We introduce the problem of k-pattern set mining, concerned with finding a set of k related patterns under constraints. This contrasts to regular pattern mining, where one searches for many individual patterns. The k-pattern set mining problem is a very general problem that can be instantiated to a wide variety of well-known mining tasks including concept-learning, rule-learning, redescription mining, conceptual clustering and tiling. To this end, we formulate a large number of constraints for use in k-pattern set mining, both at the local level, that is, on individual patterns, and on the global level, that is, on the overall pattern set. Building general solvers for the pattern set mining problem remains a challenge. Here, we investigate to what extent constraint programming (CP) can be used as a general solution strategy. We present a mapping of pattern set constraints to constraints currently available in CP. This allows us to investigate a large number of settings within a unified framework and to gain insight in the possibilities and limitations of these solvers. This is important as it allows us to create guidelines in how to model new problems successfully and how to model existing problems more efficiently. It also opens up the way for other solver technologies.
Ib.3
Järvisalo, Matti (2011). Itemset mining as a challenge application for answer set enumeration. In Proceedings of the International Conference on Logic Programming and Nonmonotonic Reasoning, LPNMR '11, Lecture Notes in Computer Science, 304–310. Springer. doi:10.1007/978-3-642-20895-9_35. PDF. BibTeX.
@inproceedings{jarvisalo_itemset_2011,
    author = {Järvisalo, Matti},
    title = {Itemset Mining as a Challenge Application for Answer Set Enumeration},
    doi = {10.1007/978-3-642-20895-9_35},
    series = {Lecture Notes in Computer Science},
    pages = {304--310},
    booktitle = {Proceedings of the International Conference on Logic Programming and Nonmonotonic Reasoning, {LPNMR} '11},
    publisher = {Springer},
    year = {2011}
}
Abstract.
We present an initial exploration into the possibilities of applying current state-of-the-art answer set programming (ASP) tools—esp. conflict-driven answer set enumeration—for mining itemsets in 0-1 data. We evaluate a simple ASP-based approach experimentally and compare it to a recently proposed framework exploiting constraint programming (CP) solvers for itemset mining.

c) Formal Concept Analysis (FCA)

Ic.1
Lakhal, Lotfi and Gerd Stumme (2005). Efficient mining of association rules based on formal concept analysis. In Formal Concept Analysis: Foundations and Applications, Lecture Notes in Computer Science, pages 180–195. Springer. doi:10.1007/11528784_10. PDF. BibTeX.
@incollection{lakhal_efficient_2005,
    author = {Lakhal, Lotfi and Stumme, Gerd},
    title = {Efficient Mining of Association Rules Based on Formal Concept Analysis},
    series = {Lecture Notes in Computer Science},
    pages = {180--195},
    booktitle = {Formal Concept Analysis: Foundations and Applications},
    publisher = {Springer},
    year = {2005},
    doi = {10.1007/11528784_10}
}
Abstract.
Association rules are a popular knowledge discovery technique for warehouse basket analysis. They indicate which items of the warehouse are frequently bought together. The problem of association rule mining has first been stated in 1993. Five years later, several research groups discovered that this problem has a strong connection to Formal Concept Analysis (FCA). In this survey, we will first introduce some basic ideas of this connection along a specific algorithm, Titanic, and show how FCA helps in reducing the number of resulting rules without loss of information, before giving a general overview over the history and state of the art of applying FCA for association rule mining.
Ic.2
Valtchev, Petko, Rokia Missaoui, and Robert Godin (2004). Formal concept analysis for knowledge discovery and data mining: the new challenges. In Proceedings of the International Conference on Formal Concept Analysis, ICFCA '04, Lecture Notes in Computer Science, 352–371. Springer. doi:10.1007/978-3-540-24651-0_30. PDF. BibTeX.
@inproceedings{valtchev_formal_2004,
    author = {Valtchev, Petko and Missaoui, Rokia and Godin, Robert},
    title = {Formal Concept Analysis for Knowledge Discovery and Data Mining: The New Challenges},
    doi = {10.1007/978-3-540-24651-0_30},
    series = {Lecture Notes in Computer Science},
    pages = {352--371},
    booktitle = {Proceedings of the International Conference on Formal Concept Analysis, {ICFCA} '04},
    publisher = {Springer},
    year = {2004}
}
Abstract.
Data mining (DM) is the extraction of regularities from raw data, which are further transformed within the wider process of knowledge discovery in databases (KDD) into non-trivial facts intended to support decision making. Formal concept analysis (FCA) offers an appropriate framework for KDD, whereby our focus here is on its potential for DM support. A variety of mining methods powered by FCA have been published and the figures grow steadily, especially in the association rule mining (ARM) field. However, an analysis of current ARM practices suggests the impact of FCA has not reached its limits, i.e., appropriate FCA-based techniques could successfully apply in a larger set of situations. As a first step in the projected FCA expansion, we discuss the existing ARM methods, provide a set of guidelines for the design of novel ones, and list some open algorithmic issues on the FCA side. As an illustration, we propose two on-line methods computing the minimal generators of a closure system.
Ic.3
Wille, Rudolf (1997). Conceptual graphs and formal concept analysis. In Proceedings of the International Conference on Conceptual Structures, ICCS '97, Lecture Notes in Computer Science, 290–303. Springer. doi:10.1007/BFb0027878. PDF. BibTeX.
@inproceedings{wille_conceptual_1997,
    author = {Wille, Rudolf},
    title = {Conceptual graphs and formal concept analysis},
    doi = {10.1007/BFb0027878},
    series = {Lecture Notes in Computer Science},
    pages = {290--303},
    booktitle = {Proceedings of the International Conference on Conceptual Structures, {ICCS} '97},
    publisher = {Springer},
    year = {1997}
}
Abstract.
It is shown how Conceptual Graphs and Formal Concept Analysis may be combined to obtain a formalization of Elementary Logic which is useful for knowledge representation and processing. For this, a translation of conceptual graphs to formal contexts and concept lattices is described through an example. Using a suitable mathematization of conceptual graphs, basics of a unified mathematical theory for Elementary Logic are proposed.

II) Problem variants

a) Contrast sets, Emerging Patterns and Subgroup discovery (SD)

IIa.1
Atzmueller, Martin and Frank Puppe (2006). SD-map – a fast algorithm for exhaustive subgroup discovery. In Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases, PKDD '06, Lecture Notes in Computer Science, 6–17. Springer. doi:10.1007/11871637_6. PDF. BibTeX.
@inproceedings{atzmueller_sd-map_2006,
    author = {Atzmueller, Martin and Puppe, Frank},
    title = {{SD}-Map – A Fast Algorithm for Exhaustive Subgroup Discovery},
    doi = {10.1007/11871637_6},
    series = {Lecture Notes in Computer Science},
    pages = {6--17},
    booktitle = {Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases, {PKDD} '06},
    publisher = {Springer},
    year = {2006}
}
Abstract.
In this paper we present the novel SD-Map algorithm for exhaustive but efficient subgroup discovery. SD-Map guarantees to identify all interesting subgroup patterns contained in a data set, in contrast to heuristic or sampling-based methods. The SD-Map algorithm utilizes the well-known FP-growth method for mining association rules with adaptations for the subgroup discovery task. We show how SD-Map can handle missing values, and provide an experimental evaluation of the performance of the algorithm using synthetic data.
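
Subgroup discovery scores a candidate description by how much the target distribution inside the subgroup deviates from the global one. A common quality function is weighted relative accuracy, WRAcc = (n_sub / N) * (p_sub - p_all); the sketch below computes it on toy 0/1 vectors (this is a generic subgroup-discovery quality measure, not necessarily the exact function used inside SD-Map):

def wracc(subgroup_mask, target_mask):
    """Weighted relative accuracy of a subgroup w.r.t. a binary target."""
    n = len(target_mask)
    n_sub = sum(subgroup_mask)
    if n_sub == 0:
        return 0.0
    p_sub = sum(s and t for s, t in zip(subgroup_mask, target_mask)) / n_sub
    p_all = sum(target_mask) / n
    return (n_sub / n) * (p_sub - p_all)

# 10 records: target rate 0.4 overall, 0.75 inside the subgroup.
subgroup = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
target   = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0]
print(wracc(subgroup, target))  # 0.4 * (0.75 - 0.4), about 0.14
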
IIa.2
Bay, Stephen D. and Michael J. Pazzani (1999). Detecting change in categorical data: mining contrast sets. In Proceedings of the 5th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '99, 302–306. ACM Press. doi:10.1145/312129.312263. PDF. BibTeX.
@inproceedings{bay_detecting_1999,
    author = {Bay, Stephen D. and Pazzani, Michael J.},
    title = {Detecting change in categorical data: mining contrast sets},
    doi = {10.1145/312129.312263},
    pages = {302--306},
    booktitle = {Proceedings of the 5th {ACM} {SIGKDD} international conference on Knowledge discovery and data mining, {KDD} '99},
    publisher = {{ACM} Press},
    year = {1999}
}
Abstract.
A fundamental task in data analysis is understanding the differences between several contrasting groups. These groups can represent different classes of objects, such as male or female students, or the same group over time, e.g. freshman students in 1993 versus 1998. We present the problem of mining contrast-sets: conjunctions of attributes and values that differ meaningfully in their distribution across groups. We provide an algorithm for mining contrast-sets as well as several pruning rules to reduce the computational complexity. Once the deviations are found, we post-process the results to present a subset that are surprising to the user given what we have already shown. We explicitly control the probability of Type I error (false positives) and guarantee a maximum error rate for the entire analysis by using Bonferroni corrections.
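
A single contrast-set candidate can be checked with a chi-square test of independence between group membership and pattern support, with a Bonferroni-style correction for the number of candidates examined. A minimal sketch of that idea (STUCCO itself adds pruning rules and a layered correction scheme; scipy is assumed to be available, and the numbers are invented):

from scipy.stats import chi2_contingency

def contrast_set_test(support_counts, group_sizes, n_candidates, alpha=0.05):
    """Chi-square test on the group-by-support table, Bonferroni-adjusted threshold."""
    table = [
        support_counts,                                        # records matching the pattern
        [g - s for g, s in zip(group_sizes, support_counts)],  # records not matching it
    ]
    chi2, p_value, dof, _ = chi2_contingency(table)
    return p_value, p_value < alpha / n_candidates

# A pattern matched by 60/200 records in one group vs. 30/200 in the other,
# with (say) 50 candidate contrast sets tested in total.
print(contrast_set_test([60, 30], [200, 200], n_candidates=50))
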
IIa.3
Dong, Guozhu and Jinyan Li (1999). Efficient mining of emerging patterns: discovering trends and differences. In Proceedings of the 5th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '99, 43–52. ACM Press. doi:10.1145/312129.312191. PDF. BibTeX.
@inproceedings{dong_efficient_1999,
    author = {Dong, Guozhu and Li, Jinyan},
    title = {Efficient mining of emerging patterns: discovering trends and differences},
    doi = {10.1145/312129.312191},
    pages = {43--52},
    booktitle = {Proceedings of the 5th {ACM} {SIGKDD} international conference on Knowledge discovery and data mining, {KDD} '99},
    publisher = {{ACM} Press},
    year = {1999}
}
Abstract.
We introduce a new kind of patterns, called emerging patterns (EPs), for knowledge discovery from databases. EPs are defined as itemsets whose supports increase significantly from one dataset to another. EPs can capture emerging trends in timestamped databases, or useful contrasts between data classes. EPs have been proven useful: we have used them to build very powerful classifiers, which are more accurate than C4.5 and CBA, for many datasets. We believe that EPs with low to medium support, such as 1%-20%, can give useful new insights and guidance to experts, in even “well understood” applications. The efficient mining of EPs is a challenging problem, since (i) the Apriori property no longer holds for EPs, and (ii) there are usually too many candidates for high dimensional databases or for small support thresholds such as 0.5%. Naive algorithms are too costly. To solve this problem, (a) we promote the description of large collections of itemsets using their concise borders (the pair of sets of the minimal and of the maximal itemsets in the collections). (b) We design EP mining algorithms which manipulate only borders of collections (especially using our multiborder-differential algorithm), and which represent discovered EPs using borders. All EPs satisfying a constraint can be efficiently discovered by our border-based algorithms, which take the borders, derived by Max-Miner, of large itemsets as inputs. In our experiments on large and high dimensional datasets including the US census and Mushroom datasets, many EPs, including some with large cardinality, are found quickly. We also give other algorithms for discovering general or special types of EPs.
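
The defining quantity of an emerging pattern is its growth rate: the ratio of its support in one dataset to its support in the other, with a pattern declared emerging when the ratio exceeds a threshold. A direct sketch of that definition (toy data; the paper's border-based algorithms exist precisely to avoid enumerating patterns one by one like this):

def support(itemset, transactions):
    return sum(itemset <= t for t in transactions) / len(transactions)

def growth_rate(itemset, d1, d2):
    """Support ratio of `itemset` from dataset d1 to dataset d2."""
    s1, s2 = support(itemset, d1), support(itemset, d2)
    if s1 == 0:
        return float("inf") if s2 > 0 else 0.0
    return s2 / s1

d_old = [{"a", "b"}, {"a"}, {"b"}, {"c"}]
d_new = [{"a", "b"}, {"a", "b"}, {"a", "b", "c"}, {"b"}]
print(growth_rate({"a", "b"}, d_old, d_new))  # support rises from 0.25 to 0.75 -> 3.0
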

b) Pattern Aided Regression (PXR) and Exceptional Model Mining (EMM)

IIb.1
Dong, Guozhu and Vahid Taslimitehrani (2015). Pattern-aided regression modeling and prediction model analysis. IEEE Transactions on Knowledge and Data Engineering (TKDE), 27(9):2452–2465. doi:10.1109/TKDE.2015.2411609. PDF. BibTeX.
@article{dong_pattern-aided_2015,
    author = {Dong, Guozhu and Taslimitehrani, Vahid},
    title = {Pattern-Aided Regression Modeling and Prediction Model Analysis},
    volume = {27},
    doi = {10.1109/TKDE.2015.2411609},
    pages = {2452--2465},
    number = {9},
    journal = {{IEEE} Transactions on Knowledge and Data Engineering ({TKDE})},
    year = {2015}
}
Abstract.
This paper first introduces pattern aided regression (PXR) models, a new type of regression models designed to represent accurate and interpretable prediction models. This was motivated by two observations: (1) Regression modeling applications often involve complex diverse predictor-response relationships, which occur when the optimal regression models (of given regression model type) fitting two or more distinct logical groups of data are highly different. (2) State-of-the-art regression methods are often unable to adequately model such relationships. This paper defines PXR models using several patterns and local regression models, which respectively serve as logical and behavioral characterizations of distinct predictor-response relationships. The paper also introduces a contrast pattern aided regression (CPXR) method, to build accurate PXR models. In experiments, the PXR models built by CPXR are very accurate in general, often outperforming state-of-the-art regression methods by big margins. Usually using (a) around seven simple patterns and (b) linear local regression models, those PXR models are easy to interpret; in fact, their complexity is just a bit higher than that of (piecewise) linear regression models and is significantly lower than that of traditional ensemble based regression models. CPXR is especially effective for high-dimensional data. The paper also discusses how to use CPXR methodology for analyzing prediction models and correcting their prediction errors.
IIb.2
Duivesteijn, Wouter, Ad Feelders, and Arno Knobbe (2012). Different slopes for different folks: mining for exceptional regression models with Cook's distance. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '12, 868–876. ACM Press. doi:10.1145/2339530.2339668. PDF. BibTeX.
@inproceedings{duivesteijn_different_2012,
    author = {Duivesteijn, Wouter and Feelders, Ad and Knobbe, Arno},
    title = {Different slopes for different folks: mining for exceptional regression models with cook's distance},
    doi = {10.1145/2339530.2339668},
    pages = {868--876},
    booktitle = {Proceedings of the 18th {ACM} {SIGKDD} international conference on Knowledge discovery and data mining, {KDD} '12},
    publisher = {{ACM} Press},
    year = {2012}
}
Abstract.
Exceptional Model Mining (EMM) is an exploratory data analysis technique that can be regarded as a generalization of subgroup discovery. In EMM we look for subgroups of the data for which a model fitted to the subgroup differs substantially from the same model fitted to the entire dataset. In this paper we develop methods to mine for exceptional regression models. We propose a measure for the exceptionality of regression models (Cook's distance), and explore the possibilities to avoid having to fit the regression model to each candidate subgroup. The algorithm is evaluated on a number of real life datasets. These datasets are also used to illustrate the results of the algorithm. We find interesting subgroups with deviating models on datasets from several different domains. We also show that under certain circumstances one can forego fitting regression models on up to 40% of the subgroups, and these 40% are the relatively expensive regression models to compute.
IIb.3
Leman, Dennis, Ad Feelders, and Arno Knobbe (2008). Exceptional model mining. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECML PKDD '08, Lecture Notes in Computer Science, 1–16. Springer. doi:10.1007/978-3-540-87481-2_1. PDF. BibTeX.
@inproceedings{leman_exceptional_2008,
    author = {Leman, Dennis and Feelders, Ad and Knobbe, Arno},
    title = {Exceptional Model Mining},
    doi = {10.1007/978-3-540-87481-2_1},
    series = {Lecture Notes in Computer Science},
    pages = {1--16},
    booktitle = {Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, {ECML} {PKDD} '08},
    publisher = {Springer},
    year = {2008}
}
Abstract.
In most databases, it is possible to identify small partitions of the data where the observed distribution is notably different from that of the database as a whole. In classical subgroup discovery, one considers the distribution of a single nominal attribute, and exceptional subgroups show a surprising increase in the occurrence of one of its values. In this paper, we introduce Exceptional Model Mining (EMM), a framework that allows for more complicated target concepts. Rather than finding subgroups based on the distribution of a single target attribute, EMM finds subgroups where a model fitted to that subgroup is somehow exceptional. We discuss regression as well as classification models, and define quality measures that determine how exceptional a given model on a subgroup is. Our framework is general enough to be applied to many types of models, even from other paradigms such as association analysis and graphical modeling.
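
As a concrete instance of the idea, the sketch below fits a simple linear regression to the whole dataset and to a candidate subgroup and uses the absolute slope difference as the exceptionality score (one natural quality measure for a regression target, not the only one discussed; numpy is assumed and the data are synthetic):

import numpy as np

def slope(x, y):
    """Least-squares slope of y on x."""
    return np.polyfit(x, y, 1)[0]

def slope_difference(x, y, subgroup_mask):
    """Exceptionality of a subgroup: |slope inside subgroup - slope on all data|."""
    mask = np.asarray(subgroup_mask, dtype=bool)
    return abs(slope(x[mask], y[mask]) - slope(x, y))

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
in_subgroup = x > 7                     # candidate subgroup description: "x > 7"
y = 2.0 * x + rng.normal(size=200)
y[in_subgroup] += 3.0 * x[in_subgroup]  # the slope changes inside the subgroup
print(slope_difference(x, y, in_subgroup))
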

c) Redescription mining

IIc.1
Galbrun, Esther and Pauli Miettinen (2011). From black and white to full colour: extending redescription mining outside the Boolean world. In Proceedings of the 2011 SIAM International Conference on Data Mining, SDM '11, 546–557. SIAM. doi:10.1137/1.9781611972818.47. PDF. BibTeX.
@inproceedings{galbrun_black_2011,
    author = {Galbrun, Esther and Miettinen, Pauli},
    title = {From Black and White to Full Colour: Extending Redescription Mining Outside the Boolean World},
    doi = {10.1137/1.9781611972818.47},
    pages = {546--557},
    booktitle = {Proceedings of the 2011 {SIAM} International Conference on Data Mining, {SDM} '11},
    publisher = {{SIAM}},
    year = {2011}
}
Abstract.
Given a 0–1 dataset, we consider the redescription mining task introduced by Ramakrishnan, Parida, and Zaki. The problem is to find subsets of the rows that can be (approximately) defined by at least two different Boolean formulae on the attributes. That is, we search for pairs (α, β) of Boolean formulae such that the implications α → β and β → α both hold with high accuracy. We require that the two descriptions α and β are syntactically sufficiently different. Such pairs of descriptions indicate that the subset has different definitions, a fact that gives useful information about the data. We give simple algorithms for this task, and evaluate their performance. The methods are based on pruning the search space of all possible pairs of formulae by different accuracy criteria. The significance of the findings is tested by using randomization methods. Experimental results on simulated and real data show that the methods work well: on simulated data they find the planted subsets, and on real data they produce small and understandable results.
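
In this line of work the accuracy of a redescription (α, β) is typically the Jaccard similarity of the two support sets, i.e. how closely the rows satisfying α coincide with those satisfying β. A small sketch (toy rows; the attribute names and queries are invented):

def support_set(query, rows):
    """Indices of the rows satisfying a Boolean query."""
    return {i for i, row in enumerate(rows) if query(row)}

def jaccard_accuracy(alpha, beta, rows):
    """Size of the intersection over size of the union of the two support sets."""
    a, b = support_set(alpha, rows), support_set(beta, rows)
    return len(a & b) / len(a | b) if (a | b) else 0.0

rows = [
    {"aquatic": 1, "has_fins": 1, "lays_eggs": 1},
    {"aquatic": 1, "has_fins": 1, "lays_eggs": 1},
    {"aquatic": 0, "has_fins": 0, "lays_eggs": 1},
    {"aquatic": 1, "has_fins": 0, "lays_eggs": 0},
]
alpha = lambda r: r["aquatic"] == 1 and r["lays_eggs"] == 1  # one vocabulary
beta = lambda r: r["has_fins"] == 1                          # a syntactically different one
print(jaccard_accuracy(alpha, beta, rows))  # 1.0: both descriptions pick out rows 0 and 1
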
IIc.2
Parida, Laxmi and Naren Ramakrishnan (2005). Redescription mining: structure theory and algorithms. In Proceedings of the 20th National Conference on Artificial Intelligence and the 7th Innovative Applications of Artificial Intelligence Conference, AAAI '05, 837–844. AAAI Press / The MIT Press. URL: https://www.aaai.org/Papers/AAAI/2005/AAAI05-132.pdf. PDF. BibTeX.
@inproceedings{parida_redescription_2005,
    author = {Parida, Laxmi and Ramakrishnan, Naren},
    title = {Redescription Mining: Structure Theory and Algorithms},
    url = {https://www.aaai.org/Papers/AAAI/2005/AAAI05-132.pdf},
    pages = {837--844},
    booktitle = {Proceedings of the 20th National Conference on Artificial Intelligence and the 7th Innovative Applications of Artificial Intelligence Conference, {AAAI} '05},
    publisher = {{AAAI} Press / The {MIT} Press},
    year = {2005}
}
Abstract.
We introduce a new data mining problem—redescription mining—that unifies considerations of conceptual clustering, constructive induction, and logical formula discovery. Redescription mining begins with a collection of sets, views it as a propositional vocabulary, and identifies clusters of data that can be defined in at least two ways using this vocabulary. The primary contributions of this paper are conceptual and theoretical: (i) we formally study the space of redescriptions underlying a dataset and characterize their intrinsic structure, (ii) we identify impossibility as well as strong possibility results about when mining redescriptions is feasible, (iii) we present several scenarios of how we can custom-build redescription mining solutions for various biases, and (iv) we outline how many problems studied in the larger machine learning community are really special cases of redescription mining. By highlighting its broad scope and relevance, we aim to establish the importance of redescription mining and make the case for a thrust in this new line of research.
IIc.3
Ramakrishnan, Naren et al. (2004). Turning CARTwheels: an alternating algorithm for mining redescriptions. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '04, 266–275. ACM Press. doi:10.1145/1014052.1014083. PDF. BibTeX.
@inproceedings{ramakrishnan_turning_2004,
    author = {Ramakrishnan, Naren and Kumar, Deept and Mishra, Bud and Potts, Malcolm and Helm, Richard F},
    title = {Turning {CARTwheels}: an alternating algorithm for mining redescriptions},
    doi = {10.1145/1014052.1014083},
    pages = {266--275},
    booktitle = {Proceedings of the 10th {ACM} {SIGKDD} International Conference on Knowledge Discovery and Data Mining, {KDD} '04},
    publisher = {{ACM} Press},
    year = {2004}
}
Abstract.
We present an unusual algorithm involving classification trees—CARTwheels—where two trees are grown in opposite directions so that they are joined at their leaves. This approach finds application in a new data mining task we formulate, called redescription mining. A redescription is a shift-of-vocabulary, or a different way of communicating information about a given subset of data; the goal of redescription mining is to find subsets of data that afford multiple descriptions. We highlight the importance of this problem in domains such as bioinformatics, which exhibit an underlying richness and diversity of data descriptors (e.g., genes can be studied in a variety of ways). CARTwheels exploits the duality between class partitions and path partitions in an induced classification tree to model and mine redescriptions. It helps integrate multiple forms of characterizing datasets, situates the knowledge gained from one dataset in the context of others, and harnesses high-level abstractions for uncovering cryptic and subtle features of data. Algorithm design decisions, implementation details, and experimental results are presented.

III) Advanced pattern evaluation techniques

a) Randomization

IIIa.1
Gionis, Aristides et al. (2007). Assessing data mining results via swap randomization. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(3):14. doi:10.1145/1297332.1297338. PDF. BibTeX.
@article{gionis_assessing_2007,
    author = {Gionis, Aristides and Mannila, Heikki and Mielikäinen, Taneli and Tsaparas, Panayiotis},
    title = {Assessing data mining results via swap randomization},
    volume = {1},
    doi = {10.1145/1297332.1297338},
    pages = {14},
    number = {3},
    journal = {{ACM} Transactions on Knowledge Discovery from Data ({TKDD})},
    year = {2007}
}
Abstract.
The problem of assessing the significance of data mining results on high-dimensional 0--1 datasets has been studied extensively in the literature. For problems such as mining frequent sets and finding correlations, significance testing can be done by standard statistical tests such as chi-square, or other methods. However, the results of such tests depend only on the specific attributes and not on the dataset as a whole. Moreover, the tests are difficult to apply to sets of patterns or other complex results of data mining algorithms. In this article, we consider a simple randomization technique that deals with this shortcoming. The approach consists of producing random datasets that have the same row and column margins as the given dataset, computing the results of interest on the randomized instances and comparing them to the results on the actual data. This randomization technique can be used to assess the results of many different types of data mining algorithms, such as frequent sets, clustering, and spectral analysis. To generate random datasets with given margins, we use variations of a Markov chain approach which is based on a simple swap operation. We give theoretical results on the efficiency of different randomization methods, and apply the swap randomization method to several well-known datasets. Our results indicate that for some datasets the structure discovered by the data mining algorithms is expected, given the row and column margins of the datasets, while for other datasets the discovered structure conveys information that is not captured by the margin counts.
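
The swap operation itself is simple: pick two rows and two columns whose 2x2 submatrix is a "checkerboard" ([[1,0],[0,1]] or [[0,1],[1,0]]) and flip it; row and column sums are unchanged. A minimal sketch of such a swap-randomization chain (numpy assumed; the swap count is an arbitrary example, whereas the paper studies how long the chain must run and how to compute empirical p-values from the randomized copies):

import numpy as np

def swap_randomize(matrix, n_swaps, seed=0):
    """Randomize a 0-1 matrix by checkerboard swaps, preserving row and column sums."""
    m = matrix.copy()
    rng = np.random.default_rng(seed)
    n_rows, n_cols = m.shape
    done = 0
    while done < n_swaps:
        r1, r2 = rng.choice(n_rows, size=2, replace=False)
        c1, c2 = rng.choice(n_cols, size=2, replace=False)
        if m[r1, c1] == m[r2, c2] and m[r1, c2] == m[r2, c1] and m[r1, c1] != m[r1, c2]:
            m[r1, c1], m[r1, c2] = m[r1, c2], m[r1, c1]  # flip the checkerboard
            m[r2, c1], m[r2, c2] = m[r2, c2], m[r2, c1]
            done += 1
    return m

data = np.array([[1, 1, 0, 0],
                 [1, 0, 1, 0],
                 [0, 1, 0, 1],
                 [0, 0, 1, 1]])
randomized = swap_randomize(data, n_swaps=100)
assert (randomized.sum(axis=0) == data.sum(axis=0)).all()  # column margins preserved
assert (randomized.sum(axis=1) == data.sum(axis=1)).all()  # row margins preserved
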
IIIa.2
Hanhijärvi, Sami et al. (2009). Tell me something i don't know: randomization strategies for iterative data mining. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '09, 379–388. ACM Press. doi:10.1145/1557019.1557065. PDF. BibTeX.
@inproceedings{hanhijarvi_tell_2009,
    author = {Hanhijärvi, Sami and Ojala, Markus and Vuokko, Niko and Puolamäki, Kai and Tatti, Nikolaj and Mannila, Heikki},
    title = {Tell Me Something I Don't Know: Randomization Strategies for Iterative Data Mining},
    doi = {10.1145/1557019.1557065},
    pages = {379--388},
    booktitle = {Proceedings of the 15th {ACM} {SIGKDD} International Conference on Knowledge Discovery and Data Mining, {KDD} '09},
    publisher = {{ACM} Press},
    year = {2009}
}
Abstract.
There is a wide variety of data mining methods available, and it is generally useful in exploratory data analysis to use many different methods for the same dataset. This, however, leads to the problem of whether the results found by one method are a reflection of the phenomenon shown by the results of another method, or whether the results depict in some sense unrelated properties of the data. For example, using clustering can give indication of a clear cluster structure, and computing correlations between variables can show that there are many significant correlations in the data. However, it can be the case that the correlations are actually determined by the cluster structure. In this paper, we consider the problem of randomizing data so that previously discovered patterns or models are taken into account. The randomization methods can be used in iterative data mining. At each step in the data mining process, the randomization produces random samples from the set of data matrices satisfying the already discovered patterns or models. That is, given a data set and some statistics (e.g., cluster centers or co-occurrence counts) of the data, the randomization methods sample data sets having similar values of the given statistics as the original data set. We use Metropolis sampling based on local swaps to achieve this. We describe experiments on real data that demonstrate the usefulness of our approach. Our results indicate that in many cases, the results of, e.g., clustering actually imply the results of, say, frequent pattern discovery.
IIIa.3
Strona, Giovanni et al. (2014). A fast and unbiased procedure to randomize ecological binary matrices with fixed row and column totals. Nature Communications, 5(1):4114. doi:10.1038/ncomms5114. PDF. BibTeX.
@article{strona_fast_2014,
    author = {Strona, Giovanni and Nappo, Domenico and Boccacci, Francesco and Fattorini, Simone and San-Miguel-Ayanz, Jesus},
    title = {A fast and unbiased procedure to randomize ecological binary matrices with fixed row and column totals},
    volume = {5},
    doi = {10.1038/ncomms5114},
    pages = {4114},
    number = {1},
    journal = {Nature Communications},
    year = {2014}
}
Abstract.
A well-known problem in numerical ecology is how to recombine presence-absence matrices without altering row and column totals. A few solutions have been proposed, but all of them present some issues in terms of statistical robustness (that is, their capability to generate different matrix configurations with the same probability) and their performance (that is, the computational effort that they require to generate a null matrix). Here we introduce the ‘Curveball algorithm’, a new procedure that differs from existing methods in that it focuses rather on matrix information content than on matrix structure. We demonstrate that the algorithm can sample uniformly the set of all possible matrix configurations requiring a computational effort orders of magnitude lower than that required by available methods, making it possible to easily randomize matrices larger than 10^8 cells.
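
The core "trade" step of the Curveball algorithm works on rows stored as sets of column indices: two rows keep the columns they share and randomly re-divide the columns that belong to exactly one of them, which preserves every row and column total. A minimal sketch following that description (toy matrix; the number of trades is an arbitrary example):

import random

def curveball(presence, n_trades, seed=0):
    """Randomize a presence-absence matrix given as a list of sets of column indices."""
    rng = random.Random(seed)
    rows = [set(r) for r in presence]
    for _ in range(n_trades):
        i, j = rng.sample(range(len(rows)), 2)
        a, b = rows[i], rows[j]
        shared = a & b
        tradeable = list((a | b) - shared)  # columns held by exactly one of the two rows
        rng.shuffle(tradeable)
        k = len(a) - len(shared)            # row i keeps its original total
        rows[i] = shared | set(tradeable[:k])
        rows[j] = shared | set(tradeable[k:])
    return rows

presence = [{0, 1, 2}, {1, 3}, {2, 3, 4}, {0, 4}]
randomized = curveball(presence, n_trades=1000)
assert [len(r) for r in randomized] == [len(r) for r in presence]  # row totals kept
assert sorted(c for r in randomized for c in r) == sorted(c for r in presence for c in r)  # column totals kept
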

b) Minimum Description Length principle

IIIb.1
Chakrabarti, Deepayan et al. (2004). Fully automatic cross-associations. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '04, 79–88. ACM Press. doi:10.1145/1014052.1014064. PDF. BibTeX.
@inproceedings{chakrabarti_fully_2004,
    author = {Chakrabarti, Deepayan and Papadimitriou, Spiros and Modha, Dharmendra S. and Faloutsos, Christos},
    title = {Fully Automatic Cross-associations},
    doi = {10.1145/1014052.1014064},
    pages = {79--88},
    booktitle = {Proceedings of the 10th {ACM} {SIGKDD} International Conference on Knowledge Discovery and Data Mining, {KDD} '04},
    publisher = {{ACM} Press},
    year = {2004}
}
Abstract.
Large, sparse binary matrices arise in numerous data mining applications, such as the analysis of market baskets, web graphs, social networks, co-citations, as well as information retrieval, collaborative filtering, sparse matrix reordering, etc. Virtually all popular methods for the analysis of such matrices—e.g., k-means clustering, METIS graph partitioning, SVD/PCA and frequent itemset mining—require the user to specify various parameters, such as the number of clusters, number of principal components, number of partitions, and "support." Choosing suitable values for such parameters is a challenging problem. Cross-association is a joint decomposition of a binary matrix into disjoint row and column groups such that the rectangular intersections of groups are homogeneous. Starting from first principles, we furnish a clear, information-theoretic criterion to choose a good cross-association as well as its parameters, namely, the number of row and column groups. We provide scalable algorithms to approach the optimal. Our algorithm is parameter-free, and requires no user intervention. In practice it scales linearly with the problem size, and is thus applicable to very large matrices. Finally, we present experiments on multiple synthetic and real-life datasets, where our method gives high-quality, intuitive results.
IIIb.2
Siebes, Arno (2014). MDL in pattern mining: a brief introduction to Krimp. In Proceedings of the international conference on Formal Concept Analysis, FCA '14, Lecture Notes in Computer Science, 37–43. Springer International Publishing. doi:10.1007/978-3-319-07248-7_3. PDF. BibTeX.
@inproceedings{siebes_mdl_2014,
    author = {Siebes, Arno},
    title = {{MDL} in Pattern Mining A Brief Introduction to Krimp},
    doi = {10.1007/978-3-319-07248-7_3},
    series = {Lecture Notes in Computer Science},
    pages = {37--43},
    booktitle = {Proceedings of the international conference on Formal Concept Analysis, {FCA} '14},
    publisher = {Springer International Publishing},
    year = {2014}
}
Abstract.
In this short paper we sketch a brief introduction to our Krimp algorithm. Moreover, we briefly discuss some of the large body of follow up research. Pointers to the relevant papers are provided in the bibliography.
IIIb.3
Vreeken, Jilles, Matthijs Leeuwen, and Arno Siebes (2011). Krimp: mining itemsets that compress. Data Mining and Knowledge Discovery, 23(1):169–214. doi:10.1007/s10618-010-0202-x. PDF. BibTeX.
@article{vreeken_krimp_2011,
    author = {Vreeken, Jilles and Leeuwen, Matthijs and Siebes, Arno},
    title = {Krimp: mining itemsets that compress},
    volume = {23},
    doi = {10.1007/s10618-010-0202-x},
    pages = {169--214},
    number = {1},
    journal = {Data Mining and Knowledge Discovery},
    year = {2011}
}
Abstract.
One of the major problems in pattern mining is the explosion of the number of results. Tight constraints reveal only common knowledge, while loose constraints lead to an explosion in the number of returned patterns. This is caused by large groups of patterns essentially describing the same set of transactions. In this paper we approach this problem using the MDL principle: the best set of patterns is that set that compresses the database best. For this task we introduce the Krimp algorithm. Experimental evaluation shows that typically only hundreds of itemsets are returned; a dramatic reduction, up to seven orders of magnitude, in the number of frequent item sets. These selections, called code tables, are of high quality. This is shown with compression ratios, swap-randomisation, and the accuracies of the code table-based Krimp classifier, all obtained on a wide range of datasets. Further, we extensively evaluate the heuristic choices made in the design of the algorithm.
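
The scoring idea is easy to sketch: cover each transaction greedily with code-table itemsets and charge each use a Shannon-optimal code length of -log2(usage / total usage). The snippet below is a much-simplified sketch of just that data-encoding cost (it ignores the encoded size of the code table itself and Krimp's whole candidate-selection search; the data and code tables are toy examples):

from math import log2

def cover(transaction, code_table):
    """Greedily cover a transaction with non-overlapping code-table itemsets."""
    remaining, used = set(transaction), []
    for itemset in code_table:  # the code table is assumed sorted in the desired cover order
        if itemset <= remaining:
            used.append(itemset)
            remaining -= itemset
    return used

def encoded_length(transactions, code_table):
    """Bits needed to encode the database with the code table (data part only)."""
    usage = {itemset: 0 for itemset in code_table}
    for t in transactions:
        for itemset in cover(t, code_table):
            usage[itemset] += 1
    total = sum(usage.values())
    return sum(-u * log2(u / total) for u in usage.values() if u > 0)

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "c"}, {"c"}]
singletons = [frozenset([i]) for i in "abc"]
print(encoded_length(transactions, singletons))                      # singleton baseline
print(encoded_length(transactions, [frozenset("ab")] + singletons))  # adding {a, b} compresses better
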

c) Subjective Interestingness and Maximum Entropy models

IIIc.1
De Bie, Tijl (2011a). An information theoretic framework for data mining. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '11, 564–572. ACM Press. doi:10.1145/2020408.2020497. PDF. BibTeX.
@inproceedings{de_bie_information_2011,
    author = {De Bie, Tijl},
    title = {An information theoretic framework for data mining},
    doi = {10.1145/2020408.2020497},
    pages = {564--572},
    booktitle = {Proceedings of the 17th {ACM} {SIGKDD} international conference on Knowledge discovery and data mining, {KDD} '11},
    publisher = {{ACM} Press},
    year = {2011}
}
Abstract.
We formalize the data mining process as a process of information exchange, defined by the following key components. The data miner's state of mind is modeled as a probability distribution, called the background distribution, which represents the uncertainty and misconceptions the data miner has about the data. This model initially incorporates any prior (possibly incorrect) beliefs a data miner has about the data. During the data mining process, properties of the data (to which we refer as patterns) are revealed to the data miner, either in batch, one by one, or even interactively. This acquisition of information in the data mining process is formalized by updates to the background distribution to account for the presence of the found patterns. The proposed framework can be motivated using concepts from information theory and game theory. Understanding it from this perspective, it is easy to see how it can be extended to more sophisticated settings, e.g. where patterns are probabilistic functions of the data (thus allowing one to account for noise and errors in the data mining process, and allowing one to study data mining techniques based on subsampling the data). The framework then models the data mining process using concepts from information geometry, and I-projections in particular. The framework can be used to help in designing new data mining algorithms that maximize the efficiency of the information exchange from the algorithm to the data miner.
IIIc.2
De Bie, Tijl (2011b). Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Mining and Knowledge Discovery, 23(3):407–446. doi:10.1007/s10618-010-0209-3. PDF. BibTeX.
@article{de_bie_maximum_2011,
    author = {De Bie, Tijl},
    title = {Maximum entropy models and subjective interestingness: an application to tiles in binary databases},
    volume = {23},
    doi = {10.1007/s10618-010-0209-3},
    pages = {407--446},
    number = {3},
    journal = {Data Mining and Knowledge Discovery},
    year = {2011}
}
Abstract.
Recent research has highlighted the practical benefits of subjective interestingness measures, which quantify the novelty or unexpectedness of a pattern when contrasted with any prior information of the data miner (Silberschatz and Tuzhilin, Proceedings of the 1st ACM SIGKDD international conference on Knowledge discovery and data mining (KDD95), 1995; Geng and Hamilton, ACM Comput Surv 38(3):9, 2006). A key challenge here is the formalization of this prior information in a way that lends itself to the definition of a subjective interestingness measure that is both meaningful and practical. In this paper, we outline a general strategy of how this could be achieved, before working out the details for a use case that is important in its own right. Our general strategy is based on considering prior information as constraints on a probabilistic model representing the uncertainty about the data. More specifically, we represent the prior information by the maximum entropy (MaxEnt) distribution subject to these constraints. We briefly outline various measures that could subsequently be used to contrast patterns with this MaxEnt model, thus quantifying their subjective interestingness. We demonstrate this strategy for rectangular databases with knowledge of the row and column sums. This situation has been considered before using computation intensive approaches based on swap randomizations, allowing for the computation of empirical p-values as interestingness measures (Gionis et al., ACM Trans Knowl Discov Data 1(3):14, 2007). We show how the MaxEnt model can be computed remarkably efficiently in this situation, and how it can be used for the same purpose as swap randomizations but computationally more efficiently. More importantly, being an explicitly represented distribution, the MaxEnt model can additionally be used to define analytically computable interestingness measures, as we demonstrate for tiles (Geerts et al., Proceedings of the 7th international conference on Discovery science (DS04), 2004) in binary databases.
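A rough sketch of the tile use case: assume the MaxEnt model for a binary database with given expected row and column sums factorizes into independent Bernoulli cells with success probability sigmoid(lam_i + mu_j), fit the parameters by plain gradient ascent, and score an all-ones tile by its self-information under the model. The names, data and crude fitting loop are hypothetical, not the paper's algorithm.
import numpy as np

def fit_maxent(D, iters=2000, lr=0.1):
    # Fit p_ij = sigmoid(lam_i + mu_j) so that the expected row and column
    # sums match the observed ones (crude gradient ascent on the log-likelihood).
    n, m = D.shape
    row_sums, col_sums = D.sum(axis=1), D.sum(axis=0)
    lam, mu = np.zeros(n), np.zeros(m)
    for _ in range(iters):
        P = 1.0 / (1.0 + np.exp(-(lam[:, None] + mu[None, :])))
        lam += lr * (row_sums - P.sum(axis=1)) / m
        mu += lr * (col_sums - P.sum(axis=0)) / n
    return P

def tile_information(P, rows, cols):
    # Bits needed to state that the tile rows x cols is all ones under the model.
    return -np.log2(P[np.ix_(rows, cols)]).sum()

rng = np.random.default_rng(0)
D = (rng.random((8, 6)) < 0.4).astype(int)
D[:4, :3] = 1                                  # plant a dense tile
P = fit_maxent(D)
print(tile_information(P, range(4), range(3)))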
IIIc.3
Tatti, Nikolaj (2007). Maximum entropy based significance of itemsets. In Proceedings of the 7th IEEE International Conference on Data Mining, ICDM '07, ICDM '07, 312–321. IEEE Computer Society. doi:10.1109/ICDM.2007.43. PDF. BibTeX.
@inproceedings{tatti_maximum_2007,
    author = {Tatti, Nikolaj},
    title = {Maximum Entropy Based Significance of Itemsets},
    doi = {10.1109/ICDM.2007.43},
    series = {{ICDM} '07},
    pages = {312--321},
    booktitle = {Proceedings of the 7th {IEEE} International Conference on Data Mining, {ICDM} '07},
    publisher = {{IEEE} Computer Society},
    year = {2007}
}
Abstract.
We consider the problem of defining the significance of an itemset. We say that the itemset is significant if we are surprised by its frequency when compared to the frequencies of its sub-itemsets. In other words, we estimate the frequency of the itemset from the frequencies of its sub-itemsets and compute the deviation between the real value and the estimate. For the estimation we use Maximum Entropy and for measuring the deviation we use Kullback-Leibler divergence. A major advantage compared to the previous methods is that we are able to use richer models whereas the previous approaches only measure the deviation from the independence model. We show that our measure of significance goes to zero for derivable itemsets and that we can use the rank as a statistical test. Our empirical results demonstrate that for our real datasets the independence assumption is too strong but applying more flexible models leads to good results.
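The simplest special case of this idea uses only singleton frequencies, i.e. the independence model that the paper generalizes: estimate the itemset frequency as the product of its item frequencies and measure surprise as n times the KL divergence between the observed and estimated frequencies. The numbers below are hypothetical.
from math import log

def kl_bernoulli(p, q):
    # KL divergence between Bernoulli(p) and Bernoulli(q), in nats.
    return p * log(p / q) + (1 - p) * log((1 - p) / (1 - q))

def surprise(observed_fr, singleton_frs, n):
    estimate = 1.0
    for fr in singleton_frs:
        estimate *= fr          # independence estimate of the itemset frequency
    return n * kl_bernoulli(observed_fr, estimate)

# 10000 transactions; items a and b each occur in 40% of them, but the pair
# {a, b} occurs in 30%, far more than the 16% expected under independence.
print(surprise(0.30, [0.4, 0.4], 10000))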

Further reading

1
Bay, Stephen D. and Michael J. Pazzani (2001). Detecting group differences: mining contrast sets. Data Mining and Knowledge Discovery, 5(3):213–246. doi:10.1023/A:1011429418057. PDF. BibTeX.
@article{bay_detecting_2001,
    author = {Bay, Stephen D. and Pazzani, Michael J.},
    title = {Detecting Group Differences: Mining Contrast Sets},
    volume = {5},
    doi = {10.1023/A:1011429418057},
    pages = {213--246},
    number = {3},
    journal = {Data Mining and Knowledge Discovery},
    year = {2001}
}
Abstract.
A fundamental task in data analysis is understanding the differences between several contrasting groups. These groups can represent different classes of objects, such as male or female students, or the same group over time, e.g. freshman students in 1993 through 1998. We present the problem of mining contrast sets: conjunctions of attributes and values that differ meaningfully in their distribution across groups. We provide a search algorithm for mining contrast sets with pruning rules that drastically reduce the computational complexity. Once the contrast sets are found, we post-process the results to present a subset that are surprising to the user given what we have already shown. We explicitly control the probability of Type I error (false positives) and guarantee a maximum error rate for the entire analysis by using Bonferroni corrections.
Topics: contrast sets, IIa.
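A hypothetical sketch of the core test: compare the support of one candidate contrast set (a conjunction of attribute=value conditions) in two groups with a chi-square test, and accept it only if it survives a Bonferroni-adjusted threshold. The records and the number of tests are made up; the paper's contribution is the pruned search over all such conjunctions.
from scipy.stats import chi2_contingency

def matches(record, contrast_set):
    return all(record.get(attr) == value for attr, value in contrast_set.items())

def contrast_test(group_a, group_b, contrast_set, n_tests, alpha=0.05):
    # Support of the conjunction in each group, plus a chi-square test on the
    # 2x2 table of (matches, does not match) x (group A, group B).
    hits_a = sum(matches(r, contrast_set) for r in group_a)
    hits_b = sum(matches(r, contrast_set) for r in group_b)
    table = [[hits_a, len(group_a) - hits_a],
             [hits_b, len(group_b) - hits_b]]
    chi2, p, _, _ = chi2_contingency(table)
    significant = p < alpha / n_tests      # Bonferroni correction over all tests
    return hits_a / len(group_a), hits_b / len(group_b), p, significant

group_a = [{"degree": "cs", "sex": "m"}] * 30 + [{"degree": "bio", "sex": "f"}] * 70
group_b = [{"degree": "cs", "sex": "m"}] * 60 + [{"degree": "bio", "sex": "f"}] * 40
print(contrast_test(group_a, group_b, {"degree": "cs", "sex": "m"}, n_tests=50))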
2
De Raedt, Luc (2002). A perspective on inductive databases. ACM SIGKDD Explorations Newsletter, 4(2):69–77. doi:10.1145/772862.772871. PDF. BibTeX.
@article{de_raedt_perspective_2002,
    author = {De Raedt, Luc},
    title = {A perspective on inductive databases},
    volume = {4},
    doi = {10.1145/772862.772871},
    pages = {69--77},
    number = {2},
    journal = {{ACM} {SIGKDD} Explorations Newsletter},
    year = {2002}
}
Abstract.
Inductive databases tightly integrate databases with data mining. The key ideas are that data and patterns (or models) are handled in the same way and that an inductive query language allows the user to query and manipulate the patterns (or models) of interest. This paper proposes a simple and abstract model for inductive databases. We describe the basic formalism, a simple but fairly powerful inductive query language, some basics of reasoning for query optimization, and discuss some memory organization and implementation issues.
Topics: constraints.
3
De Raedt, Luc, Tias Guns, and Siegfried Nijssen (2008). Constraint programming for itemset mining. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '08, 204–212. ACM Press. doi:10.1145/1401890.1401919. PDF. BibTeX.
@inproceedings{de_raedt_constraint_2008,
    author = {De Raedt, Luc and Guns, Tias and Nijssen, Siegfried},
    title = {Constraint programming for itemset mining},
    doi = {10.1145/1401890.1401919},
    pages = {204--212},
    booktitle = {Proceedings of the 14th {ACM} {SIGKDD} international conference on Knowledge discovery and data mining, {KDD} '08},
    publisher = {{ACM} Press},
    year = {2008}
}
Abstract.
The relationship between constraint-based mining and constraint programming is explored by showing how the typical constraints used in pattern mining can be formulated for use in constraint programming environments. The resulting framework is surprisingly flexible and allows us to combine a wide range of mining constraints in different ways. We implement this approach in off-the-shelf constraint programming systems and evaluate it empirically. The results show that the approach is not only very expressive, but also works well on complex benchmark problems.
Topics: constraints.
4
De Raedt, Luc and Albrecht Zimmermann (2007). Constraint-based pattern set mining. In Proceedings of the 2007 SIAM International Conference on Data Mining, SDM '07, 237–248. SIAM. doi:10.1137/1.9781611972771.22. PDF. BibTeX.
@inproceedings{de_raedt_constraint-based_2007,
    author = {De Raedt, Luc and Zimmermann, Albrecht},
    title = {Constraint-Based Pattern Set Mining},
    doi = {10.1137/1.9781611972771.22},
    pages = {237--248},
    booktitle = {Proceedings of the 2007 {SIAM} International Conference on Data Mining, {SDM} '07},
    publisher = {{SIAM}},
    year = {2007}
}
Abstract.
Local pattern mining algorithms generate sets of patterns, which are typically not directly useful and have to be further processed before actual application or interpretation. Rather than investigating each pattern individually at the local level, we propose to mine for global models directly. A global model is essentially a pattern set that is interpreted as a disjunction of these patterns. It becomes possible to specify constraints at the level of the pattern sets of interest. This idea leads to the development of a constraint-based mining and inductive querying approach for global pattern mining. We introduce various natural types of constraints, discuss their properties, and show how they can be used for pattern set mining. A key contribution is that we show how well-known properties from local pattern mining, such as monotonicity and anti-monotonicity, can be adapted for use in pattern set mining. This, in turn, then allows us to adapt existing algorithms for item-set mining to pattern set mining. Two algorithms are presented: one level-wise algorithm that mines for all pattern sets that satisfy a conjunction of a monotonic and an anti-monotonic constraint, and an algorithm that adds the capability of asking top-k queries. We also report on a case study regarding classification rule selection using this new technique.
Topics: constraints.
5
Downar, Lennart and Wouter Duivesteijn (2017). Exceptionally monotone models—the rank correlation model class for exceptional model mining. Knowledge and Information Systems, 51(2):369–394. doi:10.1007/s10115-016-0979-z. PDF. BibTeX.
@article{downar_exceptionally_2017,
    author = {Downar, Lennart and Duivesteijn, Wouter},
    title = {Exceptionally monotone models—the rank correlation model class for Exceptional Model Mining},
    volume = {51},
    doi = {10.1007/s10115-016-0979-z},
    pages = {369--394},
    number = {2},
    journal = {Knowledge and Information Systems},
    year = {2017}
}
Abstract.
Exceptional Model Mining strives to find coherent subgroups of the dataset where multiple target attributes interact in an unusual way. One instance of such an investigated form of interaction is Pearson’s correlation coefficient between two targets. EMM then finds subgroups with an exceptionally linear relation between the targets. In this paper, we enrich the EMM toolbox by developing the more general rank correlation model class. We find subgroups with an exceptionally monotone relation between the targets. Apart from catering for this richer set of relations, the rank correlation model class does not necessarily require the assumption of target normality, which is implicitly invoked in the Pearson’s correlation model class. Furthermore, it is less sensitive to outliers. We provide pseudocode for the employed algorithm and analyze its computational complexity, and experimentally illustrate what the rank correlation model class for EMM can find for you on six datasets from an eclectic variety of domains.
Topics: EMM, IIb.
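One way to picture the quality computation is the hypothetical sketch below: Spearman's rank correlation between the two targets on a candidate subgroup versus on the whole dataset, weighted by subgroup size. The square-root weighting is one simple choice for illustration, not necessarily the measure used in the paper.
import numpy as np
from scipy.stats import spearmanr

def rank_correlation_quality(x, y, subgroup_mask):
    rho_all, _ = spearmanr(x, y)                       # rank correlation on the whole data
    rho_sub, _ = spearmanr(x[subgroup_mask], y[subgroup_mask])
    weight = np.sqrt(subgroup_mask.mean())             # favour larger subgroups
    return weight * abs(rho_sub - rho_all)

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = x + rng.normal(scale=0.5, size=500)                # monotone increasing overall
mask = x > 1.0
y[mask] = -x[mask] + rng.normal(scale=0.5, size=mask.sum())  # relation flips in the subgroup
print(rank_correlation_quality(x, y, mask))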
6
Duivesteijn, Wouter, Ad Feelders, and Arno Knobbe (2016). Exceptional model mining. Data Mining and Knowledge Discovery, 30(1):47–98. doi:10.1007/s10618-015-0403-4. PDF. BibTeX.
@article{duivesteijn_exceptional_2016,
    author = {Duivesteijn, Wouter and Feelders, Ad and Knobbe, Arno},
    title = {Exceptional Model Mining},
    volume = {30},
    doi = {10.1007/s10618-015-0403-4},
    pages = {47--98},
    number = {1},
    journal = {Data Mining and Knowledge Discovery},
    year = {2016}
}
Abstract.
Finding subsets of a dataset that somehow deviate from the norm, i.e. where something interesting is going on, is a classical Data Mining task. In traditional local pattern mining methods, such deviations are measured in terms of a relatively high occurrence (frequent itemset mining), or an unusual distribution for one designated target attribute (common use of subgroup discovery). These, however, do not encompass all forms of “interesting”. To capture a more general notion of interestingness in subsets of a dataset, we develop Exceptional Model Mining (EMM). This is a supervised local pattern mining framework, where several target attributes are selected, and a model over these targets is chosen to be the target concept. Then, we strive to find subgroups: subsets of the dataset that can be described by a few conditions on single attributes. Such subgroups are deemed interesting when the model over the targets on the subgroup is substantially different from the model on the whole dataset. For instance, we can find subgroups where two target attributes have an unusual correlation, a classifier has a deviating predictive performance, or a Bayesian network fitted on several target attributes has an exceptional structure. We give an algorithmic solution for the EMM framework, and analyze its computational complexity. We also discuss some illustrative applications of EMM instances, including using the Bayesian network model to identify meteorological conditions under which food chains are displaced, and using a regression model to find the subset of households in the Chinese province of Hunan that do not follow the general economic law of demand.
Topics: EMM, IIb.
7
Ferré, Sébastien and Olivier Ridoux (2000). A logical generalization of formal concept analysis. In Proceedings of the International Conference on Conceptual Structures, ICCS '00, Lecture Notes in Computer Science, 371–384. Springer. doi:10.1007/10722280_26. PDF. BibTeX.
@inproceedings{ferre_logical_2000,
    author = {Ferré, Sébastien and Ridoux, Olivier},
    title = {A Logical Generalization of Formal Concept Analysis},
    doi = {10.1007/10722280_26},
    series = {Lecture Notes in Computer Science},
    pages = {371--384},
    booktitle = {Proceedings of the International Conference on Conceptual Structures, {ICCS} '00},
    publisher = {Springer},
    year = {2000}
}
Abstract.
We propose a generalization of Formal Concept Analysis (FCA) in which sets of attributes are replaced by expressions of an almost arbitrary logic. We prove that all FCA can be reconstructed on this basis. We show that from any logic that is used in place of sets of attributes can be derived a contextualized logic that takes into account the formal context and that is isomorphic to the concept lattice. We then justify the generalization of FCA compared with existing extensions and in the perspective of its application to information systems.
Topics: FCA.
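For reference, the classical FCA derivation operators that this work generalizes fit in a few lines over a toy formal context; the context and names below are hypothetical.
# Formal context: objects mapped to the attributes they have.
context = {
    "duck":  {"flies", "swims", "has_feathers"},
    "eagle": {"flies", "has_feathers"},
    "trout": {"swims"},
}

def common_attributes(objects):
    # A' : attributes shared by every object in the set.
    sets = [context[o] for o in objects]
    return set.intersection(*sets) if sets else set.union(*context.values())

def common_objects(attributes):
    # B' : objects having all the attributes in the set.
    return {o for o, attrs in context.items() if attributes <= attrs}

def closure(attributes):
    # B'' : the largest attribute set shared by exactly the objects having B.
    return common_attributes(common_objects(attributes))

print(common_objects({"flies"}))     # {'duck', 'eagle'}
print(closure({"flies"}))            # {'flies', 'has_feathers'}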
8
Galbrun, Esther (2022). The minimum description length principle for pattern mining: a survey. Data Mining and Knowledge Discovery. doi:10.1007/s10618-022-00846-z. PDF. BibTeX.
@article{galbrun_minimum_2022,
    author = {Galbrun, Esther},
    title = {The minimum description length principle for pattern mining: a survey},
    doi = {10.1007/s10618-022-00846-z},
    journal = {Data Mining and Knowledge Discovery},
    year = {2022}
}
Abstract.
Mining patterns is a core task in data analysis and, beyond issues of efficient enumeration, the selection of patterns constitutes a major challenge. The Minimum Description Length (MDL) principle, a model selection method grounded in information theory, has been applied to pattern mining with the aim to obtain compact high-quality sets of patterns. After giving an outline of relevant concepts from information theory and coding, we review MDL-based methods for mining different kinds of patterns from various types of data. Finally, we open a discussion on some issues regarding these methods.
Topics: MDL principle.
9
Galbrun, Esther and Pauli Miettinen (2012). From black and white to full color: extending redescription mining outside the boolean world. Statistical Analysis and Data Mining, 5(4):284–303. doi:10.1002/sam.11145. PDF. BibTeX.
@article{galbrun_black_2012,
    author = {Galbrun, Esther and Miettinen, Pauli},
    title = {From black and white to full color: Extending redescription mining outside the Boolean world},
    volume = {5},
    doi = {10.1002/sam.11145},
    pages = {284--303},
    number = {4},
    journal = {Statistical Analysis and Data Mining},
    year = {2012}
}
Abstract.
Redescription mining is a powerful data analysis tool that is used to find multiple descriptions of the same entities. Consider geographical regions as an example. They can be characterized by the fauna that inhabits them on one hand and by their meteorological conditions on the other hand. Finding such redescriptors, a task known as niche‐finding, is of much importance in biology. Current redescription mining methods cannot handle other than Boolean data. This restricts the range of possible applications or makes discretization a pre‐requisite, entailing a possibly harmful loss of information. In niche‐finding, while the fauna can be naturally represented using a Boolean presence/absence data, the weather cannot. In this paper, we extend redescription mining to categorical and real‐valued data with possibly missing values using a surprisingly simple and efficient approach. We provide extensive experimental evaluation to study the behavior of the proposed algorithm. Furthermore, we show the statistical significance of our results using recent innovations on randomization methods.
Topics: redescription mining.
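The basic scoring step of redescription mining can be sketched as follows, with hypothetical views and queries: the accuracy of a candidate pair of queries is the Jaccard index of their supports. Actual systems add significance testing and, as in this paper, support for categorical and real-valued variables and missing values.
import numpy as np

def support(view, predicate):
    # Indices of entities satisfying a predicate over one view (a 2D array).
    return {i for i, row in enumerate(view) if predicate(row)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

rng = np.random.default_rng(2)
fauna = rng.random((100, 3))     # view 1, e.g. species occurrence scores
climate = rng.random((100, 2))   # view 2, e.g. temperature and precipitation

left  = support(fauna,   lambda r: r[0] > 0.5 and r[2] > 0.5)
right = support(climate, lambda r: r[1] < 0.6)
print(jaccard(left, right))      # accuracy of the candidate redescription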
10
Galbrun, Esther and Pauli Miettinen (2017). Redescription mining: an overview. IEEE Intelligent Informatics Bulletin, 18(2):7–12. URL: http://www.comp.hkbu.edu.hk/~iib/2017/Dec/article2/iib_vol18no2_article2.pdf. PDF. BibTeX.
@article{galbrun_redescription_2017,
    author = {Galbrun, Esther and Miettinen, Pauli},
    title = {Redescription Mining: An Overview.},
    volume = {18},
    url = {http://www.comp.hkbu.edu.hk/~iib/2017/Dec/article2/iib_vol18no2_article2.pdf},
    pages = {7--12},
    number = {2},
    journal = {{IEEE} Intelligent Informatics Bulletin},
    year = {2017}
}
Abstract.
In many real-world data analysis tasks, we have different types of data over the same objects or entities, perhaps because the data originate from distinct sources or are based on different terminologies. In order to understand such data, an intuitive approach is to identify the correspondences that exist between these different aspects. This is the motivating principle behind redescription mining, a data analysis task that aims at finding distinct common characterizations of the same objects. This paper provides a short overview of redescription mining; what it is and how it is connected to other data analysis methods; the basic principles behind current algorithms for redescription mining; and examples and applications of redescription mining for real-world data analysis problems.
Topics: redescription mining.
11
Galbrun, Esther, Hui Tang, et al. (2021). Redescription mining for analyzing local limiting conditions: a case study on the biogeography of large mammals in China and southern Asia. Ecological Informatics, 63:101314. doi:10.1016/j.ecoinf.2021.101314. PDF. BibTeX.
@article{galbrun_redescription_2021,
    author = {Galbrun, Esther and Tang, Hui and Kaakinen, Anu and Žliobaitė, Indrė},
    title = {Redescription mining for analyzing local limiting conditions: A case study on the biogeography of large mammals in China and southern Asia},
    volume = {63},
    doi = {10.1016/j.ecoinf.2021.101314},
    pages = {101314},
    journal = {Ecological Informatics},
    year = {2021}
}
Abstract.
Identifying and understanding limiting conditions is at the centre of ecology and biogeography. Traditionally, associations between climate and occurrences of organisms are inferred from observational data using regression analysis, correlation analysis or clustering. Those methods extract patterns and relationships that hold throughout a dataset. We present a computational methodology called redescription mining, that emphasizes local patterns and associations that hold strongly on subsets of the dataset, instead. We aim to showcase the potential of this methodology for ecological and biogeographical studies, and encourage researchers to try it. Redescription mining can be used to identify associations between different descriptive views of the same system. It produces an ensemble of local models, that provide different perspectives over the system. Each model (redescription) consists of two sets of limiting conditions, over two different views, that hold locally. Limiting conditions, as well as the corresponding subregions, are identified automatically using data analysis algorithms. We explain how this methodology applies to a biogeographic case study focused on China and southern Asia. We consider dental traits of the large herbivorous mammals that occur there and climatic conditions as two aspects of this ecological system, and look for associations between them. Redescription mining can offer more refined inferences on the potential relation between variables describing different aspects of a system than classical methods. Thus, it permits different questions to be posed of the data, and can usefully complement classical methods in ecology and biogeography to uncover novel biogeographic patterns. A python package for carrying out redescription mining analysis is publicly available.
Topics: redescription mining.
12
Ganter, Bernhard, Sebastian Rudolph, and Gerd Stumme (2019). Explaining data with formal concept analysis. In Reasoning Web. Explainable Artificial Intelligence, Lecture Notes in Computer Science, pages 153–195. Springer International Publishing. doi:10.1007/978-3-030-31423-1_5. PDF. BibTeX.
@incollection{ganter_explaining_2019,
    author = {Ganter, Bernhard and Rudolph, Sebastian and Stumme, Gerd},
    title = {Explaining Data with Formal Concept Analysis},
    series = {Lecture Notes in Computer Science},
    pages = {153--195},
    booktitle = {Reasoning Web. Explainable Artificial Intelligence},
    publisher = {Springer International Publishing},
    year = {2019},
    doi = {10.1007/978-3-030-31423-1_5}
}
Abstract.
We give a brief introduction into Formal Concept Analysis, an approach to explaining data by means of lattice theory.
Topics: FCA.
13
Geng, Liqiang and Howard J. Hamilton (2006). Interestingness measures for data mining: a survey. ACM Computing Surveys (CSUR), 38(3):9. doi:10.1145/1132960.1132963. PDF. BibTeX.
@article{geng_interestingness_2006,
    author = {Geng, Liqiang and Hamilton, Howard J.},
    title = {Interestingness measures for data mining: A survey},
    volume = {38},
    doi = {10.1145/1132960.1132963},
    pages = {9},
    number = {3},
    journal = {{ACM} Computing Surveys ({CSUR})},
    year = {2006}
}
Abstract.
Interestingness measures play an important role in data mining, regardless of the kind of patterns being mined. These measures are intended for selecting and ranking patterns according to their potential interest to the user. Good measures also allow the time and space costs of the mining process to be reduced. This survey reviews the interestingness measures for rules and summaries, classifies them from several perspectives, compares their properties, identifies their roles in the data mining process, gives strategies for selecting appropriate measures for applications, and identifies opportunities for future research in this area.
Topics: interest measures.
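For illustration, a few of the classical objective measures covered by such surveys, computed from the four counts of a rule's contingency table; the counts below are hypothetical.
def rule_measures(n_xy, n_x, n_y, n):
    # Counts: transactions containing X and Y, containing X, containing Y, and total.
    support    = n_xy / n
    confidence = n_xy / n_x
    lift       = confidence / (n_y / n)
    leverage   = support - (n_x / n) * (n_y / n)
    return {"support": support, "confidence": confidence,
            "lift": lift, "leverage": leverage}

print(rule_measures(n_xy=400, n_x=1000, n_y=1500, n=10000))
# lift > 1 and positive leverage both indicate a positive dependency X -> Y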
14
Hämäläinen, Wilhelmiina (2010). StatApriori: an efficient algorithm for searching statistically significant association rules. Knowledge and Information Systems, 23(3):373–399. doi:10.1007/s10115-009-0229-8. PDF. BibTeX.
@article{hamalainen_statapriori_2010,
    author = {Hämäläinen, Wilhelmiina},
    title = {{StatApriori}: an efficient algorithm for searching statistically significant association rules},
    volume = {23},
    doi = {10.1007/s10115-009-0229-8},
    pages = {373--399},
    number = {3},
    journal = {Knowledge and Information Systems},
    year = {2010}
}
Abstract.
Searching statistically significant association rules is an important but neglected problem. Traditional association rules do not capture the idea of statistical dependence and the resulting rules can be spurious, while the most significant rules may be missing. This leads to erroneous models and predictions which often become expensive. The problem is computationally very difficult, because the significance is not a monotonic property. However, in this paper, we prove several other properties, which can be used for pruning the search space. The properties are implemented in the StatApriori algorithm, which searches statistically significant, non-redundant association rules. Empirical experiments have shown that StatApriori is very efficient, but at the same time it finds good quality rules.
Topics: interest measures.
15
Hämäläinen, Wilhelmiina (2012). Kingfisher: an efficient algorithm for searching for both positive and negative dependency rules with statistical significance measures. Knowledge and Information Systems, 32(2):383–414. doi:10.1007/s10115-011-0432-2. PDF. BibTeX.
@article{hamalainen_kingfisher_2012,
    author = {Hämäläinen, Wilhelmiina},
    title = {Kingfisher: an efficient algorithm for searching for both positive and negative dependency rules with statistical significance measures},
    volume = {32},
    doi = {10.1007/s10115-011-0432-2},
    pages = {383--414},
    number = {2},
    journal = {Knowledge and Information Systems},
    year = {2012}
}
Abstract.
Statistical dependency analysis is the basis of all empirical science. A commonly occurring problem is to find the most significant dependency rules, which describe either positive or negative dependencies between categorical attributes. In medical science, for example, one is interested in genetic factors, which can either predispose or prevent diseases. The requirement of statistical significance is essential, because the discoveries should hold also in future data. Typically, the significance is estimated either by Fisher’s exact test or the χ2-measure. The problem is computationally very difficult, because the number of all possible dependency rules increases exponentially with the number of attributes. As a solution, different kinds of restrictions and heuristics have been applied, but a general, scalable search method has been missing. In this paper, we introduce an efficient algorithm, called Kingfisher, for searching for the best non-redundant dependency rules with statistical significance measures. The rules can express either positive or negative dependencies between a set of positive attributes and a single consequent attribute. The algorithm itself is independent from the used goodness measure, but we concentrate on Fisher’s exact test and the χ2-measure. The algorithm is based on an application of the branch-and-bound search strategy, supplemented by several pruning properties. Especially, we prove a new lower bound for Fisher’s p and introduce a new effective pruning principle. According to our experiments on classical benchmark data, the algorithm is well scalable and can efficiently handle even dense and high-dimensional data sets. An interesting observation was that Fisher’s exact test did not only produce more reliable rules than the χ2-measure, but it also performed the search much faster.
Topics: interest measures.
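A hedged sketch of the scoring step only: the one-sided p-value of Fisher's exact test for a single positive dependency rule X -> A, computed from hypothetical counts with scipy. The algorithm's actual contribution is the branch-and-bound search and pruning built around such tests.
from scipy.stats import fisher_exact

def rule_p_value(n_xa, n_x, n_a, n):
    # One-sided p-value for a positive dependency between X and A,
    # from the 2x2 table over (X, not X) x (A, not A).
    table = [[n_xa,       n_x - n_xa],
             [n_a - n_xa, n - n_x - n_a + n_xa]]
    _, p = fisher_exact(table, alternative="greater")
    return p

print(rule_p_value(n_xa=90, n_x=100, n_a=500, n=1000))   # X strongly predicts A
print(rule_p_value(n_xa=55, n_x=100, n_a=500, n=1000))   # close to independence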
16
Hämäläinen, Wilhelmiina (2014). Assessing the statistical significance of association rules. arXiv:1405.1360 [cs]. URL: http://arxiv.org/abs/1405.1360. PDF. BibTeX.
@article{hamalainen_assessing_2014,
    author = {Hämäläinen, Wilhelmiina},
    title = {Assessing the statistical significance of association rules},
    url = {http://arxiv.org/abs/1405.1360},
    journal = {{arXiv}:1405.1360 [cs]},
    year = {2014}
}
Abstract.
An association rule is statistically significant if it has a small probability to occur by chance. It is well-known that the traditional frequency-confidence framework does not produce statistically significant rules. It can both accept spurious rules (type 1 error) and reject significant rules (type 2 error). The same problem concerns other commonly used interestingness measures and pruning heuristics. In this paper, we inspect the most common measure functions - frequency, confidence, degree of dependence, χ2, correlation coefficient, and J-measure - and redundancy reduction techniques. For each technique, we analyze whether it can make type 1 or type 2 error and the conditions under which the error occurs. In addition, we give new theoretical results which can be used to guide the search for statistically significant association rules.
Topics: interest measures.
17
Hämäläinen, Wilhelmiina and Geoffrey I. Webb (2018). A tutorial on statistically sound pattern discovery. Data Mining and Knowledge Discovery. doi:10.1007/s10618-018-0590-x. PDF. BibTeX.
@article{hamalainen_tutorial_2018,
    author = {Hämäläinen, Wilhelmiina and Webb, Geoffrey I.},
    title = {A tutorial on statistically sound pattern discovery},
    doi = {10.1007/s10618-018-0590-x},
    journal = {Data Mining and Knowledge Discovery},
    year = {2018}
}
Abstract.
Statistically sound pattern discovery harnesses the rigour of statistical hypothesis testing to overcome many of the issues that have hampered standard data mining approaches to pattern discovery. Most importantly, application of appropriate statistical tests allows precise control over the risk of false discoveries—patterns that are found in the sample data but do not hold in the wider population from which the sample was drawn. Statistical tests can also be applied to filter out patterns that are unlikely to be useful, removing uninformative variations of the key patterns in the data. This tutorial introduces the key statistical and data mining theory and techniques that underpin this fast developing field. We concentrate on two general classes of patterns: dependency rules that express statistical dependencies between condition and consequent parts and dependency sets that express mutual dependence between set elements. We clarify alternative interpretations of statistical dependence and introduce appropriate tests for evaluating statistical significance of patterns in different situations. We also introduce special techniques for controlling the likelihood of spurious discoveries when multitudes of patterns are evaluated. The paper is aimed at a wide variety of audiences. It provides the necessary statistical background and summary of the state-of-the-art for any data mining researcher or practitioner wishing to enter or understand statistically sound pattern discovery research or practice. It can serve as a general introduction to the field of statistically sound pattern discovery for any reader with a general background in data sciences.
Topics: interest measures.
18
Hébert, Céline and Bruno Crémilleux (2007). A unified view of objective interestingness measures. In Proceedings of the International Workshop on Machine Learning and Data Mining in Pattern Recognition, MLDM '07, Lecture Notes in Computer Science, 533–547. Springer. doi:10.1007/978-3-540-73499-4_40. PDF. BibTeX.
@inproceedings{hebert_unified_2007,
    author = {Hébert, Céline and Crémilleux, Bruno},
    title = {A Unified View of Objective Interestingness Measures},
    doi = {10.1007/978-3-540-73499-4_40},
    series = {Lecture Notes in Computer Science},
    pages = {533--547},
    booktitle = {Proceedings of the International Workshop on Machine Learning and Data Mining in Pattern Recognition, {MLDM} '07},
    publisher = {Springer},
    year = {2007}
}
Abstract.
Association rule mining often results in an overwhelming number of rules. In practice, it is difficult for the final user to select the most relevant rules. In order to tackle this problem, various interestingness measures were proposed. Nevertheless, the choice of an appropriate measure remains a hard task and the use of several measures may lead to conflicting information. In this paper, we give a unified view of objective interestingness measures. We define a new framework embedding a large set of measures called SBMs and we prove that the SBMs have a similar behavior. Furthermore, we identify the whole collection of the rules simultaneously optimizing all the SBMs. We provide an algorithm to efficiently mine a reduced set of rules among the rules optimizing all the SBMs. Experiments on real datasets highlight the characteristics of such rules.
Topics: interest measures.
19
Herrera, Francisco et al. (2011). An overview on subgroup discovery: foundations and applications. Knowledge and Information Systems, 29(3):495–525. doi:10.1007/s10115-010-0356-2. PDF. BibTeX.
@article{herrera_overview_2011,
    author = {Herrera, Francisco and Carmona, Cristóbal José and González, Pedro and del Jesus, María José},
    title = {An overview on subgroup discovery: foundations and applications},
    volume = {29},
    doi = {10.1007/s10115-010-0356-2},
    pages = {495--525},
    number = {3},
    journal = {Knowledge and Information Systems},
    year = {2011}
}
Abstract.
Subgroup discovery is a data mining technique which extracts interesting rules with respect to a target variable. An important characteristic of this task is the combination of predictive and descriptive induction. An overview related to the task of subgroup discovery is presented. This review focuses on the foundations, algorithms, and advanced studies together with the applications of subgroup discovery presented throughout the specialised bibliography.
Topics: SD, IIa.
20
Kaytoue, Mehdi, Sergei O Kuznetsov, and Amedeo Napoli (2011). Revisiting numerical pattern mining with formal concept analysis. In Proceedings of the 22nd International Joint Conference on Artificial Intelligence, IJCAI '11, 1342–1347. IJCAI/AAAI. URL: http://ijcai.org/papers11/Papers/IJCAI11-227.pdf. PDF. BibTeX.
@inproceedings{kaytoue_revisiting_2011,
    author = {Kaytoue, Mehdi and Kuznetsov, Sergei O and Napoli, Amedeo},
    title = {Revisiting Numerical Pattern Mining with Formal Concept Analysis},
    url = {http://ijcai.org/papers11/Papers/IJCAI11-227.pdf},
    pages = {1342--1347},
    booktitle = {Proceedings of the 22nd International Joint Conference on Artificial Intelligence, {IJCAI} '11},
    publisher = {{IJCAI}/{AAAI}},
    year = {2011}
}
Abstract.
We investigate the problem of mining numerical data with Formal Concept Analysis. The usual way is to use a scaling procedure (transforming numerical attributes into binary ones), leading either to a loss of information or of efficiency, in particular w.r.t. the volume of extracted patterns. By contrast, we propose to directly work on numerical data in a more precise and efficient way. For that, the notions of closed patterns, generators and equivalence classes are revisited in the numerical context. Moreover, two algorithms are proposed and tested in an evaluation involving real-world data, showing the quality of the present approach.
Topics: FCA.
21
Khiari, Mehdi, Patrice Boizumault, and Bruno Crémilleux (2010). Constraint programming for mining n-ary patterns. In Proceedings International Conference on Principles and Practice of Constraint Programming, CP '10, Lecture Notes in Computer Science, 552–567. Springer. doi:10.1007/978-3-642-15396-9_44. PDF. BibTeX.
@inproceedings{khiari_constraint_2010,
    author = {Khiari, Mehdi and Boizumault, Patrice and Crémilleux, Bruno},
    title = {Constraint Programming for Mining n-ary Patterns},
    doi = {10.1007/978-3-642-15396-9_44},
    series = {Lecture Notes in Computer Science},
    pages = {552--567},
    booktitle = {Proceedings International Conference on Principles and Practice of Constraint Programming, {CP} '10},
    publisher = {Springer},
    year = {2010}
}
Abstract.
The aim of this paper is to model and mine patterns combining several local patterns (n-ary patterns). First, the user expresses his/her query under constraints involving n-ary patterns. Second, a constraint solver generates the correct and complete set of solutions. This approach makes it possible to model, in a flexible way, sets of constraints combining several local patterns, and it leads to the discovery of higher-level patterns. Experiments show the feasibility and the interest of our approach.
Topics: constraints.
22
Kralj, Petra et al. (2007). Contrast set mining through subgroup discovery applied to brain ischaemia data. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD '07, Lecture Notes in Computer Science, 579–586. Springer. doi:10.1007/978-3-540-71701-0_61. PDF. BibTeX.
@inproceedings{kralj_contrast_2007,
    author = {Kralj, Petra and Lavrač, Nada and Gamberger, Dragan and Krstačić, Antonija},
    title = {Contrast Set Mining Through Subgroup Discovery Applied to Brain Ischaemia Data},
    doi = {10.1007/978-3-540-71701-0_61},
    series = {Lecture Notes in Computer Science},
    pages = {579--586},
    booktitle = {Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, {PAKDD} '07},
    publisher = {Springer},
    year = {2007}
}
Abstract.
Contrast set mining aims at finding differences between different groups. This paper shows that a contrast set mining task can be transformed to a subgroup discovery task whose goal is to find descriptions of groups of individuals with unusual distributional characteristics with respect to the given property of interest. The proposed approach to contrast set mining through subgroup discovery was successfully applied to the analysis of records of patients with brain stroke (confirmed by a positive CT test), in contrast with patients with other neurological symptoms and disorders (having normal CT test results). Detection of coexisting risk factors, as well as description of characteristic patient subpopulations are important outcomes of the analysis.
Topics: SD, IIa.
23
Kralj-Novak, Petra, Nada Lavrač, and Geoffrey I Webb (2009). Supervised descriptive rule discovery: a unifying survey of contrast set, emerging pattern and subgroup mining. Journal of Machine Learning Research, 10:377–403. URL: http://jmlr.org/papers/v10/kralj-novak09a.html. PDF. BibTeX.
@article{kralj-novak_supervised_2009,
    author = {Kralj-Novak, Petra and Lavrač, Nada and Webb, Geoffrey I},
    title = {Supervised Descriptive Rule Discovery: A Unifying Survey of Contrast Set, Emerging Pattern and Subgroup Mining},
    volume = {10},
    url = {http://jmlr.org/papers/v10/kralj-novak09a.html},
    pages = {377--403},
    journal = {Journal of Machine Learning Research},
    year = {2009}
}
Abstract.
This paper gives a survey of contrast set mining (CSM), emerging pattern mining (EPM), and subgroup discovery (SD) in a unifying framework named supervised descriptive rule discovery. While all these research areas aim at discovering patterns in the form of rules induced from labeled data, they use different terminology and task definitions, claim to have different goals, claim to use different rule learning heuristics, and use different means for selecting subsets of induced patterns. This paper contributes a novel understanding of these subareas of data mining by presenting a unified terminology, by explaining the apparent differences between the learning tasks as variants of a unique supervised descriptive rule discovery task and by exploring the apparent differences between the approaches. It also shows that various rule learning heuristics used in CSM, EPM and SD algorithms all aim at optimizing a trade off between rule coverage and precision. The commonalities (and differences) between the approaches are showcased on a selection of best known variants of CSM, EPM and SD algorithms. The paper also provides a critical survey of existing supervised descriptive rule discovery visualization methods.
Topics: IIa.
24
Lavrač, Nada, Peter Flach, and Blaz Zupan (1999). Rule evaluation measures: a unifying view. In Proceedings of the International Conference on Inductive Logic Programming, ILP '99, Lecture Notes in Computer Science, 174–185. Springer. doi:10.1007/3-540-48751-4_17. PDF. BibTeX.
@inproceedings{lavrac_rule_1999,
    author = {Lavrač, Nada and Flach, Peter and Zupan, Blaz},
    title = {Rule Evaluation Measures: A Unifying View},
    doi = {10.1007/3-540-48751-4_17},
    series = {Lecture Notes in Computer Science},
    pages = {174--185},
    booktitle = {Proceedings of the International Conference on Inductive Logic Programming, {ILP} '99},
    publisher = {Springer},
    year = {1999}
}
Abstract.
Numerous measures are used for performance evaluation in machine learning. In predictive knowledge discovery, the most frequently used measure is classification accuracy. With new tasks being addressed in knowledge discovery, new measures appear. In descriptive knowledge discovery, where induced rules are not primarily intended for classification, new measures used are novelty in clausal and subgroup discovery, and support and confidence in association rule learning. Additional measures are needed as many descriptive knowledge discovery tasks involve the induction of a large set of redundant rules and the problem is the ranking and filtering of the induced rule set. In this paper we develop a unifying view on some of the existing measures for predictive and descriptive induction. We provide a common terminology and notation by means of contingency tables. We demonstrate how to trade off these measures, by using what we call weighted relative accuracy. The paper furthermore demonstrates that many rule evaluation measures developed for predictive knowledge discovery can be adapted to descriptive knowledge discovery tasks.
Topics: interest measures.
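The weighted relative accuracy discussed here reduces to a one-line computation: rule coverage times the gain in class precision over the default class rate. The counts in the example are hypothetical.
def wracc(n_rule_and_class, n_rule, n_class, n):
    # Weighted relative accuracy of the rule "if condition then class".
    coverage = n_rule / n
    precision_gain = n_rule_and_class / n_rule - n_class / n
    return coverage * precision_gain

print(wracc(n_rule_and_class=80, n_rule=100, n_class=300, n=1000))  # 0.1 * (0.8 - 0.3) = 0.05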
25
van Leeuwen, Matthijs (2010). Maximal exceptions with minimal descriptions. Data Mining and Knowledge Discovery, 21(2):259–276. doi:10.1007/s10618-010-0187-5. PDF. BibTeX.
@article{van_leeuwen_maximal_2010,
    author = {van Leeuwen, Matthijs},
    title = {Maximal exceptions with minimal descriptions},
    volume = {21},
    doi = {10.1007/s10618-010-0187-5},
    pages = {259--276},
    number = {2},
    journal = {Data Mining and Knowledge Discovery},
    year = {2010}
}
Abstract.
We introduce a new approach to Exceptional Model Mining. Our algorithm, called EMDM, is an iterative method that alternates between Exception Maximisation and Description Minimisation. As a result, it finds maximally exceptional models with minimal descriptions. Exceptional Model Mining was recently introduced by Leman et al. (Exceptional model mining 1–16, 2008) as a generalisation of Subgroup Discovery. Instead of considering a single target attribute, it allows for multiple ‘model’ attributes on which models are fitted. If the model for a subgroup is substantially different from the model for the complete database, it is regarded as an exceptional model. To measure exceptionality, we propose two information-theoretic measures. One is based on the Kullback–Leibler divergence, the other on Krimp. We show how compression can be used for exception maximisation with these measures, and how classification can be used for description minimisation. Experiments show that our approach efficiently identifies subgroups that are both exceptional and interesting.
Topics: EMM, IIb.
26
Lijffijt, Jefrey et al. (2018). Subjectively interesting subgroup discovery on real-valued targets. In Proceedings of the 34th IEEE International Conference on Data Engineering, ICDE '18, 1352–1355. IEEE Computer Society. doi:10.1109/ICDE.2018.00148. PDF. BibTeX.
@inproceedings{lijffijt_subjectively_2018,
    author = {Lijffijt, Jefrey and Kang, Bo and Duivesteijn, Wouter and Puolamäki, Kai and Oikarinen, Emilia and De Bie, Tijl},
    title = {Subjectively Interesting Subgroup Discovery on Real-Valued Targets},
    doi = {10.1109/ICDE.2018.00148},
    pages = {1352--1355},
    booktitle = {Proceedings of the 34th {IEEE} International Conference on Data Engineering, {ICDE} '18},
    publisher = {{IEEE} Computer Society},
    year = {2018}
}
Abstract.
Deriving insights from high-dimensional data is one of the core problems in data mining. The difficulty mainly stems from the large number of variable combinations to potentially consider. Hence, an obvious question is whether we can automate the search for interesting patterns. Here, we consider the setting where a user wants to learn as efficiently as possible about real-valued attributes. We introduce a method to find subgroups in the data that are maximally informative (in the Information Theoretic sense) with respect to one or more real-valued target attributes. The succinct subgroup descriptions are in terms of arbitrarily-typed description attributes. The approach is based on the Subjective Interestingness framework FORSIED to use prior knowledge when mining most informative patterns.
Topics: SD, maximum entropy.
27
Magnani, Matteo et al. (2013). SkyView: a user evaluation of the skyline operator. In Proceedings of the 22nd ACM international conference on Information & Knowledge Management, CIKM '13, 2249–2254. ACM Press. doi:10.1145/2505515.2505739. PDF. BibTeX.
@inproceedings{magnani_skyview_2013,
    author = {Magnani, Matteo and Assent, Ira and Hornbæk, Kasper and Jakobsen, Mikkel R. and Larsen, Ken Friis},
    title = {{SkyView}: a user evaluation of the skyline operator},
    doi = {10.1145/2505515.2505739},
    series = {{CIKM} '13},
    pages = {2249--2254},
    booktitle = {Proceedings of the 22nd {ACM} international conference on Information \& Knowledge Management},
    publisher = {{ACM} Press},
    year = {2013}
}
Abstract.
The skyline operator has recently emerged as an alternative to ranking queries. It retrieves a number of potential best options for arbitrary monotone preference functions. The success of this operator in the database community is based on the belief that users benefit from the limited effort required to specify skyline queries compared to, for instance, ranking. While application examples of the skyline operator exist, there is no principled analysis of its benefits and limitations in data retrieval tasks. Our study investigates the degree to which users understand skyline queries, how they specify query parameters and how they interact with skyline results made available in listings or map-based interfaces.
Topics: skypatterns.
28
Métivier, Jean-Philippe et al. (2012). A constraint language for declarative pattern discovery. In Proceedings of the 27th Annual ACM Symposium on Applied Computing, SAC '12, 119–125. ACM Press. doi:10.1145/2245276.2245302. PDF. BibTeX.
@inproceedings{metivier_constraint_2012,
    author = {Métivier, Jean-Philippe and Boizumault, Patrice and Crémilleux, Bruno and Khiari, Mehdi and Loudni, Samir},
    title = {A constraint language for declarative pattern discovery},
    doi = {10.1145/2245276.2245302},
    pages = {119--125},
    booktitle = {Proceedings of the 27th Annual {ACM} Symposium on Applied Computing, {SAC} '12},
    publisher = {{ACM} Press},
    year = {2012}
}
Abstract.
Discovering pattern sets or global patterns is an attractive issue from the pattern mining community in order to provide useful information. By combining local patterns satisfying a joint meaning, this approach produces patterns of higher level and thus more useful for the end-user than the usual local patterns. In parallel, recent works investigating relationships between data mining and constraint programming (CP) show that the CP paradigm is a powerful framework to model and mine patterns in a declarative and generic way. We present a constraint-based language which enables us to define queries in a declarative way addressing pattern sets and global patterns. By specifying what the task is, rather than providing how the solution should be computed, it is easy to process by stepwise refinements to successfully discover global patterns. The usefulness of the approach is highlighted by several examples coming from the clustering based on associations. All primitive constraints of the language are modeled and solved using the SAT framework. We illustrate the efficiency of our approach through several experiments.
Topics: constraints.
29
Miettinen, Pauli and Jilles Vreeken (2014). MDL4BMF: minimum description length for Boolean matrix factorization. ACM Transactions on Knowledge Discovery from Data (TKDD), 8(4):18:1–18:31. doi:10.1145/2601437. PDF. BibTeX.
@article{miettinen_mdl4bmf_2014,
    author = {Miettinen, Pauli and Vreeken, Jilles},
    title = {{MDL}4BMF: Minimum Description Length for Boolean Matrix Factorization},
    volume = {8},
    doi = {10.1145/2601437},
    pages = {18:1--18:31},
    number = {4},
    journal = {{ACM} Transactions on Knowledge Discovery from Data ({TKDD})},
    year = {2014}
}
Abstract.
Matrix factorizations—where a given data matrix is approximated by a product of two or more factor matrices—are powerful data mining tools. Among other tasks, matrix factorizations are often used to separate global structure from noise. This, however, requires solving the “model order selection problem” of determining the proper rank of the factorization, that is, to answer where fine-grained structure stops, and where noise starts. Boolean Matrix Factorization (BMF)—where data, factors, and matrix product are Boolean—has in recent years received increased attention from the data mining community. The technique has desirable properties, such as high interpretability and natural sparsity. Yet, so far no method for selecting the correct model order for BMF has been available. In this article, we propose the use of the Minimum Description Length (MDL) principle for this task. Besides solving the problem, this well-founded approach has numerous benefits; for example, it is automatic, does not require a likelihood function, is fast, and, as experiments show, is highly accurate. We formulate the description length function for BMF in general—making it applicable for any BMF algorithm. We discuss how to construct an appropriate encoding: starting from a simple and intuitive approach, we arrive at a highly efficient data-to-model–based encoding for BMF. We extend an existing algorithm for BMF to use MDL to identify the best Boolean matrix factorization, analyze the complexity of the problem, and perform an extensive experimental evaluation to study its behavior.
Topics: MDL principle.
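A strongly simplified sketch of the model-order idea: score a Boolean factorization by naive one-bit-per-cell codes for the factor matrices plus an entropy-based code for the residual, and prefer the factorization with the smaller total. The paper develops far more refined encodings, and the factors below are assumed given rather than mined.
import numpy as np

def boolean_product(B, C):
    return (B @ C > 0).astype(int)

def description_length(D, B, C):
    # Naive encoding: one bit per factor cell, plus an entropy code for the
    # cells where the Boolean product disagrees with the data.
    n, m = D.shape
    k = B.shape[1]
    factor_bits = (n + m) * k
    errors = int(np.sum(D != boolean_product(B, C)))
    e = errors / (n * m)
    if e in (0.0, 1.0):
        error_bits = 0.0
    else:
        error_bits = n * m * (-e * np.log2(e) - (1 - e) * np.log2(1 - e))
    return factor_bits + error_bits

rng = np.random.default_rng(3)
B_true = (rng.random((30, 2)) < 0.3).astype(int)
C_true = (rng.random((2, 20)) < 0.3).astype(int)
D = boolean_product(B_true, C_true)
print(description_length(D, B_true, C_true))                                   # true rank-2 factors
print(description_length(D, np.zeros((30, 1), int), np.zeros((1, 20), int)))   # empty rank-1 model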
30
Mihelčić, Matej, Sašo Džeroski, et al. (2017). A framework for redescription set construction. Expert Systems with Applications, 68:196–215. doi:10.1016/j.eswa.2016.10.012. PDF. BibTeX.
@article{mihelcic_framework_2017,
    author = {Mihelčić, Matej and Džeroski, Sašo and Lavrač, Nada and Šmuc, Tomislav},
    title = {A framework for redescription set construction},
    volume = {68},
    doi = {10.1016/j.eswa.2016.10.012},
    pages = {196--215},
    issue = {C},
    journal = {Expert Systems with Applications},
    year = {2017}
}
Abstract.
A framework producing large number of highly accurate redescriptions is proposed. User-guided multi-objective optimization used for redescription set construction. Conjunctive refinement significantly increases redescription accuracy. Using Pessimistic Jaccard index can lead to discarding some valuable redescriptions. Proposed approach has advantages over currently available approaches. Redescription mining is a field of knowledge discovery that aims at finding different descriptions of similar subsets of instances in the data. These descriptions are represented as rules inferred from one or more disjoint sets of attributes, called views. As such, they support knowledge discovery process and help domain experts in formulating new hypotheses or constructing new knowledge bases and decision support systems. In contrast to previous approaches that typically create one smaller set of redescriptions satisfying a pre-defined set of constraints, we introduce a framework that creates large and heterogeneous redescription set from which user/expert can extract compact sets of differing properties, according to its own preferences. Construction of large and heterogeneous redescription set relies on CLUS-RM algorithm and a novel, conjunctive refinement procedure that facilitates generation of larger and more accurate redescription sets. The work also introduces the variability of redescription accuracy when missing values are present in the data, which significantly extends applicability of the method. Crucial part of the framework is the redescription set extraction based on heuristic multi-objective optimization procedure that allows user to define importance levels towards one or more redescription quality criteria. We provide both theoretical and empirical comparison of the novel framework against current state of the art redescription mining algorithms and show that it represents more efficient and versatile approach for mining redescriptions from data.
Topics: redescription mining.
31
Mihelčić, Matej, Goran Šimić, et al. (2017). Using redescription mining to relate clinical and biological characteristics of cognitively impaired and Alzheimer’s disease patients. PLOS ONE, 12(10):e0187364. doi:10.1371/journal.pone.0187364. PDF. BibTeX.
@article{mihelcic_using_2017,
    author = {Mihelčić, Matej and Šimić, Goran and Leko, Mirjana Babić and Lavrač, Nada and Džeroski, Sašo and Šmuc, Tomislav and Initiative, for the Alzheimer’s Disease Neuroimaging},
    title = {Using redescription mining to relate clinical and biological characteristics of cognitively impaired and Alzheimer’s disease patients},
    volume = {12},
    doi = {10.1371/journal.pone.0187364},
    pages = {e0187364},
    number = {10},
    journal = {{PLOS} {ONE}},
    year = {2017}
}
Abstract.
Based on a set of subjects and a collection of attributes obtained from the Alzheimer’s Disease Neuroimaging Initiative database, we used redescription mining to find interpretable rules revealing associations between those determinants that provide insights about the Alzheimer’s disease (AD). We extended the CLUS-RM redescription mining algorithm to a constraint-based redescription mining (CBRM) setting, which enables several modes of targeted exploration of specific, user-constrained associations. Redescription mining enabled finding specific constructs of clinical and biological attributes that describe many groups of subjects of different size, homogeneity and levels of cognitive impairment. We confirmed some previously known findings. However, in some instances, as with the attributes: testosterone, ciliary neurotrophic factor, brain natriuretic peptide, Fas ligand, the imaging attribute Spatial Pattern of Abnormalities for Recognition of Early AD, as well as the levels of leptin and angiopoietin-2 in plasma, we corroborated previously debatable findings or provided additional information about these variables and their association with AD pathogenesis. Moreover, applying redescription mining on ADNI data resulted with the discovery of one largely unknown attribute: the Pregnancy-Associated Protein-A (PAPP-A), which we found highly associated with cognitive impairment in AD. Statistically significant correlations (p ≤ 0.01) were found between PAPP-A and clinical tests: Alzheimer’s Disease Assessment Scale, Clinical Dementia Rating Sum of Boxes, Mini Mental State Examination, etc. The high importance of this finding lies in the fact that PAPP-A is a metalloproteinase, known to cleave insulin-like growth factor binding proteins. Since it also shares similar substrates with A Disintegrin and the Metalloproteinase family of enzymes that act as α-secretase to physiologically cleave amyloid precursor protein (APP) in the non-amyloidogenic pathway, it could be directly involved in the metabolism of APP very early during the disease course. Therefore, further studies should investigate the role of PAPP-A in the development of AD more thoroughly.
Topics: redescription mining.
32
Négrevergne, Benjamin et al. (2013). Dominance programming for itemset mining. In Proceedings of the 13th IEEE International Conference on Data Mining, ICDM '13, 557–566. IEEE Computer Society. doi:10.1109/ICDM.2013.92. PDF. BibTeX.
@inproceedings{negrevergne_dominance_2013,
    author = {Négrevergne, Benjamin and Dries, Anton and Guns, Tias and Nijssen, Siegfried},
    title = {Dominance Programming for Itemset Mining},
    doi = {10.1109/ICDM.2013.92},
    pages = {557--566},
    booktitle = {Proceedings of the 13th {IEEE} International Conference on Data Mining, {ICDM} '13},
    publisher = {{IEEE} Computer Society},
    year = {2013}
}
Abstract.
Finding small sets of interesting patterns is an important challenge in pattern mining. In this paper, we argue that several well-known approaches that address this challenge are based on performing pairwise comparisons between patterns. Examples include finding closed patterns, free patterns, relevant subgroups and skyline patterns. Although progress has been made on each of these individual problems, a generic approach for solving these problems (and more) is still lacking. This paper tackles this challenge. It proposes a novel, generic approach for handling pattern mining problems that involve pairwise comparisons between patterns. Our key contributions are the following. First, we propose a novel algebra for programming pattern mining problems. This algebra extends relational algebras in a novel way towards pattern mining. It allows for the generic combination of constraints on individual patterns with dominance relations between patterns. Second, we introduce a modified generic constraint satisfaction system to evaluate these algebraic expressions. Experiments show that this generic approach can indeed effectively identify patterns expressed in the algebra.
Topics: skypatterns.
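Example.
The paper's unifying observation is that conditions such as closedness can be stated as pairwise dominance checks between patterns. Below is a minimal Python sketch of that reading on toy transactions with a naive miner, not the paper's algebra or constraint solver: an itemset is kept as closed exactly when no strict superset has the same support.
from itertools import combinations

def mine_frequent(transactions, minsup):
    """Naive frequent-itemset miner (enumeration by size); fine for tiny data."""
    items = sorted({i for t in transactions for i in t})
    frequent, k = {}, 1
    while True:
        found = False
        for cand in combinations(items, k):
            s = frozenset(cand)
            sup = sum(1 for t in transactions if s <= t)
            if sup >= minsup:
                frequent[s] = sup
                found = True
        if not found:
            break
        k += 1
    return frequent

def closed_by_dominance(frequent):
    """Keep itemsets not dominated by a strict superset with equal support."""
    return {s: sup for s, sup in frequent.items()
            if not any(s < t and sup == frequent[t] for t in frequent)}

transactions = [frozenset(t) for t in
                (["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"])]
print(closed_by_dominance(mine_frequent(transactions, minsup=2)))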
33
Paramonov, Sergey, Daria Stepanova, and Pauli Miettinen (2017). Hybrid ASP-based approach to pattern mining. In Proceedings of the International Joint Conference on Rules and Reasoning, RuleML+RR '17, 199–214. Springer International Publishing. doi:10.1007/978-3-319-61252-2_14. PDF. BibTeX.
@inproceedings{paramonov_hybrid_2017,
    author = {Paramonov, Sergey and Stepanova, Daria and Miettinen, Pauli},
    title = {Hybrid {ASP}-Based Approach to Pattern Mining},
    doi = {10.1007/978-3-319-61252-2_14},
    pages = {199--214},
    booktitle = {Proceedings of the International Joint Conference on Rules and Reasoning, {RuleML}+{RR} '17},
    publisher = {Springer International Publishing},
    year = {2017}
}
Abstract.
Detecting small sets of relevant patterns from a given dataset is a central challenge in data mining. The relevance of a pattern is based on user-provided criteria; typically, all patterns that satisfy certain criteria are considered relevant. Rule-based languages like Answer Set Programming (ASP) seem well-suited for specifying such criteria in a form of constraints. Although progress has been made, on the one hand, on solving individual mining problems and, on the other hand, developing generic mining systems, the existing methods either focus on scalability or on generality. In this paper we make steps towards combining local (frequency, size, cost) and global (various condensed representations like maximal, closed, skyline) constraints in a generic and efficient way. We present a hybrid approach for itemset and sequence mining which exploits dedicated highly optimized mining systems to detect frequent patterns and then filters the results using declarative ASP. Experiments on real-world datasets show the effectiveness of the proposed method and computational gains both for itemset and sequence mining.
Topics: constraints.
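Example.
The hybrid approach first lets a dedicated miner enumerate frequent patterns and then filters them declaratively under local and global constraints. The following sketch shows that pipeline shape in plain Python, with the ASP filtering step stood in for by ordinary predicates; the names local_ok and is_maximal are illustrative, not part of the authors' system.
from itertools import combinations

def frequent_itemsets(transactions, minsup):
    """Stand-in for the dedicated, highly optimized miner of the first stage."""
    items = sorted({i for t in transactions for i in t})
    out = {}
    for k in range(1, len(items) + 1):
        hits = 0
        for cand in combinations(items, k):
            s = frozenset(cand)
            sup = sum(1 for t in transactions if s <= t)
            if sup >= minsup:
                out[s] = sup
                hits += 1
        if hits == 0:
            break
    return out

def local_ok(itemset, support, min_size=2):
    """Local constraint: evaluated on a single pattern (size, cost, frequency...)."""
    return len(itemset) >= min_size

def is_maximal(itemset, frequent):
    """Global constraint: no frequent strict superset exists."""
    return not any(itemset < other for other in frequent)

transactions = [frozenset(t) for t in
                (["a", "b", "c"], ["a", "b", "d"], ["a", "b"], ["c", "d"])]
freq = frequent_itemsets(transactions, minsup=2)
answer = {s: v for s, v in freq.items() if local_ok(s, v) and is_maximal(s, freq)}
print(answer)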
34
Priss, Uta (2007). Formal concept analysis in information science. Annual Review of Information Science and Technology, 40(1):521–543. PDF. BibTeX.
@article{priss_formal_2007,
    author = {Priss, Uta},
    title = {Formal concept analysis in information science},
    volume = {40},
    pages = {521--543},
    number = {1},
    journal = {Annual Review of Information Science and Technology},
    year = {2007}
}
Topics: FCA.
35
Puolamäki, Kai et al. (2020). Interactive visual data exploration with subjective feedback: an information-theoretic approach. Data Mining and Knowledge Discovery, 34(1):21–49. doi:10.1007/s10618-019-00655-x. PDF. BibTeX.
@article{puolamaki_interactive_2020,
    author = {Puolamäki, Kai and Oikarinen, Emilia and Kang, Bo and Lijffijt, Jefrey and De Bie, Tijl},
    title = {Interactive visual data exploration with subjective feedback: an information-theoretic approach},
    volume = {34},
    doi = {10.1007/s10618-019-00655-x},
    pages = {21--49},
    number = {1},
    journal = {Data Mining and Knowledge Discovery},
    year = {2020}
}
Abstract.
Visual exploration of high-dimensional real-valued datasets is a fundamental task in exploratory data analysis (EDA). Existing projection methods for data visualization use predefined criteria to choose the representation of data. There is a lack of methods that (i) use information on what the user has learned from the data and (ii) show patterns that she does not know yet. We construct a theoretical model where identified patterns can be input as knowledge to the system. The knowledge syntax here is intuitive, such as “this set of points forms a cluster”, and requires no knowledge of maths. This background knowledge is used to find a maximum entropy distribution of the data, after which the user is provided with data projections for which the data and the maximum entropy distribution differ the most, hence showing the user aspects of data that are maximally informative given the background knowledge. We study the computational performance of our model and present use cases on synthetic and real data. We find that the model allows the user to learn information efficiently from various data sources and works sufficiently fast in practice. In addition, we provide an open source EDA demonstrator system implementing our model with tailored interactive visualizations. We conclude that the information theoretic approach to EDA where patterns observed by a user are formalized as constraints provides a principled, intuitive, and efficient basis for constructing an EDA system.
Topics: maximum entropy.
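Example.
A toy rendering of the underlying idea, not the authors' information-theoretic criterion: build a maximum entropy background for the stated constraints (here only column means and variances, whose maximum entropy model is an independent Gaussian), then look for the projection in which the observed data deviates most from that background.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with a hidden correlation the background does not encode.
n = 500
x = rng.normal(size=n)
data = np.column_stack([x, 0.8 * x + 0.2 * rng.normal(size=n), rng.normal(size=n)])

# Background knowledge: per-column mean and variance only.
mu, sigma = data.mean(axis=0), data.std(axis=0)

# Whiten with the background model; under the background the result has
# identity covariance, so eigenvalues far from 1 mark surprising directions.
z = (data - mu) / sigma
eigvals, eigvecs = np.linalg.eigh(np.cov(z, rowvar=False))
surprise = np.abs(np.log(eigvals))          # symmetric deviation from 1
best = eigvecs[:, np.argmax(surprise)]      # most informative 1-D projection

print("eigenvalues:", np.round(eigvals, 2))
print("projection direction:", np.round(best, 2))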
36
Smets, Koen and Jilles Vreeken (2012). Slim: directly mining descriptive patterns. In Proceedings of the 2012 SIAM International Conference on Data Mining, SDM '12, 236–247. SIAM. doi:10.1137/1.9781611972825.21. PDF. BibTeX.
@inproceedings{smets_slim_2012,
    author = {Smets, Koen and Vreeken, Jilles},
    title = {Slim: Directly Mining Descriptive Patterns},
    doi = {10.1137/1.9781611972825.21},
    pages = {236--247},
    booktitle = {Proceedings of the 2012 {SIAM} International Conference on Data Mining, {SDM} '12},
    publisher = {{SIAM}},
    year = {2012}
}
Abstract.
Compression-based pattern mining has been successfully applied to many data mining tasks. We propose an approach based on the minimum description length principle to extract sequential patterns that compress a database of sequences well. We show that mining compressing patterns is NP-Hard and belongs to the class of inapproximable problems. We propose two heuristic algorithms for mining compressing patterns. The first uses a two-phase approach similar to Krimp for itemset data. To overcome the performance cost of the required candidate generation, we propose GoKrimp, an effective greedy algorithm that directly mines compressing patterns. We conduct an empirical study on six real-life datasets to compare the proposed algorithms by run time, compressibility, and classification accuracy using the patterns found as features for SVM classifiers.
Topics: MDL principle.
37
Strona, Giovanni, Werner Ulrich, and Nicholas J. Gotelli (2018). Bi-dimensional null model analysis of presence-absence binary matrices. Ecology, 99(1):103–115. doi:10.1002/ecy.2043. PDF. BibTeX.
@article{strona_bi-dimensional_2018,
    author = {Strona, Giovanni and Ulrich, Werner and Gotelli, Nicholas J.},
    title = {Bi-dimensional null model analysis of presence-absence binary matrices},
    volume = {99},
    doi = {10.1002/ecy.2043},
    pages = {103--115},
    number = {1},
    journal = {Ecology},
    year = {2018}
}
Abstract.
Comparing the structure of presence/absence (i.e., binary) matrices with those of randomized counterparts is a common practice in ecology. However, differences in the randomization procedures (null models) can affect the results of the comparisons, leading matrix structural patterns to appear either “random” or not. Subjectivity in the choice of one particular null model over another makes it often advisable to compare the results obtained using several different approaches. Yet, available algorithms to randomize binary matrices differ substantially with respect to the constraints they impose on the discrepancy between observed and randomized row and column marginal totals, which complicates the interpretation of contrasting patterns. This calls for new strategies both to explore intermediate scenarios of restrictiveness in-between extreme constraint assumptions, and to properly synthesize the resulting information. Here we introduce a new modeling framework based on a flexible matrix randomization algorithm (named the “Tuning Peg” algorithm) that addresses both issues. The algorithm consists of a modified swap procedure in which the discrepancy between the row and column marginal totals of the target matrix and those of its randomized counterpart can be “tuned” in a continuous way by two parameters (controlling, respectively, row and column discrepancy). We show how combining the Tuning Peg with a wise random walk procedure makes it possible to explore the complete null space embraced by existing algorithms. This exploration allows researchers to visualize matrix structural patterns in an innovative bi-dimensional landscape of significance/effect size. We demonstrate the rationale and potential of our approach with a set of simulated and real matrices, showing how the simultaneous investigation of a comprehensive and continuous portion of the null space can be extremely informative, and possibly key to resolving longstanding debates in the analysis of ecological matrices.
Topics: randomization.
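Example.
The Tuning Peg is a modified swap procedure; for orientation, here is a sketch of the classic fixed-fixed checkerboard swap it relaxes, which preserves all row and column totals exactly. The continuous tuning of marginal discrepancy itself is not reproduced.
import numpy as np

def ff_swap_randomise(matrix, n_swaps=10_000, seed=0):
    """Fixed-fixed null model: repeatedly flip 2x2 checkerboard submatrices,
    which leaves every row and column total of the binary matrix unchanged."""
    rng = np.random.default_rng(seed)
    m = matrix.copy()
    rows, cols = m.shape
    for _ in range(n_swaps):
        r = rng.choice(rows, size=2, replace=False)
        c = rng.choice(cols, size=2, replace=False)
        sub = m[np.ix_(r, c)]
        if sub[0, 0] == sub[1, 1] and sub[0, 1] == sub[1, 0] and sub[0, 0] != sub[0, 1]:
            m[np.ix_(r, c)] = 1 - sub   # flip the checkerboard
    return m

presence = np.array([[1, 1, 0, 0],
                     [1, 0, 1, 0],
                     [0, 1, 0, 1],
                     [0, 0, 1, 1]])
null = ff_swap_randomise(presence)
assert (null.sum(axis=0) == presence.sum(axis=0)).all()
assert (null.sum(axis=1) == presence.sum(axis=1)).all()
print(null)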
38
Tatti, Nikolaj (2008). Maximum entropy based significance of itemsets. Knowledge and Information Systems, 17(1):57–77. doi:10.1007/s10115-008-0128-4. PDF. BibTeX.
@article{tatti_maximum_2008,
    author = {Tatti, Nikolaj},
    title = {Maximum entropy based significance of itemsets},
    volume = {17},
    doi = {10.1007/s10115-008-0128-4},
    pages = {57--77},
    number = {1},
    journal = {Knowledge and Information Systems},
    year = {2008}
}
Abstract.
We consider the problem of defining the significance of an itemset. We say that the itemset is significant if we are surprised by its frequency when compared to the frequencies of its sub-itemsets. In other words, we estimate the frequency of the itemset from the frequencies of its sub-itemsets and compute the deviation between the real value and the estimate. For the estimation we use Maximum Entropy and for measuring the deviation we use Kullback–Leibler divergence. A major advantage compared to the previous methods is that we are able to use richer models whereas the previous approaches only measure the deviation from the independence model. We show that our measure of significance goes to zero for derivable itemsets and that we can use the rank as a statistical test. Our empirical results demonstrate that for our real datasets the independence assumption is too strong but applying more flexible models leads to good results.
Topics: maximum entropy.
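Example.
A simplified illustration of the significance measure, using only the independence estimate, i.e. the baseline the paper generalizes with richer maximum entropy models: estimate the itemset's frequency from its singletons and measure the deviation with the Kullback–Leibler divergence between the two Bernoulli distributions.
from math import log

def bernoulli_kl(p, q):
    """KL(p || q) for two Bernoulli distributions, in nats."""
    return p * log(p / q) + (1 - p) * log((1 - p) / (1 - q))

transactions = [frozenset(t) for t in
                (["a", "b"], ["a", "b"], ["a", "b", "c"], ["a"], ["b"], ["c"])]
n = len(transactions)
itemset = frozenset(["a", "b"])

# Observed frequency of the itemset versus the independence estimate
# obtained from its singleton sub-itemsets.
observed = sum(1 for t in transactions if itemset <= t) / n
independent = 1.0
for item in itemset:
    independent *= sum(1 for t in transactions if item in t) / n

print(f"observed {observed:.2f}, independence estimate {independent:.2f}, "
      f"KL deviation {bernoulli_kl(observed, independent):.3f} nats")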
39
Tatti, Nikolaj (2010). Probably the best itemsets. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '10, 293–302. ACM Press. doi:10.1145/1835804.1835843. PDF. BibTeX.
@inproceedings{tatti_probably_2010,
    author = {Tatti, Nikolaj},
    title = {Probably the best itemsets},
    doi = {10.1145/1835804.1835843},
    pages = {293--302},
    booktitle = {Proceedings of the 16th {ACM} {SIGKDD} international conference on Knowledge discovery and data mining, {KDD} '10},
    publisher = {{ACM} Press},
    year = {2010}
}
Abstract.
One of the main current challenges in itemset mining is to discover a small set of high-quality itemsets. In this paper we propose a new and general approach for measuring the quality of itemsets. The method is solidly founded in Bayesian statistics and decreases monotonically, allowing for efficient discovery of all interesting itemsets. The measure is defined by connecting statistical models and collections of itemsets. This allows us to score individual itemsets with the probability of them occurring in random models built on the data. As a concrete example of this framework we use exponential models. This class of models possesses many desirable properties. Most importantly, Occam's razor in Bayesian model selection provides a defence against the pattern explosion. As general exponential models are infeasible in practice, we use decomposable models; a large sub-class for which the measure is solvable. For the actual computation of the score we sample models from the posterior distribution using an MCMC approach. Experimentation on our method demonstrates the measure works in practice and results in interpretable and insightful itemsets for both synthetic and real-world data.
Topics: interest measures.
40
Tatti, Nikolaj and Jilles Vreeken (2012a). Discovering descriptive tile trees. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECML PKDD '12, Lecture Notes in Computer Science, 9–24. Springer Berlin Heidelberg. doi:10.1007/978-3-642-33460-3_6. PDF. BibTeX.
@inproceedings{tatti_discovering_2012,
    author = {Tatti, Nikolaj and Vreeken, Jilles},
    title = {Discovering Descriptive Tile Trees},
    doi = {10.1007/978-3-642-33460-3_6},
    series = {Lecture Notes in Computer Science},
    pages = {9--24},
    booktitle = {Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, {ECML} {PKDD} '12},
    publisher = {Springer Berlin Heidelberg},
    year = {2012}
}
Abstract.
When analysing binary data, the ease with which one can interpret results is very important. Many existing methods, however, discover either models that are difficult to read, or return so many results that interpretation becomes impossible. Here, we study a fully automated approach for mining easily interpretable models for binary data. We model data hierarchically with noisy tiles—rectangles with significantly different density than their parent tile. To identify good trees, we employ the Minimum Description Length principle. We propose Stijl, a greedy any-time algorithm for mining good tile trees from binary data. Iteratively, it finds the locally optimal addition to the current tree, allowing overlap with tiles of the same parent. A major result of this paper is that we find the optimal tile in only Θ(NM min(N,M)) time. Stijl can either be employed as a top-k miner, or by MDL we can identify the tree that describes the data best. Experiments show we find succinct models that accurately summarise the data, and, by their hierarchical property, are easily interpretable.
Topics: MDL principle.
41
Tatti, Nikolaj and Jilles Vreeken (2012b). The long and the short of it: summarising event sequences with serial episodes. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '12, 462–470. ACM Press. doi:10.1145/2339530.2339606. PDF. BibTeX.
@inproceedings{tatti_long_2012,
    author = {Tatti, Nikolaj and Vreeken, Jilles},
    title = {The long and the short of it: summarising event sequences with serial episodes},
    doi = {10.1145/2339530.2339606},
    pages = {462--470},
    booktitle = {Proceedings of the 18th {ACM} {SIGKDD} international conference on Knowledge discovery and data mining, {KDD} '12},
    publisher = {{ACM} Press},
    year = {2012}
}
Abstract.
An ideal outcome of pattern mining is a small set of informative patterns, containing no redundancy or noise, that identifies the key structure of the data at hand. Standard frequent pattern miners do not achieve this goal, as due to the pattern explosion typically very large numbers of highly redundant patterns are returned. We pursue the ideal for sequential data, by employing a pattern set mining approach - an approach where, instead of ranking patterns individually, we consider results as a whole. Pattern set mining has been successfully applied to transactional data, but has been surprisingly understudied for sequential data. In this paper, we employ the MDL principle to identify the set of sequential patterns that summarises the data best. In particular, we formalise how to encode sequential data using sets of serial episodes, and use the encoded length as a quality score. As search strategy, we propose two approaches: the first algorithm selects a good pattern set from a large candidate set, while the second is a parameter-free any-time algorithm that mines pattern sets directly from the data. Experimentation on synthetic and real data demonstrates we efficiently discover small sets of informative patterns.
Topics: MDL principle.
42
Ugarte, Willy, Patrice Boizumault, Samir Loudni, Bruno Crémilleux, et al. (2014). Mining (soft-) skypatterns using dynamic CSP. In Proceedings of the International Conference on Integration of Constraint Programming, Artificial Intelligence, and Operations Research, CPAIOR '14, volume 8451 of Lecture Notes in Computer Science, 71–87. Springer. doi:10.1007/978-3-319-07046-9_6. PDF. BibTeX.
@inproceedings{ugarte_mining_2014,
    author = {Ugarte, Willy and Boizumault, Patrice and Loudni, Samir and Crémilleux, Bruno and Lepailleur, Alban},
    title = {Mining (Soft-) Skypatterns Using Dynamic {CSP}},
    volume = {8451},
    doi = {10.1007/978-3-319-07046-9_6},
    series = {Lecture Notes in Computer Science},
    pages = {71--87},
    booktitle = {Proceedings of the International Conference on Integration of Constraint Programming, Artificial Intelligence, and Operations Research, {CPAIOR} '14},
    publisher = {Springer},
    year = {2014}
}
Abstract.
Within the pattern mining area, skypatterns make it possible to express a user-preference point of view according to a dominance relation. In this paper, we deal with the introduction of softness in the skypattern mining problem. First, we show how softness can provide convenient patterns that would be missed otherwise. Then, thanks to Dynamic CSP, we propose a generic and efficient method to mine skypatterns as well as soft ones. Finally, we show the relevance and the effectiveness of our approach through a case study in chemoinformatics and experiments on UCI benchmarks.
Topics: constraints, skypatterns.
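Example.
A plain (non-soft) skypattern computation reduces to a Pareto filter over the chosen measures. A minimal sketch, assuming each pattern already carries its measure values; frequency and length are toy choices, both to be maximised. The soft variants discussed in the paper would additionally keep patterns that are dominated only within some slack.
def dominates(a, b):
    """a dominates b if a is >= b on every measure and > on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def skyline(patterns):
    """patterns: dict mapping a pattern to its tuple of measure values."""
    return {p: m for p, m in patterns.items()
            if not any(dominates(other, m) for q, other in patterns.items() if q != p)}

patterns = {
    "AB":  (5, 2),   # (frequency, length)
    "ABC": (3, 3),
    "A":   (6, 1),
    "BC":  (3, 2),   # dominated by AB: not a skypattern
}
print(skyline(patterns))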
43
Ugarte, Willy, Patrice Boizumault, Samir Loudni, Bruno Crémilleux, et al. (2015). Soft constraints for pattern mining. Journal of Intelligent Information Systems, 44(2):193–221. doi:10.1007/s10844-013-0281-4. PDF. BibTeX.
@article{ugarte_soft_2015,
    author = {Ugarte, Willy and Boizumault, Patrice and Loudni, Samir and Crémilleux, Bruno and Lepailleur, Alban},
    title = {Soft constraints for pattern mining},
    volume = {44},
    doi = {10.1007/s10844-013-0281-4},
    pages = {193--221},
    number = {2},
    journal = {Journal of Intelligent Information Systems},
    year = {2015}
}
Abstract.
Constraint-based pattern discovery is at the core of numerous data mining tasks. Patterns are extracted with respect to a given set of constraints (frequency, closedness, size, etc.). In practice, many constraints require threshold values whose choice is often arbitrary. This difficulty becomes even greater when several thresholds are required and have to be combined. Moreover, patterns barely missing a threshold will not be extracted even if they may be relevant. The paper advocates the introduction of softness into the pattern discovery process. By using Constraint Programming, we propose efficient methods to relax threshold constraints as well as constraints involved in patterns such as the top-k patterns and the skypatterns. We show the relevance and the efficiency of our approach through a case study in chemoinformatics for discovering toxicophores.
Topics: constraints, skypatterns.
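Example.
A minimal sketch of a softened threshold constraint, using a single multiplicative slack factor as an illustrative relaxation rather than the paper's exact Constraint Programming formulation: a pattern that barely misses the frequency threshold is still returned.
def soft_frequency_ok(frequency, threshold, slack=0.1):
    """Hard constraint: frequency >= threshold; soft: allow a relative shortfall."""
    return frequency >= (1 - slack) * threshold

candidates = {"AB": 48, "CD": 55, "EF": 30}   # pattern -> frequency
threshold = 50

hard = {p for p, f in candidates.items() if f >= threshold}
soft = {p for p, f in candidates.items() if soft_frequency_ok(f, threshold)}
print("hard:", hard)   # {'CD'}
print("soft:", soft)   # {'CD', 'AB'}  (AB barely misses and is kept under softness)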
44
Ulrich, Werner and Nicholas J. Gotelli (2012). A null model algorithm for presence–absence matrices based on proportional resampling. Ecological Modelling, 244:20–27. doi:10.1016/j.ecolmodel.2012.06.030. PDF. BibTeX.
@article{ulrich_null_2012,
    author = {Ulrich, Werner and Gotelli, Nicholas J.},
    title = {A null model algorithm for presence–absence matrices based on proportional resampling},
    volume = {244},
    doi = {10.1016/j.ecolmodel.2012.06.030},
    pages = {20--27},
    journal = {Ecological Modelling},
    year = {2012}
}
Abstract.
Ecological presence–absence matrices capture information of species occurrences among a number of sites. Statistical inference of matrix structure often uses a fixed–fixed (FF) null model in which matrix entries are randomized, but the row and column totals of each random matrix match those of the original matrix. However, in a stochastically assembled meta-community, row and column totals of a random assemblage might be expected to vary among matrices. Here we introduce a 4-step proportional–proportional (PP) algorithm that creates null matrices in which the row and column totals vary randomly, but the average row and column totals in a set of PP matrices are unbiased and match those of the original matrix. We tested the performance of the PP algorithm with 5 sets of artificial matrices and one large set of 288 published empirical matrices. Compared to the FF algorithm, the PP algorithm has better power to detect segregated and nested matrices, but it is vulnerable to Type I errors if row and column sums have small variances. The PP algorithm identified only 9% of empirical matrices as significantly segregated, compared with 30% identified by the traditional FF algorithm. The choice of whether to use the PP or the FF algorithm is similar to the distinction between random and fixed effects in a mixed-model ANOVA. For robust analysis, it may be desirable to use both the PP and the FF algorithms with the same data matrix.
Topics: randomization.
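Example.
A simplified proportional resampling step in the spirit of the PP algorithm: each cell is occupied with probability proportional to the product of its row and column totals, so the expected marginals match the observed ones. The published 4-step procedure adds corrections, omitted here, to keep those expectations unbiased after clipping.
import numpy as np

def proportional_null(matrix, seed=0):
    """Draw one null matrix: cell (i, j) is occupied with probability
    p_ij = (row total i) * (column total j) / grand total, so the expected
    row and column sums equal the observed ones (before clipping at 1)."""
    rng = np.random.default_rng(seed)
    row_tot = matrix.sum(axis=1, keepdims=True).astype(float)
    col_tot = matrix.sum(axis=0, keepdims=True).astype(float)
    p = np.clip(row_tot * col_tot / matrix.sum(), 0.0, 1.0)
    return (rng.random(matrix.shape) < p).astype(int)

presence = np.array([[1, 1, 1, 0],
                     [1, 1, 0, 0],
                     [1, 0, 0, 0],
                     [0, 1, 0, 1]])
print(proportional_null(presence))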
45
Webb, Geoffrey I. (2007). Discovering significant patterns. Machine Learning, 68(1):1–33. doi:10.1007/s10994-007-5006-x. PDF. BibTeX.
@article{webb_discovering_2007,
    author = {Webb, Geoffrey I.},
    title = {Discovering Significant Patterns},
    volume = {68},
    doi = {10.1007/s10994-007-5006-x},
    pages = {1--33},
    number = {1},
    journal = {Machine Learning},
    year = {2007}
}
Abstract.
Pattern discovery techniques, such as association rule discovery, explore large search spaces of potential patterns to find those that satisfy some user-specified constraints. Due to the large number of patterns considered, they suffer from an extreme risk of type-1 error, that is, of finding patterns that appear due to chance alone to satisfy the constraints on the sample data. This paper proposes techniques to overcome this problem by applying well-established statistical practices. These allow the user to enforce a strict upper limit on the risk of experimentwise error. Empirical studies demonstrate that standard pattern discovery techniques can discover numerous spurious patterns when applied to random data and when applied to real-world data result in large numbers of patterns that are rejected when subjected to sound statistical evaluation. They also reveal that a number of pragmatic choices about how such tests are performed can greatly affect their power.
Topics: interest measures.
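Example.
A sketch of the statistical practice the paper advocates: test a candidate association with Fisher's exact test and bound the experimentwise error by a Bonferroni adjustment for the size of the explored pattern space. The paper's holdout-based alternative is not shown, and the counts and candidate-space size below are toy values.
from scipy.stats import fisher_exact

# 2x2 contingency counts for a candidate rule X -> Y (toy numbers).
#            Y    not Y
table = [[60,     40],    # X
         [45,     55]]    # not X

n_candidates = 10_000     # size of the explored pattern space (assumed known)
alpha = 0.05

_, p = fisher_exact(table, alternative="greater")
adjusted_alpha = alpha / n_candidates   # Bonferroni: experimentwise error <= alpha

print(f"raw p = {p:.4g}, per-test threshold = {adjusted_alpha:.2g}, "
      f"significant: {p <= adjusted_alpha}")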
46
Webb, Geoffrey I. (2011). Filtered-top-k association discovery. WIREs Data Mining and Knowledge Discovery, 1(3):183–192. doi:10.1002/widm.28. PDF. BibTeX.
@article{webb_filtered-top-k_2011,
    author = {Webb, Geoffrey I.},
    title = {Filtered-top-k association discovery},
    volume = {1},
    doi = {10.1002/widm.28},
    pages = {183--192},
    number = {3},
    journal = {{WIREs} Data Mining and Knowledge Discovery},
    year = {2011}
}
Abstract.
Association mining has been one of the most intensively researched areas of data mining. However, direct uptake of the resulting technologies has been relatively low. This paper examines some of the reasons why the dominant paradigms in association mining have not lived up to their promise, and argues that a powerful alternative is provided by top-k techniques coupled with appropriate statistical and other filtering.
Topics: interest measures.
47
Webb, Geoffrey I. and Jilles Vreeken (2013). Efficient discovery of the most interesting associations. ACM Transactions on Knowledge Discovery from Data (TKDD), 8(3):15:1–15:31. doi:10.1145/2601433. PDF. BibTeX.
@article{webb_efficient_2013,
    author = {Webb, Geoffrey I. and Vreeken, Jilles},
    title = {Efficient Discovery of the Most Interesting Associations},
    volume = {8},
    doi = {10.1145/2601433},
    pages = {15:1--15:31},
    number = {3},
    journal = {{ACM} Transactions on Knowledge Discovery from Data ({TKDD})},
    year = {2013}
}
Abstract.
Self-sufficient itemsets have been proposed as an effective approach to summarizing the key associations in data. However, their computation appears highly demanding, as assessing whether an itemset is self-sufficient requires consideration of all pairwise partitions of the itemset into pairs of subsets as well as consideration of all supersets. This article presents the first published algorithm for efficiently discovering self-sufficient itemsets. This branch-and-bound algorithm deploys two powerful pruning mechanisms based on upper bounds on itemset value and statistical significance level. It demonstrates that finding top-k productive and nonredundant itemsets, with postprocessing to identify those that are not independently productive, can efficiently identify small sets of key associations. We present extensive evaluation of the strengths and limitations of the technique, including comparisons with alternative approaches to finding the most interesting associations.
Topics: interest measures.
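Example.
A hedged sketch of a productivity check in the spirit of self-sufficient itemsets: for every binary partition of the itemset, test whether the joint support is significantly higher than expected if the two parts were independent. The one-sided Fisher test below is a simplified reading of the test, not the authors' exact procedure or their branch-and-bound search.
from itertools import combinations
from scipy.stats import fisher_exact

def is_productive(itemset, transactions, alpha=0.05):
    """Require every binary partition {A, B} of the itemset to show a
    significantly positive association between A and B."""
    n = len(transactions)
    items = sorted(itemset)
    for r in range(1, len(items)):
        for left in combinations(items, r):
            a, b = frozenset(left), itemset - frozenset(left)
            both = sum(1 for t in transactions if (a <= t) and (b <= t))
            a_only = sum(1 for t in transactions if (a <= t) and not (b <= t))
            b_only = sum(1 for t in transactions if (b <= t) and not (a <= t))
            neither = n - both - a_only - b_only
            _, p = fisher_exact([[both, a_only], [b_only, neither]],
                                alternative="greater")
            if p > alpha:
                return False   # some partition fails: not productive
    return True

# With so few transactions significance is rarely reached; the call just
# illustrates the interface.
transactions = [frozenset(t) for t in
                (["a", "b", "c"], ["a", "b", "c"], ["a", "b"], ["a", "c"],
                 ["b", "c"], ["a"], ["b"], ["c"], [], [])]
print(is_productive(frozenset(["a", "b", "c"]), transactions))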
48
Wille, Rudolf (2005). Formal concept analysis as mathematical theory of concepts and concept hierarchies. In Formal Concept Analysis: Foundations and Applications, Lecture Notes in Computer Science, pages 1–33. Springer. doi:10.1007/11528784_1. PDF. BibTeX.
@incollection{wille_formal_2005,
    author = {Wille, Rudolf},
    title = {Formal Concept Analysis as Mathematical Theory of Concepts and Concept Hierarchies},
    series = {Lecture Notes in Computer Science},
    pages = {1--33},
    booktitle = {Formal Concept Analysis: Foundations and Applications},
    publisher = {Springer},
    year = {2005},
    doi = {10.1007/11528784_1}
}
Abstract.
Formal Concept Analysis has been originally developed as a subfield of Applied Mathematics based on the mathematization of concept and concept hierarchy. Only after more than a decade of development, the connections to the philosophical logic of human thought became clearer and even later the connections to Piaget’s cognitive structuralism which Thomas Bernhard Seiler convincingly elaborated to a comprehensive theory of concepts in his recent book [Se01]. It is the main concern of this paper to show the surprisingly rich correspondences between Seiler’s multifarious aspects of concepts in the human mind and the structural properties and relationships of formal concepts in Formal Concept Analysis. These correspondences make understandable what has been experienced in a great multitude of applications: that Formal Concept Analysis may function in the sense of transdisciplinary mathematics, i.e., it allows mathematical thought to aggregate with other ways of thinking and thereby to support human thought and action.
Topics: FCA.
49
Wille, Rudolf (2008). Formal concept analysis as applied lattice theory. In Proceedings of the International Conference on Concept Lattices and Their Applications, CLA '06, Lecture Notes in Computer Science, 42–67. Springer. doi:10.1007/978-3-540-78921-5_3. PDF. BibTeX.
@inproceedings{wille_formal_2008,
    author = {Wille, Rudolf},
    title = {Formal Concept Analysis as Applied Lattice Theory},
    doi = {10.1007/978-3-540-78921-5_3},
    series = {Lecture Notes in Computer Science},
    pages = {42--67},
    booktitle = {Proceedings of the International Conference on Concept Lattices and Their Applications, {CLA} '06},
    publisher = {Springer},
    year = {2008}
}
Abstract.
Formal Concept Analysis is a mathematical theory of concept hierarchies which is based on Lattice Theory. It has been developed to support humans in their thought and knowledge. The aim of this paper is to show how successful the lattice-theoretic foundation can be in applying Formal Concept Analysis across a wide range of settings. This is demonstrated in three sections dealing with representation, processing, and measurement of conceptual knowledge. Finally, further relationships between abstract Lattice Theory and Formal Concept Analysis are briefly discussed.
Topics: FCA.
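Example.
The machinery both Wille chapters build on can be shown on a toy formal context: the derivation (prime) operators between object sets and attribute sets, and formal concepts as the pairs (A, B) with A′ = B and B′ = A. The sketch below uses brute-force enumeration over attribute subsets and an invented toy context, suitable only for small examples.
from itertools import combinations

# Toy formal context: object -> set of attributes.
context = {
    "duck":  {"flies", "swims", "lays_eggs"},
    "eagle": {"flies", "lays_eggs"},
    "trout": {"swims", "lays_eggs"},
    "cat":   set(),
}
attributes = set().union(*context.values())

def objects_of(attrs):            # B': objects having all attributes in B
    return {o for o, a in context.items() if attrs <= a}

def attributes_of(objs):          # A': attributes shared by all objects in A
    return set.intersection(*(context[o] for o in objs)) if objs else set(attributes)

# Every closed attribute set arises as B'' for some attribute subset B,
# so enumerating all subsets yields every formal concept (extent, intent).
concepts = set()
for k in range(len(attributes) + 1):
    for cand in combinations(sorted(attributes), k):
        extent = objects_of(set(cand))
        intent = attributes_of(extent)
        concepts.add((frozenset(extent), frozenset(intent)))

for extent, intent in sorted(concepts, key=lambda c: -len(c[0])):
    print(sorted(extent), sorted(intent))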
50
Wrobel, Stefan (1997). An algorithm for multi-relational discovery of subgroups. In Proceedings of the First European Symposium on Principles of Data Mining and Knowledge Discovery, PKDD '97, volume 1263 of Lecture Notes in Computer Science, 78–87. Springer. doi:10.1007/3-540-63223-9_108. PDF. BibTeX.
@inproceedings{wrobel_algorithm_1997,
    author = {Wrobel, Stefan},
    title = {An Algorithm for Multi-relational Discovery of Subgroups},
    volume = {1263},
    doi = {10.1007/3-540-63223-9_108},
    series = {Lecture Notes in Computer Science},
    pages = {78--87},
    booktitle = {Proceedings of the First European Symposium on Principles of Data Mining and Knowledge Discovery, {PKDD} '97},
    publisher = {Springer},
    year = {1997}
}
Abstract.
We consider the problem of finding statistically unusual subgroups in a multi-relation database, and extend previous work on single-relation subgroup discovery. We give a precise definition of the multi-relation subgroup discovery task, propose a specific form of declarative bias based on foreign links as a means of specifying the hypothesis space, and show how propositional evaluation functions can be adapted to the multi-relation setting. We then describe an algorithm for this problem setting that uses optimistic estimate and minimal support pruning, an optimal refinement operator and sampling to ensure efficiency and can easily be parallelized.
Topics: SD, IIa.
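Example.
The propositional core that the paper lifts to the multi-relational setting scores a subgroup description by how much the target share inside the subgroup deviates from the population, e.g. by weighted relative accuracy, WRAcc(S) = P(S) · (P(target | S) − P(target)). The sketch below evaluates toy propositional descriptions only; the foreign-link declarative bias and the optimistic-estimate pruning from the paper are not reproduced.
def wracc(rows, condition, target):
    """Weighted relative accuracy: generality times lift in target share."""
    n = len(rows)
    subgroup = [r for r in rows if condition(r)]
    if not subgroup:
        return 0.0
    p_s = len(subgroup) / n
    p_t = sum(target(r) for r in rows) / n
    p_t_given_s = sum(target(r) for r in subgroup) / len(subgroup)
    return p_s * (p_t_given_s - p_t)

# Toy table: (age, income, responded_to_offer)
rows = [(25, "low", False), (34, "high", True), (45, "high", True),
        (52, "high", False), (23, "low", False), (40, "medium", True)]

print(wracc(rows, condition=lambda r: r[1] == "high", target=lambda r: r[2]))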