Clustering Methods

Clustering Methods (5 cp) 3621552

Course description

Clustering is a basic tool used in data analysis, pattern recognition and machine learning for finding groups in data. K-means is still the most popular algorithm in clustering. But is it good enough? How to decide how many clusters? What if the data is non-numerical like categorical, graph, text or more complex objects like GPS trajectories? Outliers, noise and missing values also degrade the clustering performance so how to deal with these problems? Besides these problems, clustering problem is like other optimization problems. It consists of the following three main design problems: (1) define distance function suitable for the data, (2) select cost function to measure goodness of the clusters, (3) design algorithm to optimize for the cost function.

Teaching methods

Course will be arranged as a series of (1) Youtube video lectures; (2) exercises every Friday; (3) Series of mini-exams (to be implemented later) or classical 4 hour offline exam. Students will also be required to implement clustering program that will be gradually extended during the exercises. Suitable programming languages are Python, C, C++, C#, Java, JavaScript, R, Matlab, PHP, Go, Ruby.

Teachers

Lecturer: Pasi Fränti
Course assistant: Jimi Tuononen

Intro

Schedule

Start-up lecture Tuesday 16.1. at 14-16 Teams
Video lectures (~28h): Youtube
Exercises (7 or 8): Tuesday 10-12: Teams

16.1. Introduction to clustering (START-UP lecture on-line)

Exercise sessions:
23.1. Introduction to clustering and terminology
30.1. K-means, Fast k-means, Random swap
6.2. Graph clustering, Mumford-Shah k-means
13.2. Cost functions, text clustering, clustering of web pages
20.2. Clustering evaluation, outlier detection
27.2. Number of clusters, location-based data
5.3. Divisive clustering, Genetic algorithm
12.3. Either "Density peaks, case study" or "Agglomerative clustering (on-line + discussion)"

The date indicates that the topic will be discussed on that day's exercise session. The given material (video and powerpoint) and exercises should be studied before that day.

Video lectures

Intro (ppt): part 1, part 2
K-means (ppt)
Fast k-means (ppt)
Random swap (ppt)
Mumford-Shah k-means (ppt)
Graph clustering (ppt)
Cost functions: (ppt) part 1, part 2
Text clustering
Clustering web pages (ppt)
Cluster evaluation (ppt)
Centroid Index (ppt) (pdf)
Mean-shift Outlier detection (ppt)
Outlier detection (ppt)
Number of clusters: (ppt) part 1, part 2
Location-based data (ppt)
Agglomerative clustering (ppt)
Divisive algorithms (ppt) (pdf)
Genetic algorithm (ppt)
Density peaks (ppt)
Case study (ppt)

All lectures in YouTube

Exercises

Exercise 1
Exercise 2
Exercise 3
Exercise 4
Exercise 5
Exercise 6
Exercise 7

Submit your exercises in Elearn (key: "clustering")

Preliminary knowledge

Design & Analysis of Algorithms

Exams

15.3. 12-16, Room M100 (Joensuu), Room MD100 (Kuopio)
26.4. 12-16, Room M100 (Joensuu), Room MD100 (Kuopio)

Links

Clustering data
Software
Clusterator
Animator
Lectures Notes and material from 2014