Mopsi 2500 Photos

The dataset consists of 2500 photo descriptions. From these, 15 clusters were extracted manually by selecting the photos
whose descriptions match the chosen keywords. The clusters contain 389 photos in total. Examples are shown below. If an algorithm knew
the search keywords, it could reconstruct the clusters almost perfectly. Without these words, however, the dataset can serve as a benchmark for testing clustering algorithms.

Some notes:
(1) Besides the description, each photo also has a location;
(2) Some photos could belong to several clusters;
(3) The data may contain more clusters than the 15 selected;
(4) The photos shown below are not always the same as in the data (the keyword is the same, but the location may differ);
(5) The original paper used only 180 photos, but we no longer remember how this subset was selected.

Summary of the 15 clusters

The table below shows the statistics of the 15 clusters. The clusters were created by selecting the photos that matched
one of the keywords, with some exceptions. For example, "Ice cream" was not included in the Ice cluster (12).

Cluster	Photos	Keyword
1	33	Maisem
2	8	Hiihto
3	15	Juoksu
4	107	Talo
5	6	Auto
6	43	Hotelli / Hotel
7	53	Kahvi / Kahvila / Cafe / Kafe / Kaffe
8	45	Street
9	9	Shakki / Chess
10	9	Ravintola
11	2	Avanto
12	4	Ice
13	15	Garden
14	23	Kirkko
15	17	Lake
Total	389
-1	2111	Non-clustered
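
The keyword-based selection behind the table can be sketched roughly as follows. The keywords and the "Ice cream" exception come from this page; the matching rule (case-insensitive word-prefix match) and the names `CLUSTER_KEYWORDS`, `EXCLUSIONS`, and `match_clusters` are assumptions made for illustration, not the procedure actually used to build the clusters.

```python
import re

# Cluster index -> keywords, copied from the summary table (lowercased).
CLUSTER_KEYWORDS = {
    1: ["maisem"], 2: ["hiihto"], 3: ["juoksu"], 4: ["talo"], 5: ["auto"],
    6: ["hotelli", "hotel"], 7: ["kahvi", "kahvila", "cafe", "kafe", "kaffe"],
    8: ["street"], 9: ["shakki", "chess"], 10: ["ravintola"], 11: ["avanto"],
    12: ["ice"], 13: ["garden"], 14: ["kirkko"], 15: ["lake"],
}

# Phrases that block a match. Only "ice cream" is named in the text;
# any other exceptions are unknown.
EXCLUSIONS = {12: ["ice cream"]}


def match_clusters(description):
    """Return the clusters whose keywords match, or [-1] if none do.

    Assumption: a keyword matches when some word of the description
    starts with it, ignoring case.
    """
    text = description.lower()
    words = re.findall(r"\w+", text)
    hits = []
    for cluster, keywords in CLUSTER_KEYWORDS.items():
        if any(phrase in text for phrase in EXCLUSIONS.get(cluster, [])):
            continue  # an exception applies, skip this cluster
        if any(w.startswith(k) for w in words for k in keywords):
            hits.append(cluster)
    return hits or [-1]
```

A description can match several clusters, which is consistent with note (2); a description with no match gets the non-clustered index -1.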

Example how to use

The data has been used to measure clustering accuracy with agglomerative clustering that maximizes the total
pairwise similarity within the clusters. Various string similarity measures were compared to see which one
provides the most accurate clustering. Accuracy was measured with the centroid similarity index (0%: worst, 100%: perfect).
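
A minimal sketch of such a clustering is given here, under stated assumptions. The string measure (Jaccard similarity of character bigrams) is just one possible choice, and the greedy merge rule (join the pair of clusters that adds the most to the total within-cluster similarity) is one literal reading of the objective; neither is confirmed to be what the original experiments used. The function names are hypothetical.

```python
from itertools import combinations


def bigrams(text):
    """Set of character bigrams of the lowercased text."""
    s = text.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def similarity(a, b):
    """Jaccard similarity of bigram sets (an assumed string measure)."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)


def agglomerate(texts, k):
    """Greedily merge clusters until k remain.

    Returns a list of clusters, each a list of indices into texts.
    """
    n = len(texts)
    sim = [[similarity(texts[i], texts[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]
    while len(clusters) > max(k, 1):
        def gain(pair):
            # Similarity added to the within-cluster total by this merge.
            a, b = pair
            return sum(sim[i][j] for i in clusters[a] for j in clusters[b])

        a, b = max(combinations(range(len(clusters)), 2), key=gain)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe: combinations guarantees a < b
    return clusters
```

This version recomputes every merge gain in each round, which is acceptable for a few thousand short descriptions only with patience; caching the cluster-to-cluster sums would be the obvious speed-up. Swapping in another string measure only requires replacing `similarity`.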

Clustering results

Clustering prototype tool:

Dataset

The data includes: geolocation, 1st cluster index, 2nd cluster index (if any), text description.
Download data here:

Dataset
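
The four fields listed above could be read into a small record type as sketched below. The file layout is not specified on this page, so everything about it here is an assumption: one photo per line, tab-separated, in the order latitude, longitude, 1st cluster index, 2nd cluster index, description, with an empty or -1 second index meaning "none". The names `Photo` and `parse_line` and the sample line are hypothetical; adjust the separator and field order to the actual file.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Photo:
    lat: float
    lon: float
    cluster1: int            # -1 is taken to mean "non-clustered"
    cluster2: Optional[int]  # None when the photo has no second cluster
    description: str


def parse_line(line, sep="\t"):
    """Parse one record under the assumed layout described above."""
    lat, lon, c1, c2, text = line.rstrip("\n").split(sep, 4)
    second = int(c2) if c2.strip() not in ("", "-1") else None
    return Photo(float(lat), float(lon), int(c1), second, text)


# Made-up sample line, for illustration only.
photo = parse_line("62.60\t29.76\t7\t\tKahvila Aura")
```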

Disclaimer: The data can be freely used for research purposes as long as the paper is cited.
