The dataset consists of 2500 photo descriptions. From these, 15 clusters were extracted manually by selecting the photos
whose descriptions match given keywords. The clusters contain 389 photos in total. Examples are shown below. If an algorithm
knew the search keywords, it could reconstruct the clusters almost perfectly. Without that knowledge, the dataset can serve
as a benchmark for testing clustering algorithms.
Some notes:
(1) Besides the description, each photo also has a location;
(2) Some of the photos could belong to several clusters;
(3) The data may contain more clusters than the 15 selected;
(4) The photos shown below are not always the same as those in the data (the keyword is the same but the location may differ);
(5) In the original paper only 180 photos were used, but we no longer remember how this subset was selected.
The table below shows the statistics of the 15 clusters. The clusters were created by selecting the photos that matched
one of the keywords, with some exceptions. For example, "Ice cream" was not included in the Ice cluster (12).
Cluster | Photos | Keyword |
---|---|---|
1 | 33 | Maisem |
2 | 8 | Hiihto |
3 | 15 | Juoksu |
4 | 107 | Talo |
5 | 6 | Auto |
6 | 43 | Hotelli / Hotel |
7 | 53 | Kahvi / Kahvila / Cafe / Kafe / Kaffe |
8 | 45 | Street |
9 | 9 | Shakki / Chess |
10 | 9 | Ravintola |
11 | 2 | Avanto |
12 | 4 | Ice |
13 | 15 | Garden |
14 | 23 | Kirkko |
15 | 17 | Lake |
Total | 389 | |
-1 | 2111 | Non-clustered |
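The keyword-based selection described above the table can be illustrated with a minimal sketch. It is not the procedure actually used (the clusters were built manually); the names `CLUSTER_KEYWORDS`, `EXCLUDED_PHRASES` and `match_clusters` are hypothetical, only three of the 15 clusters are listed, and matching a keyword at the start of a word is an assumption.

```python
import re

# Hypothetical keyword table: cluster index -> keywords (copied from the table above).
CLUSTER_KEYWORDS = {
    7: ["kahvi", "kahvila", "cafe", "kafe", "kaffe"],
    9: ["shakki", "chess"],
    12: ["ice"],
}

# Hypothetical exclusion list modelling the manual exceptions, e.g. "Ice cream".
EXCLUDED_PHRASES = {12: ["ice cream"]}


def match_clusters(description):
    """Return the sorted cluster indices whose keywords occur in the description."""
    text = description.lower()
    matched = []
    for cluster, keywords in CLUSTER_KEYWORDS.items():
        # Skip the cluster if one of its excluded phrases is present.
        if any(phrase in text for phrase in EXCLUDED_PHRASES.get(cluster, [])):
            continue
        # Assumption: a keyword must begin at a word boundary,
        # so "ice" does not match inside "office".
        if any(re.search(r"\b" + re.escape(k), text) for k in keywords):
            matched.append(cluster)
    return sorted(matched)
```

A description can match more than one cluster, which corresponds to note (2); an empty result corresponds to the non-clustered photos (index -1).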
The data has been used to measure clustering accuracy with agglomerative clustering that maximizes the total
pairwise similarity within the clusters. Various string similarity measures were tested to see which one
provides the most accurate clustering. Accuracy was measured by the centroid similarity index
(0%: worst, 100%: perfect).
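As a rough illustration of the clustering setup, the sketch below greedily merges the pair of clusters that adds the most to the total within-cluster pairwise similarity. It is one possible reading of the criterion, not the implementation from the paper. The similarity function is a stand-in based on Python's `difflib.SequenceMatcher`, not one of the measures that were evaluated, and the names `similarity` and `agglomerative` are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Stand-in string similarity in [0, 1]; not a measure from the paper."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def agglomerative(texts, k):
    """Greedily merge clusters until k remain; returns lists of indices into texts."""
    n = len(texts)
    # Precompute the full pairwise similarity matrix.
    sim = [[similarity(texts[i], texts[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]
    while len(clusters) > k:
        # Gain of a merge = sum of cross-cluster similarities, which equals
        # the increase in total within-cluster pairwise similarity.
        def gain(pair):
            a, b = pair
            return sum(sim[i][j] for i in clusters[a] for j in clusters[b])

        a, b = max(combinations(range(len(clusters)), 2), key=gain)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return clusters


if __name__ == "__main__":
    print(agglomerative(["kahvila", "kahvi", "chess club", "chess"], k=2))
```

Because the criterion sums similarities rather than averaging them, larger clusters tend to attract further merges; swapping `gain` for an average would give a different trade-off.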
The data includes: geolocation, 1st cluster index, 2nd cluster index (if any), text description.
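A record with these fields could be read along the following lines. The file layout is not specified here, so the tab-separated order (latitude, longitude, 1st cluster index, 2nd cluster index, description) is purely an assumption, as are the names `Photo` and `parse_line`; adjust them to the actual file.

```python
from dataclasses import dataclass
from typing import Optional

NON_CLUSTERED = -1  # index used for non-clustered photos in the table above


@dataclass
class Photo:
    lat: float
    lon: float
    cluster1: int             # 1st cluster index, or -1 if non-clustered
    cluster2: Optional[int]   # 2nd cluster index, None if absent
    description: str


def parse_line(line):
    """Parse one record under the ASSUMED layout:
    lat, lon, cluster1, cluster2, description (tab-separated)."""
    lat, lon, c1, c2, text = line.rstrip("\n").split("\t", 4)
    # Assumption: a missing 2nd cluster is stored as an empty field or as -1.
    second = int(c2) if c2.strip() not in ("", str(NON_CLUSTERED)) else None
    return Photo(float(lat), float(lon), int(c1), second, text)
```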
Download data here:
Disclaimer: The data can be freely used for research purposes as long as the paper is cited.