Photo description clusters




The dataset consists of 2500 photo descriptions. From these, 15 clusters were extracted manually by selecting the photos
whose descriptions match chosen keywords. The clusters contain 389 photos in total. Examples are shown below. If a clustering
algorithm knew the search keywords, it could reconstruct the clusters almost perfectly. Without that knowledge, the dataset can serve as a benchmark for testing clustering algorithms.

Some notes:
(1) Besides the description, each photo also has a location;
(2) Some photos could belong to several clusters;
(3) The data may contain more clusters than the 15 selected;
(4) The photos shown below are not always the same as those in the data (the keyword is the same but the location may differ);
(5) The original paper used only 180 photos, but we no longer remember how this subset was selected.

[Images: 15 example photos]

Summary of the 15 clusters

The table below shows the statistics of the 15 clusters. The clusters were created by selecting the photos that matched
one of the keywords, with some exceptions. For example, "Ice cream" was not included in the Ice cluster (12).

Cluster  Photos  Keyword
1        33      Maisem
2        8       Hiihto
3        15      Juoksu
4        107     Talo
5        6       Auto
6        43      Hotelli / Hotel
7        53      Kahvi / Kahvila / Cafe / Kafe / Kaffe
8        45      Street
9        9       Shakki / Chess
10       9       Ravintola
11       2       Avanto
12       4       Ice
13       15      Garden
14       23      Kirkko
15       17      Lake
Total    389
-1       2111    Non-clustered
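
The keyword selection described above can be sketched roughly as follows. This is only an illustration, not the procedure used to build the dataset: the keyword lists are copied from the table, but the matching rule (case-insensitive match at the start of a word) and the names CLUSTER_KEYWORDS, EXCEPTIONS and assign_clusters are assumptions made for the example. Only the "Ice cream" exception is taken from the text; any other exceptions are unknown.

```python
import re

# Keywords copied from the summary table (lower-cased).
CLUSTER_KEYWORDS = {
    1: ["maisem"], 2: ["hiihto"], 3: ["juoksu"], 4: ["talo"], 5: ["auto"],
    6: ["hotelli", "hotel"],
    7: ["kahvi", "kahvila", "cafe", "kafe", "kaffe"],
    8: ["street"], 9: ["shakki", "chess"], 10: ["ravintola"],
    11: ["avanto"], 12: ["ice"], 13: ["garden"], 14: ["kirkko"], 15: ["lake"],
}

# Phrases that must not count as a hit for a cluster.
# Only "ice cream" is documented; the mechanism itself is an assumption.
EXCEPTIONS = {12: ["ice cream"]}


def assign_clusters(description):
    """Return the cluster indices whose keywords occur in the description."""
    text = description.lower()
    hits = []
    for cluster, keywords in CLUSTER_KEYWORDS.items():
        cleaned = text
        # Blank out excluded phrases before looking for keywords.
        for phrase in EXCEPTIONS.get(cluster, []):
            cleaned = cleaned.replace(phrase, " ")
        # Assumed rule: a keyword matches at the start of a word.
        if any(re.search(r"\b" + re.escape(k), cleaned) for k in keywords):
            hits.append(cluster)
    return hits


print(assign_clusters("Kahvila by the street"))
print(assign_clusters("Ice cream at the market"))
```

A description can match several clusters, which is consistent with note (2). A description that matches no keyword gets an empty list, corresponding to the non-clustered group (-1).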

Example how to use

The data has been used to measure clustering accuracy using agglomerative clustering that maximizes the total
pairwise similarity within the clusters. Various string similarity measures were compared to see which one
provides the most accurate clustering. The centroid similarity index was used to measure accuracy (0%: worst, 100%: perfect).
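
A minimal sketch of agglomerative clustering on text descriptions is given below. It is not the authors' implementation. Two substitutions are made and should be read as assumptions: the string similarity is Python's difflib.SequenceMatcher ratio (one possible measure among many), and the merge rule is average linkage, that is, the pair of clusters with the highest mean pairwise similarity is merged at each step. The function names similarity and agglomerate are hypothetical, and the accuracy index is not computed here.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """String similarity in [0, 1]; an assumed stand-in measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def agglomerate(texts, k):
    """Merge clusters greedily until k remain; returns lists of indices."""
    clusters = [[i] for i in range(len(texts))]
    # Precompute all pairwise similarities once.
    sim = {
        (i, j): similarity(texts[i], texts[j])
        for i, j in combinations(range(len(texts)), 2)
    }

    def link(a, b):
        # Average similarity over all cross-cluster pairs (average linkage).
        total = sum(sim[min(i, j), max(i, j)] for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        # combinations yields x < y, so deleting index y is safe.
        x, y = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: link(clusters[p[0]], clusters[p[1]]),
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


texts = ["kahvila", "kahvila keskusta", "lake view", "lake view sunset"]
print(agglomerate(texts, 2))
```

The sketch recomputes linkage values at every step, which is fine for a few hundred descriptions but would need caching for the full 2500.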

Clustering results

Clustering prototype tool:
Clustering proto

Dataset

The data includes: geolocation, 1st cluster index, 2nd cluster index (if any), text description.
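
Reading one record with these fields could look like the sketch below. The actual file layout is not specified here, so the format is an assumption: one photo per line, tab-separated, in the order latitude, longitude, first cluster index, second cluster index (empty if none), description. The names Photo and parse_line and the sample coordinates are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Photo:
    lat: float
    lon: float
    cluster1: int            # -1 for non-clustered photos
    cluster2: Optional[int]  # None when there is no second cluster
    description: str


def parse_line(line):
    """Parse one line of the assumed tab-separated layout."""
    # Split into at most 5 fields so tabs inside the description survive.
    lat, lon, c1, c2, text = line.rstrip("\n").split("\t", 4)
    second = int(c2) if c2.strip() else None
    return Photo(float(lat), float(lon), int(c1), second, text)


# Sample values are invented, not taken from the dataset.
print(parse_line("62.60\t29.76\t7\t8\tKahvila by the street\n"))
```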
Download data here:

Dataset

Disclaimer: The data can be freely used for research purposes as long as the paper is cited.