Newspaper data set

The data set was collected from several news websites during November and December 2015. It contains two subsets: English and Finnish articles. The ground-truth keywords are the terms chosen by the article editors.


English subset

Sources: Indianexpress (330), Macworld (220), The Guardian (421), University Herald (300).

Topics: Business, cities, entertainment, news, politics, art & culture, sports, health & lifestyle, trending, world, technology, education, environment, media, finance, travel, and others.

[English dataset]


Finnish subset

Sources: Kaksplus (200), Kotiliesi (210), Ruoka.fi (200), Taloussanomat (210), Urheilu (200), Uusi Suomi (200).

Topics: Business, cities, entertainment, news, politics, art & culture, sports, health & lifestyle, trending, world, technology, education, environment, media, finance, travel, and others.

[Finnish dataset]
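
As a quick sanity check after downloading, the per-source figures listed for the two subsets can be summed to get the expected size of each subset. The sketch below is illustrative only: it assumes the numbers in parentheses are article counts, and the variable and function names are hypothetical rather than part of the data set.

```python
# Per-source figures copied from the "Sources" lines of each subset.
# Assumption: each number is the count of articles from that source.
english_sources = {
    "Indianexpress": 330,
    "Macworld": 220,
    "The Guardian": 421,
    "University Herald": 300,
}
finnish_sources = {
    "Kaksplus": 200,
    "Kotiliesi": 210,
    "Ruoka.fi": 200,
    "Taloussanomat": 210,
    "Urheilu": 200,
    "Uusi Suomi": 200,
}


def subset_total(counts):
    """Return the sum of the per-source counts for one subset."""
    return sum(counts.values())


english_total = subset_total(english_sources)
finnish_total = subset_total(finnish_sources)

print("English subset:", english_total)
print("Finnish subset:", finnish_total)
print("Both subsets:", english_total + finnish_total)
```

If the number of articles found in a downloaded subset differs from these totals, the download may be incomplete or the figures may count something other than articles.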




Disclaimer: The data might contain copyrighted material and should only be used for scientific research.