N. Gali, R. Mariescu-Istodor and P. Fränti, "Using linguistic features to automatically extract web page title", Expert Systems with Applications, 79, 296-312, 2017. [pdf]

The data set was collected from Google Maps search results.

Collection details:

Websites:	1002
Language:	English
When:	July 2014 & April 2015
What:	Pages retrieved from search engine results
How:	By entering the search queries: hospital, pharmacy, sport, fitness, swimming pool, bowling alley, spa, sauna, cinema, pub, bar, auto repair, hotel, restaurant, cafe, and pizza
Region:	US, UK, Canada, Australia, New Zealand, Ireland
Issues:	Some level of subjectivity in title tagging is unavoidable

[Dataset (archived .tar.gz)]

[Text files]

[Ground truth for titles]

Mopsi data

Speech and Image Processing Unit

Disclaimer: The data might contain copyrighted material and should be used only for scientific research.