Titler Corpus

Article citation

N. Gali, R. Mariescu-Istodor and P. Fränti, "Using linguistic features to automatically extract web page title", Expert Systems with Applications, 79, 296-312, 2017

More information on the dataset is available here (including a PDF of the article above)

Download corpus archive

.tar.gz archive containing the HTML files (2.1 GB)
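
Since the archive is large, it may be convenient to read the HTML files straight from it rather than unpacking everything first. The following is a minimal sketch in Python, not part of the corpus distribution: the function name iter_html_members is hypothetical, the internal directory layout of the archive is not assumed, and files are recognised only by an .html or .htm extension, which is an assumption about how the corpus names its files.

```python
import tarfile


def iter_html_members(archive_path):
    """Yield (member name, raw bytes) for each HTML file in a tar archive.

    Hypothetical helper: it assumes only that the HTML files carry an
    .html or .htm extension. Mode "r:*" lets tarfile detect the
    compression itself, so the exact compression type need not be known.
    """
    with tarfile.open(archive_path, mode="r:*") as archive:
        for member in archive:
            # Skip directories, links and other non-regular entries.
            if not member.isfile():
                continue
            if not member.name.lower().endswith((".html", ".htm")):
                continue
            handle = archive.extractfile(member)
            if handle is None:
                continue
            with handle:
                # Bytes are returned undecoded; page encodings may vary.
                yield member.name, handle.read()
```

For example, `for name, raw in iter_html_members("corpus.tar.gz"): ...` would walk the pages one at a time, where corpus.tar.gz is a placeholder for wherever the downloaded file was saved. The bytes are left undecoded on purpose, because the character encoding of each page is not stated here.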

Current sites in corpus

These are only the names of the sites; the archive above contains the HTML files themselves.