WebIma is a tool for selecting one representative image from a web page. The method uses simple heuristics based on the following features: image size, aspect ratio, file type, title and alt attributes, location of the image in the HTML structure, and keywords matched in the image path, title, and header tags. The benefits of the method are:
fi (459) com (242) org (38) de (23) co.uk (21) is (19) net (16) hu (16) at (13) cz (11) org.uk (10) no (10) se (7) it (7) fr (7) be (7) ch (6) ie (6) nl (5) ee (5) ru (4) com.au (4) gc.ca (4) ae (3) ro (3) dk (3) ac.uk (3) eu (3) gov (3) org.mx (2) info (2) ca (4) on.ca (2) gob.ar (2) com.br (2) lu (2) gob.pe (1) br (1) edu.hk (1) co.nz (1) gov.eg (1) ac.in (1) gov.in (1) es (1) com.bo (1) com.tr (1) edu.cn (1) ac.za (1) pe (1) ec (1) ac.jp (1) co (1) ip (1) com.sg (1) ac (1) edu.kh (1) bc.ca (1) biz (1) me (1) pl (1) edu.au (1) uk (1) gov.uk (1) gov.at (1)
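To illustrate the kind of rule-based selection described above (image size, aspect ratio, file type, title and alt attributes, location in the HTML structure, and keyword matches), here is a minimal Python sketch. It is not the WebIma implementation: the `ImageCandidate` fields, the `in_main_content` flag, and all weights and thresholds are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Sequence


@dataclass
class ImageCandidate:
    """One <img> found on a page (hypothetical structure, not from WebIma)."""
    src: str
    width: int
    height: int
    alt: str = ""
    title: str = ""
    in_main_content: bool = False  # assumed stand-in for "location in the HTML structure"


# Illustrative assumptions: file types that usually hold photos rather than icons.
PREFERRED_TYPES = (".jpg", ".jpeg", ".png")


def score(img: ImageCandidate, keywords: Sequence[str]) -> float:
    """Add up simple heuristic points; all weights are assumed, not from the source."""
    points = 0.0
    # Image size: very small images tend to be icons or spacers.
    if img.width * img.height >= 10_000:
        points += 1.0
    # Aspect ratio: very wide or very tall images tend to be banners.
    if img.height > 0 and 0.5 <= img.width / img.height <= 2.0:
        points += 1.0
    # File type, judged from the path without any query string.
    if img.src.lower().split("?")[0].endswith(PREFERRED_TYPES):
        points += 1.0
    # Title and alt attributes: described images are more likely to be content.
    if img.alt.strip() or img.title.strip():
        points += 0.5
    # Location in the HTML structure.
    if img.in_main_content:
        points += 0.5
    # Keywords (e.g. taken from the page title or headers) matched in path, alt, title.
    text = " ".join((img.src, img.alt, img.title)).lower()
    points += 0.5 * sum(1 for k in keywords if k.lower() in text)
    return points


def pick_representative(
    images: Iterable[ImageCandidate], keywords: Sequence[str]
) -> Optional[ImageCandidate]:
    """Return the highest-scoring candidate, or None if the page has no images."""
    return max(images, key=lambda i: score(i, keywords), default=None)
```

With these assumed weights, a 640x480 JPEG with an alt text that matches a page keyword outranks a 728x90 GIF banner or a 16x16 icon. In practice the weights would have to be tuned against labelled data such as this dataset.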
The dataset contains only 860 files. The websites were not downloaded at the same time as the images were collected. As a result, some websites had changed their image content by the time of downloading, and those were removed from the set. Other websites do not permit downloading. Two Excel sheets provided with the data files list these websites.
WebIma 2.0 was generated later in 2021 from the subsets with OSM additions.
Disclaimer: The data might contain copyrighted material and should not be used for any purpose other than data analysis for scientific research.