WebIma is a tool for selecting one representative image from a web page. The method uses simple heuristics based on the following features: image size, aspect ratio, file type, title and alt attributes, location of the image in the HTML structure, and keyword matches in the image path, title, and header tags. The benefits of the method are:

Lightweight: does not need to download images

Unsupervised: no training needed

Performance: better than Facebook and Google+ (as tested in September 2014)

More details can be found in the paper:

[PAPER] [PPT]

WebIma demo:

Demo Tool

WebIma dataset (2014):

Websites: 1002

Images: 2363

Per page: Min=1, Average=2.36, Max=154

Domains: 64

fi (459) com (242) org (38) de (23) co.uk (21) is (19) net (16) hu (16) at (13) cz (11) org.uk (10) no (10) se (7) it (7) fr (7) be (7) ch (6) ie (6) nl (5) ee (5) ru (4) com.au (4) gc.ca (4) ae (3) ro (3) dk (3) ac.uk (3) eu (3) gov (3) org.mx (2) info (2) ca (4) on.ca (2) gob.ar (2) com.br (2) lu (2) gob.pe (1) br (1) edu.hk (1) co.nz (1) gov.eg (1) ac.in (1) gov.in (1) es (1) com.bo (1) com.tr (1) edu.cn (1) ac.za (1) pe (1) ec (1) ac.jp (1) co (1) ip (1) com.sg (1) ac (1) edu.kh (1) bc.ca (1) biz (1) me (1) pl (1) edu.au (1) uk (1) gov.uk (1) gov.at (1)

Collection details:

Who: 117 volunteers

When: September 2014

What: Pages of own choice or Mopsi search

How: Select 1-3 most representative images

Issues: Some level of subjectivity unavoidable

[Ground truth]

[Dataset]

The dataset contains only 860 files. The websites were not downloaded at the same time as the images were collected, so some websites had changed their image content by the time of downloading; those have been removed from the set. Other websites do not permit downloading. Two Excel sheets provided with the data files list these websites.

WebIma 2.0 was generated later in 2021 from the subsets with OSM additions.

Example: Ravintola Kreeta

Some of the images found

Ground truth

Data collection:

Data collection tool

Disclaimer: The data might contain copyrighted material and should not be used for any purpose other than data analysis for scientific research.