Extracting Representative Image from Web Page
Najlah Gali, Andrea Tabarcea, Pasi Fränti
Int. Conf. on Web Information Systems & Technologies (WEBIST'15), 411-419, Lisbon, Portugal, May 2015.

WebIma is a tool for selecting one representative image from a web page. The method uses simple heuristics based on the following features: image size, aspect ratio, file type, title and alt attributes, location of the image in the HTML structure, and matching keywords in the image path, title and header tags. The benefits of the method are:

  • Lightweight: does not need to download images
  • Unsupervised: no training needed
  • Performance: better than Facebook and Google+ (as tested in September 2014)
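
The kind of heuristic listed above can be illustrated with a minimal sketch. It is not the WebIma implementation: the weights, thresholds, helper names (`ImageCollector`, `score`, `pick_representative`) and the sample page are all hypothetical, and "location in the HTML structure" is reduced here to the assumption that an image inside an h1 or h2 tag is more relevant. The sketch reads only HTML attributes, so no image is downloaded.

```python
import os
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

HEADING_TAGS = ("title", "h1", "h2")

def words(text):
    """Lowercase alphanumeric tokens of three or more characters."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) >= 3}

def to_int(value):
    """Parse values such as '300' or '300px'; return 0 when unknown."""
    match = re.match(r"\s*(\d+)", value or "")
    return int(match.group(1)) if match else 0

class ImageCollector(HTMLParser):
    """Collects <img> attributes and keywords from <title>, <h1>, <h2>."""

    def __init__(self):
        super().__init__()
        self.images = []
        self.keywords = set()
        self._depth = 0  # > 0 while inside a title or heading tag

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._depth += 1
        elif tag == "img":
            info = {name: (value or "") for name, value in attrs}
            info["in_heading"] = self._depth > 0
            self.images.append(info)

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.keywords |= words(data)

def score(img, keywords):
    """Hypothetical scoring; every weight below is an assumption."""
    src = img.get("src", "")
    width, height = to_int(img.get("width")), to_int(img.get("height"))
    total = 0.0
    if width and height:
        total += 2 if width * height >= 100 * 100 else -2  # image size
        ratio = max(width, height) / min(width, height)
        total += 1 if ratio <= 2 else -1  # aspect ratio (penalize banners)
    ext = os.path.splitext(urlparse(src).path)[1].lower()
    if ext in (".jpg", ".jpeg"):  # file type
        total += 1
    elif ext in (".gif", ".svg"):
        total -= 1
    text = img.get("alt", "") + " " + img.get("title", "")
    if text.strip():  # title and alt attributes present
        total += 1
    if img["in_heading"]:  # location in the HTML structure
        total += 1
    total += len(keywords & words(src + " " + text))  # keyword matches
    return total

def pick_representative(html):
    """Return the src of the highest-scoring image, or None if no images."""
    parser = ImageCollector()
    parser.feed(html)
    parser.close()
    if not parser.images:
        return None
    best = max(parser.images, key=lambda img: score(img, parser.keywords))
    return best.get("src")

# Hypothetical page used only to exercise the sketch.
SAMPLE_HTML = """<html><head><title>Example Cafe</title></head><body>
<img src="/img/spacer.gif" width="1" height="1">
<img src="/img/banner_ad.png" width="728" height="90" alt="ad">
<h1>Cafe terrace</h1>
<img src="/photos/cafe-terrace.jpg" width="640" height="480" alt="Cafe terrace">
</body></html>"""

if __name__ == "__main__":
    print(pick_representative(SAMPLE_HTML))
```

On the sample page the spacer and the banner are penalized for size and aspect ratio, while the photograph gains from its size, file type, alt text and keyword overlap with the title and heading. Images that declare no width or height simply receive no size or ratio score in this sketch.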

    More details can be found in the paper:
    [PAPER] [PPT]

    WebIma demo:
    Demo Tool

    WebIma dataset (2014):
    Websites:   1002
    Images:      2363
    Per page:    Min=1,  Average=2.36,  Max=154
    Domains:    64

    fi (459) com (242) org (38) de (23) co.uk (21) is (19) net (16) hu (16) at (13) cz (11) org.uk (10) no (10) se (7) it (7) fr (7) be (7) ch (6) ie (6) nl (5) ee (5) ru (4) com.au (4) gc.ca (4) ae (3) ro (3) dk (3) ac.uk (3) eu (3) gov (3) org.mx (2) info (2) ca (4) on.ca (2) gob.ar (2) com.br (2) lu (2) gob.pe (1) br (1) edu.hk (1) co.nz (1) gov.eg (1) ac.in (1) gov.in (1) es (1) com.bo (1) com.tr (1) edu.cn (1) ac.za (1) pe (1) ec (1) ac.jp (1) co (1) ip (1) com.sg (1) ac (1) edu.kh (1) bc.ca (1) biz (1) me (1) pl (1) edu.au (1) uk (1) gov.uk (1) gov.at (1)

    Collection details:
    Who:         117 volunteers
    When:       September 2014
    What:        Pages of own choice or Mopsi search
    How:         Select 1-3 most representative images
    Issues:       Some level of subjectivity unavoidable

    [Ground truth]

    [Dataset]

    The dataset contains only 860 files. The websites were not downloaded at the same time as the images were collected, so some websites had changed their image content by the time of downloading; these were removed from the set. Other websites prohibit downloading. Two Excel sheets provided with the data files list these websites.

    WebIma 2.0 was generated later, in 2021, from the subsets with OSM additions.

    Example: Ravintola Kreeta

    Some of the images found

    Ground truth

    Data collection:
    Data collection tool


    Disclaimer: The data might contain copyrighted material and should not be used for any purpose other than data analysis for scientific research.