Fränti, P., Järviö, J., Salimi, M., Taipale, I., Laitinen, M., Albicker, R., Nie, C., Fatemi, M., & Rautionaho, P. (2025). Beyond names: How to label gender automatically in CMC data? 12th International Conference on CMC and Social Media Corpora for the Humanities. CMC-Corpora, University of Bayreuth, Germany.
Code available at https://github.com/comet-uef/
Download comet-gender-dataset.jsonl.gz
Note: The compressed file is about 30 MB; the uncompressed file is larger than 100 MB.
The dataset is in JSONL format: each line is a JSON object for one user. The file contains about 343k lines.
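Because every line is a standalone JSON object, the file can be read one user at a time without decompressing it to disk first. The following is a minimal sketch using only the Python standard library; the helper name `iter_users` is hypothetical, and the path is assumed to point at a local copy of the downloaded file.

```python
import gzip
import json
from typing import Iterator


def iter_users(path: str) -> Iterator[dict]:
    """Yield one user record (a dict) per non-empty line of a gzipped JSONL file."""
    # "rt" opens the gzip stream in text mode, so lines arrive as str.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines, e.g. a trailing newline
                yield json.loads(line)


# Example usage (assumes the file was downloaded to the working directory):
# for user in iter_users("comet-gender-dataset.jsonl.gz"):
#     print(user["id"], user.get("predicted_gender"))
```

Streaming line by line keeps memory use low, which matters for a file that is larger than 100 MB once uncompressed.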
Main fields:
id: Unique user ID. This is not the Twitter ID given by the API.
name: Profile name.
description: Profile description.
location: Profile location. Free-text field, not an actual geographical location.
predicted_gender: Final predicted gender, the combined result of the different classification methods.
has_names, has_pronouns, has_keywords, has_manual_label: Flags for the available classification types.
name_classification: Name-based gender prediction details.
pronoun_classification: Pronoun-based gender prediction details.
keyword_classification: Keyword-based gender prediction details.
manual_classification: Manual label details.
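As an illustration of how the fields listed above might be used, the sketch below tallies the values of `predicted_gender` and counts how many users have each `has_*` flag set. It rests on assumptions that the field list does not state: that the flags are stored as booleans (or other truthy/falsy values), and that `predicted_gender` holds a simple hashable value such as a string. The function name `summarize` and the sample records are made up for the example and are not taken from the dataset.

```python
from collections import Counter
from typing import Iterable, Tuple

# Flag field names as given in the field list.
FLAG_FIELDS = ("has_names", "has_pronouns", "has_keywords", "has_manual_label")


def summarize(users: Iterable[dict]) -> Tuple[Counter, Counter]:
    """Count predicted_gender values and how often each flag is truthy."""
    genders: Counter = Counter()
    flags: Counter = Counter()
    for user in users:
        # .get() returns None for a missing field instead of raising KeyError.
        genders[user.get("predicted_gender")] += 1
        for field in FLAG_FIELDS:
            if user.get(field):  # assumption: flags are booleans
                flags[field] += 1
    return genders, flags


# Made-up records for demonstration only; the label values are hypothetical.
sample = [
    {"id": 1, "predicted_gender": "female", "has_names": True, "has_pronouns": True},
    {"id": 2, "predicted_gender": "male", "has_names": True, "has_keywords": False},
]
gender_counts, flag_counts = summarize(sample)
print(gender_counts)
print(flag_counts)
```

The function accepts any iterable of dicts, so it can be fed records streamed from the JSONL file rather than a list held in memory.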