Comet - Gender detection CMC 2025

Reference

Fränti, P., Järviö, J., Salimi, M., Taipale, I., Laitinen, M., Albicker, R., Nie, C., Fatemi, M., & Rautionaho, P. (2025). Beyond names: How to label gender automatically in CMC data? 12th International Conference on CMC and Social Media Corpora for the Humanities. CMC-Corpora, University of Bayreuth, Germany.

Code

Code available at https://github.com/comet-uef/

Data

Download comet-gender-dataset.jsonl.gz
Note: The compressed file is about 30 MB; the uncompressed file is over 100 MB.

The dataset is in JSONL format: each line is a JSON object describing one user. The file contains about 343k lines.
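
Because the file is gzip-compressed JSONL, it can be read one line at a time without unpacking it to disk first. The following is a minimal sketch in Python using only the standard library; the function name iter_users is a hypothetical helper, not part of the released code, and the sketch assumes the file is UTF-8 encoded.

```python
import gzip
import json
from typing import Iterator


def iter_users(path: str) -> Iterator[dict]:
    """Yield one user record (a dict) per non-empty line of a gzipped JSONL file.

    Assumption: the file is UTF-8 encoded text, one JSON object per line.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines, if there are any
                yield json.loads(line)
```

With the downloaded file in the working directory, `for user in iter_users("comet-gender-dataset.jsonl.gz"): ...` would stream the records one by one, so the full uncompressed content never has to be held in memory.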

Main fields:

id — Unique user ID. This is not the Twitter ID given by the API.
name — Profile name
description — Profile description
location — Profile location. Free-text field, not necessarily an actual geographical location.
predicted_gender — Final predicted gender, combined result of the different classification methods
has_names, has_pronouns, has_keywords, has_manual_label — Flags for available classification types
name_classification — Name-based gender prediction details
pronoun_classification — Pronoun-based gender prediction details
keyword_classification — Keyword-based gender prediction details
manual_classification — Manual label details
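
As a rough illustration of how the main fields might be used, the sketch below tallies the predicted_gender values and counts how many records have each has_* flag set. It makes several assumptions that the field list does not confirm: that predicted_gender holds a simple value such as a string, that the flags are booleans (or otherwise truthy/falsy), and that a field may be missing from a record. The function name summarize is hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple

# Flag fields named in the field list above.
FLAG_FIELDS = ("has_names", "has_pronouns", "has_keywords", "has_manual_label")


def summarize(users: Iterable[dict]) -> Tuple[Counter, Counter]:
    """Count predicted_gender values and how often each has_* flag is truthy.

    Assumptions: predicted_gender is a hashable scalar (e.g. a string),
    and the flags are booleans or 0/1 values. Missing fields are tolerated.
    """
    genders: Counter = Counter()
    flags: Counter = Counter()
    for user in users:
        # A record without the field is counted under the key None.
        genders[user.get("predicted_gender")] += 1
        for field in FLAG_FIELDS:
            if user.get(field):
                flags[field] += 1
    return genders, flags
```

The function accepts any iterable of parsed records, for example a list of dictionaries produced by calling json.loads on each line of the file. Inspecting the keys of the first Counter is also a quick way to see which label values actually occur in the data.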