Fränti, P., Järviö, J., Salimi, M., Taipale, I., Laitinen, M., Albicker, R., Nie, C., Fatemi, M., & Rautionaho, P. (2025). Beyond names: How to label gender automatically in CMC data? 12th International Conference on CMC and Social Media Corpora for the Humanities. CMC-Corpora, University of Bayreuth, Germany.
Code available at https://github.com/comet-uef/
Download: comet-gender-dataset.jsonl.gz
Note: the compressed file is about 30 MB; the uncompressed file is over 100 MB.

The dataset is in JSONL format: each line is a JSON object describing one user. The file contains 343k lines.
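
Because each line is an independent JSON object, the file can be read one record at a time without unpacking it to disk. The following is a minimal sketch using only the Python standard library; the function name iter_users is hypothetical, and UTF-8 encoding is an assumption about the file rather than something stated here.

```python
import gzip
import json
from typing import Iterator


def iter_users(path: str) -> Iterator[dict]:
    """Yield one user record (a dict) per non-empty line of a gzipped JSONL file."""
    # "rt" opens the compressed stream in text mode; UTF-8 is an assumption.
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines, if any
                yield json.loads(line)
```

Assuming the file has been downloaded to the working directory, next(iter_users("comet-gender-dataset.jsonl.gz")) would return the first user record as a dict. Streaming this way keeps memory use low, since only one line is decoded at a time.
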
Main fields:
    id: Unique user ID. This is not the Twitter ID returned by the API.
    name: Profile name.
    description: Profile description.
    location: Profile location. This is a free-text field, not an actual geographical location.
    predicted_gender: Final predicted gender, the combined result of the different classification methods.
    has_names, has_pronouns, has_keywords, has_manual_label: Flags indicating which classification types are available for the user.
    name_classification: Name-based gender prediction details.
    pronoun_classification: Pronoun-based gender prediction details.
    keyword_classification: Keyword-based gender prediction details.
    manual_classification: Manual label details.
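
The fields above can be summarized with a few lines of Python. The sketch below counts the values of predicted_gender and how often each has_* flag is set. It assumes the flags are stored as values that are truthy when the corresponding evidence exists; the label values and the sample records are hypothetical and are not taken from the dataset.

```python
from collections import Counter
from typing import Iterable, Tuple

FLAG_FIELDS = ("has_names", "has_pronouns", "has_keywords", "has_manual_label")


def summarize(records: Iterable[dict]) -> Tuple[Counter, Counter]:
    """Count predicted_gender values and how often each evidence flag is set."""
    gender_counts: Counter = Counter()
    flag_counts: Counter = Counter()
    for rec in records:
        # Missing labels are counted under None rather than raising an error.
        gender_counts[rec.get("predicted_gender")] += 1
        for flag in FLAG_FIELDS:
            if rec.get(flag):  # assumption: flags are truthy when evidence exists
                flag_counts[flag] += 1
    return gender_counts, flag_counts


# Hypothetical records for illustration only; values are not from the dataset.
sample = [
    {"id": 1, "predicted_gender": "female", "has_names": True, "has_pronouns": False},
    {"id": 2, "predicted_gender": "male", "has_names": True, "has_pronouns": True},
]
genders, flags = summarize(sample)
print(genders)
print(flags)
```

The function accepts any iterable of dicts, so it can be applied to a stream of records parsed from the file as well as to an in-memory list.
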