Projects
Corpora, models and tools — mostly around poetry, emotion and digital ecocriticism.
GitHub — @tnhaider ↗
- Deutsches Lyrik Korpus (DLK) ↗
A large corpus of New High German poetry, built by collecting and parsing the bulk of digitized public-domain German verse, chiefly from the Deutsches Textarchiv and the TextGrid Digital Library. The current v6 holds ~65,700 poems by ~250 authors (over 12 million tokens), segmented to syllable level. It underpins much of my distant-reading work on verse.
- PO-EMO — Aesthetic Emotions in Poetry ↗
A corpus of German and English poetry annotated line by line for the aesthetic emotions it evokes (beauty/joy, sadness, suspense, awe, and more), with prosody annotation and modeling code. It comprises 158 German and 64 English double-annotated poems, plus a Chinese pilot. The dataset behind the PO-EMO paper.
- Metrical Tagging in the Wild ↗
German and English poetry corpora annotated with prosodic features — syllable stress, meter, measure, caesura — together with the neural taggers trained on them. Code and data for the EACL 2021 paper.
- German Rhyme Corpus ↗
A diachronically balanced sample of German poetry, manually annotated for rhyme — notably, about a third of stanzas don’t rhyme at all. Encoded in TEI P5; the resource behind the supervised rhyme-detection paper (SIGHUM 2018).
- Poetry Corpus Building ↗
Tools for scraping, cleaning, and assembling large poetry corpora from heterogeneous sources.
- English Gutenberg Poetry ↗
An English poetry corpus mined from Project Gutenberg (via GutenTag) and prepared for computational analysis.
- Poetry Gold ↗
A diachronically balanced, hand-selected gold corpus of English poetry for evaluating automatic analysis.
- Antikörperchen Poetry Corpus ↗
Canonical German poems paired with student interpretations and several annotation layers, crawled from the Antikörperchen poetry site. (See also the related “antik” repository.)
- EPG64 — Annotated English Poetry ↗
A small reference set of English poems with manual gold annotation.
- XML Poetry Tooling ↗
Parsers for reading poetry encoded in TEI/DTA-conforming XML into analysis-ready formats. (See also “poetry-api-xml-dta”.)
- Narrator Semantic Change ↗
Data and code behind “The Ongoing Birth of the Narrator” (DSH 2025): annotation guidelines, the gold-labeled instances, and the model folds used to trace the emergence of the author–narrator distinction in literary criticism.
- NYT Fiction Bestseller List ↗
A dataset of titles and authors on the New York Times fiction bestseller list, 2000–2020 — the basis for the “More Social, Less Religious” trend study.
- Literotica Corpus ↗
A corpus of over 110,000 erotic fan-fiction documents crawled from literotica.com, organized by user rating and genre — assembled for computational study of genre, style, and narrative.
- Web Crawlers ↗
A set of web crawlers used to build text corpora — for the Antikörperchen poetry site, Literotica, and MGG-Online (musicology).
- DHd2024 Book of Abstracts ↗
Scripts, adapted from a TEI-to-PDF pipeline, that turn TEI-encoded XML into the printed PDF Book of Abstracts for the DHd2024 conference in Passau, which I co-organized.
GitLab · Universität Passau — @haider ↗
- Bundestagsprotokolle ↗
A corpus and processing pipeline for German parliamentary proceedings (Bundestag Plenarprotokolle), underpinning our work on linking parliamentary speakers to Wikidata and on German nuclear-energy discourse. (A collaboration with M. Teich; the repository is internal.)
- DHd Citation Network ↗
Citation, co-authorship and keyword network analysis of the DHd conference abstracts (2014–2023); the code behind “Quo Vadis Fachbereiche und Schulen der DHd.” (A collaboration with S. Gassner; the repository is internal.)
- Multilingual Valence Scaling ↗
A curated collection of psycholinguistic valence (affective pleasantness) rating datasets: 37 datasets across 13 languages, mapped to Concepticon via NoRaRe. It is used to evaluate LLM valence scaling with Best-Worst Scaling across 16 models (0.5B–671B parameters). The basis of the forthcoming valence-norms paper.
- Ecocriticism Bibliography Pipeline ↗
A pipeline that extracts plain text from PDFs and uses a local Mistral model to pull book metadata — and, soon, literary mentions — from a corpus of secondary literature on ecocriticism, toward mapping the “ecocritical canon.”
- EcoCor — Corpus Extraction ↗
Code and data for retrieving entities and texts from the EcoCor API — part of the open EcoCor infrastructure for digital ecocriticism introduced at DH2026.
- Enjambement in German & Spanish ↗
Code and data for detecting enjambement — line breaks that run against syntax — in German and Spanish poetry. Supports the forthcoming CHR paper. (Repository is internal.)
- EcoCor — Modeling ↗
Modeling and prediction over the EcoCor ecocriticism corpus. (Repository is internal.)
- EcoHack Modeling ↗
Modeling experiments around ecocriticism and the environmental humanities. (Repository is internal.)
- Valence Scaling ↗
Earlier experiments on automatically scaling affective (valence) norms for words. (Repository is internal.)
- hopecore ↗
(Description to come.)