Law in British Periodicals is a collaborative digital humanities project investigating the representation of law, legal language, and heteroglossic discourse in 18th- and 19th-century British periodical literature. The project combines archival research with computational methods to trace how legal vocabulary, genres, and arguments circulated across the periodical press.
| Name | Institution | Role |
|---|---|---|
| Clifford B. Anderson | Yale University | |
| Corey Brady | Southern Methodist University | |
| Sophie Hao | Boston University | |
| Mark Schoenfield | Vanderbilt University | |
All datasets are private due to contractual restrictions. Each year in the corpus is stored as a source dataset plus one derived dataset for each of the three pipeline stages:
| Dataset | Description |
|---|---|
| LawInBritishPeriodicals/[year] | Source PDFs, one row per document |
| LawInBritishPeriodicals/[year]-images | Page images rasterized at 150 DPI, one row per page, with `source_file`, `page_number`, and `total_pages` metadata |
| LawInBritishPeriodicals/[year]-ocr | OCR transcriptions in Markdown format, generated with GLM-OCR |
| LawInBritishPeriodicals/[year]-classified | Topic classifications with confidence scores, generated with Qwen2.5-7B-Instruct |
Current holdings: 1770 · 1811
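The naming pattern listed above can be expanded mechanically for each year held. The following is a minimal sketch, not part of the project's scripts: the `dataset_ids` helper and the `source`/`images`/`ocr`/`classified` keys are hypothetical names introduced for illustration, and only the organization name, suffixes, and years come from this page.

```python
ORG = "LawInBritishPeriodicals"

# Suffix for each dataset kind, following the naming pattern listed above.
# The dictionary keys are illustrative labels, not project identifiers.
SUFFIXES = {
    "source": "",
    "images": "-images",
    "ocr": "-ocr",
    "classified": "-classified",
}


def dataset_ids(year: int) -> dict[str, str]:
    """Return the dataset identifier for every kind of dataset in one year."""
    return {kind: f"{ORG}/{year}{suffix}" for kind, suffix in SUFFIXES.items()}


if __name__ == "__main__":
    for year in (1770, 1811):  # current holdings
        for kind, repo_id in dataset_ids(year).items():
            print(f"{kind:<10} {repo_id}")
```

Because the datasets are private, any attempt to load these identifiers would additionally require an access token with membership in the organization.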
The project uses a fully reproducible, GPU-accelerated processing pipeline built on HuggingFace Jobs and uv scripts.
PDF documents
│
▼ pdf_to_images.py (CPU)
Page images (150 DPI PNG)
│
▼ glm-ocr-v2.py (L4 GPU)
OCR transcriptions (Markdown)
│
▼ classify_topics.py (L4 GPU)
Topic classifications (JSON with labels + confidence scores)
Pipeline scripts are stored at LawInBritishPeriodicals/scripts. Full documentation is included in that repository's README.
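The stage order above can also be written down as data, which makes it easy to check that each stage reads what the previous one wrote. This is an illustrative sketch only: the `Stage` class, `plan`, and `is_chained` are hypothetical helpers, and the assumption that each script reads and writes the year's datasets by these names is inferred from the dataset table rather than taken from the scripts themselves.

```python
from dataclasses import dataclass

ORG = "LawInBritishPeriodicals"


@dataclass(frozen=True)
class Stage:
    script: str         # script name as shown in the pipeline diagram
    hardware: str       # hardware noted in the pipeline diagram
    input_suffix: str   # dataset suffix the stage is assumed to read
    output_suffix: str  # dataset suffix the stage is assumed to write


STAGES = (
    Stage("pdf_to_images.py", "CPU", "", "-images"),
    Stage("glm-ocr-v2.py", "L4 GPU", "-images", "-ocr"),
    Stage("classify_topics.py", "L4 GPU", "-ocr", "-classified"),
)


def plan(year: int) -> list[tuple[str, str, str, str]]:
    """Return (script, hardware, input dataset, output dataset) per stage, in order."""
    return [
        (
            stage.script,
            stage.hardware,
            f"{ORG}/{year}{stage.input_suffix}",
            f"{ORG}/{year}{stage.output_suffix}",
        )
        for stage in STAGES
    ]


def is_chained(steps: list[tuple[str, str, str, str]]) -> bool:
    """Check that every stage reads the dataset written by the stage before it."""
    return all(prev[3] == nxt[2] for prev, nxt in zip(steps, steps[1:]))


if __name__ == "__main__":
    for script, hardware, src, dst in plan(1770):
        print(f"{script:<20} [{hardware}]  {src} -> {dst}")
```

The sketch deliberately stops short of launching anything; the actual job invocation and script arguments are documented in the scripts repository's README.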
Transcription uses GLM-OCR (zai-org, MIT License), a 0.9B-parameter multimodal OCR model that scores 94.62% on OmniDocBench v1.5, the top overall result at the time of writing. It handles 18th- and 19th-century printed English effectively and supports multilingual output.
Classification uses Qwen2.5-7B-Instruct with a project-specific taxonomy:
| Label | Scope |
|---|---|
| legal | Statutes, trials, legal commentary, court reports |
| dramatic | Theatre reviews, playbills, dramatic criticism |
| parliamentary | Parliamentary proceedings, political speeches, elections |
| commercial | Trade, prices, shipping, finance, advertisements |
| literary | Poetry, fiction, literary criticism, essays |
| religious | Sermons, moral philosophy, ecclesiastical affairs |
| natural | Natural history, medicine, science, technology |
| other | Does not fit any of the above |
Pages may receive multiple labels. Each label carries a confidence score (0.0–1.0) and a one-sentence rationale.
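A classification record of this shape can be checked against the taxonomy before it is used downstream. The sketch below is hedged throughout: the field names `label`, `confidence`, and `rationale`, the list-of-objects layout, and the sample record are all assumptions made for illustration, since the exact JSON schema is defined in the scripts repository rather than on this page.

```python
import json

# The eight labels of the project taxonomy.
LABELS = {
    "legal", "dramatic", "parliamentary", "commercial",
    "literary", "religious", "natural", "other",
}


def parse_classification(raw: str) -> list[dict]:
    """Parse one page's classification JSON and validate labels and scores.

    Assumes a JSON list of objects with "label", "confidence", and
    "rationale" keys (hypothetical field names).
    """
    checked = []
    for item in json.loads(raw):
        label = item["label"]
        score = float(item["confidence"])
        if label not in LABELS:
            raise ValueError(f"unknown label: {label!r}")
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"confidence out of range for {label!r}: {score}")
        checked.append(
            {"label": label, "confidence": score, "rationale": item.get("rationale", "")}
        )
    return checked


if __name__ == "__main__":
    # Invented example record, not project data.
    sample = json.dumps([
        {"label": "legal", "confidence": 0.9, "rationale": "Reports a trial."},
        {"label": "parliamentary", "confidence": 0.4, "rationale": "Mentions a debate."},
    ])
    for entry in parse_classification(sample):
        print(entry["label"], entry["confidence"])
```

Rejecting unknown labels outright, rather than mapping them to `other`, keeps model output that drifts from the taxonomy visible instead of silently absorbing it.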
| Space | Description |
|---|---|
| LawInBritishPeriodicals/1770-dashboard | Interactive topic classification dashboard (label frequency, score distributions, co-occurrence heatmap, timeline) |
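Two of the dashboard views, label frequency and label co-occurrence, reduce to simple counts over the per-page label sets. The following is a minimal sketch of that counting under stated assumptions, not the dashboard's actual code: `label_stats` is a hypothetical helper, and the input is assumed to be one list of labels per page.

```python
from collections import Counter
from itertools import combinations


def label_stats(pages: list[list[str]]) -> tuple[Counter, Counter]:
    """Count how often each label occurs and how often each pair co-occurs.

    Each page contributes at most once to a label's count and at most once
    to each unordered pair; pairs are stored in alphabetical order.
    """
    frequency: Counter = Counter()
    co_occurrence: Counter = Counter()
    for labels in pages:
        unique = sorted(set(labels))
        frequency.update(unique)
        co_occurrence.update(combinations(unique, 2))
    return frequency, co_occurrence


if __name__ == "__main__":
    # Invented label sets for three pages, not project data.
    pages = [
        ["legal", "parliamentary"],
        ["legal"],
        ["parliamentary", "legal", "commercial"],
    ]
    frequency, co_occurrence = label_stats(pages)
    print(frequency.most_common())
    print(co_occurrence.most_common())
```

The pair counts form the upper triangle of a symmetric matrix, which is the data a co-occurrence heatmap would plot.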