ACL Anthology Reference Corpus
This is the home page of the ACL Anthology Reference Corpus, a corpus of scholarly publications about Computational Linguistics. This corpus has two versions; both are canonicalized subsets of the ACL Anthology. The newer version includes all ACL Anthology files whose copyright belongs to the ACL (excluding COLING, LREC, etc.), up to December 2015, consisting of 22,878 articles. We hope this frozen corpus will be used for benchmarking applications for scholarly and bibliometric data processing.
- Version 20160301: This is the newer (v2.0) version, launched on 1 March 2016. It covers the PDFs, OCRed text (via Nuance Omnipage 16) and automatically extracted logical document structure with text and parsed citations (via ParsCit). The PDFs include revisions and errata, where applicable.
- [ ACL XML Metadata ] - (1.8M) The XML metadata in ACL XML format, used to generate the ACL Anthology representation. See the tools and related links for the unofficial DTD.
- [ PDFs ] - (7.8G) The original PDFs from the ACL Anthology. Full volume files are not included.
- [ OmniPage OCR XML ] - (2.0G) XML output from the commercial optical character recognition software, Nuance Omnipage.
- [ ParsCit structured XML ] - (356M) Logical document structure and parsed citation information from ParsCit (v130908).
- Version 20090501: This is the older version distributed by the Linguistic Data Consortium (LDC). This corpus is a canonicalized subset of the ACL Anthology, up to February 2007, consisting of 10,921 articles. This version adds each page in both text and image form, as created by running OCR over the PDF files (Nuance Omnipage 15 or 16). For detailed information about the citation structure, see the related project, Anthology Author Network, listed below.
[ DVD Disc 1 ] - Interlink data (clean), XML metadata (not clean, from Anthology), Image files in PNG format, text files from Omnipage in formatted and normal styles.
[ DVD Disc 2 ] - continuing PNG files.
[ DVD Disc 3 ] - remaining PNG files, PDFs from Anthology.
[ DVD Disc 4 ] - remaining PDF files, text files from Omnipage in XML style.
If you have a copy of the ACL ARC from LDC, you may be missing some key files: those that give the textual dump of each page in three different formats. Here are a few quick links to these files:
- Version 20080325: This is the version described in the LREC paper. It contains the canonical 10,921 computational linguistics papers as PDF and plain text files, with the associated metadata. (You can also email me to request a DVD copy of the corpus.)
[ Complete tgz file from NUS ] [ Complete tgz file from Macquarie Univ. (courtesy Robert Dale) ] Warning: huge! (4621149669 bytes, ~4.3 GiB) Expect retries; use a client with resume capability.
[ tgz file (without PDFs) ] (111001977 bytes, ~100MB)
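Because the full archive is large and downloads may be interrupted, it can help to compare the size of the local file with the byte count quoted above before unpacking. The following is a minimal sketch, not part of the corpus tooling: the function names and the local file name `acl-arc.tgz` are hypothetical, and only the two byte counts come from this page.

```python
import os

# Byte counts quoted on this page for the Version 20080325 archives.
FULL_ARCHIVE_BYTES = 4621149669
NO_PDF_ARCHIVE_BYTES = 111001977


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to binary gigabytes (GiB, 2**30 bytes)."""
    return num_bytes / 2**30


def download_status(path: str, expected_bytes: int) -> str:
    """Classify a local file as 'missing', 'partial', 'complete' or 'oversized'.

    Only the file size is compared; this is not a checksum and cannot
    detect corruption inside a file of the right length.
    """
    if not os.path.exists(path):
        return "missing"
    actual = os.path.getsize(path)
    if actual < expected_bytes:
        return "partial"      # resume the download
    if actual == expected_bytes:
        return "complete"
    return "oversized"        # probably not the expected file


if __name__ == "__main__":
    # "acl-arc.tgz" is a hypothetical local file name.
    print(download_status("acl-arc.tgz", FULL_ARCHIVE_BYTES))
    print(f"Full archive: {to_gib(FULL_ARCHIVE_BYTES):.2f} GiB")
```

A "partial" result suggests resuming the transfer with a client that supports it, rather than starting over.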
- Steven Bird, Robert Dale, Bonnie Dorr, Bryan Gibson, Mark Joseph, Min-Yen Kan, Dongwon Lee, Brett Powley, Dragomir Radev and Yee Fan Tan (2008) The ACL Anthology Reference Corpus: A Reference Dataset for Bibliographic Research in Computational Linguistics. In Proc. of Language Resources and Evaluation Conference (LREC 08). Marrakesh, Morocco, May.
[ .pdf pre-print ] [ Slides (.htm) ]
Members for the 2016 version.
- Min-Yen Kan - Project leader, National University of Singapore
- Muthu Kumar Chandrasekaran, National University of Singapore
Members for the previous 2009 versions.
- Min-Yen Kan - Project leader, National University of Singapore
- Steven Bird, University of Melbourne
- Robert Dale, Macquarie University
- Bonnie Dorr, University of Maryland
- Bryan Gibson, University of Michigan
- Mark Joseph, University of Michigan
- Dongwon Lee, Pennsylvania State University
- Brett Powley, Macquarie University
- Dragomir Radev, University of Michigan
- Yee Fan Tan, National University of Singapore
Links to information about the corpus itself, alternative and related corpora, and specific tools to process it. The links below pertain to the earlier versions from 2009.
- A DTD for the XML metadata, contributed by Martin Helmhout.
- An updated fileList.txt and accompanying notes, contributed by Martin Helmhout.
- A parser for text sectioning and segmentation of ACL ARC (~14 MB) into the SEPID ARC format, contributed by Behrang Qasemizadeh. The authors kindly request that, if you use this tool, you cite their work: Behrang Qasemizadeh, Paul Buitelaar, Fergal Monaghan, Developing a Dataset for Technology Structure Mining, 4th IEEE ICSC, 2010.
- Looglefight.com: A concordance searcher / search engine for the ACL ARC. Thanks to Adam Kilgarriff, Jan Pomikálek and Girish for creating this resource.
- ACL Reference Dataset for Terminology Extraction and Classification (ACL RD-TEC 2.0) created by Anne-Kathrin Schumann and Behrang QasemiZadeh. ACL RD-TEC 2.0 consists of 300 abstracts from ACL ARC which are manually annotated for terms in context. In these abstracts, terms (i.e., single or multi-word lexical units with a specialised meaning) and their semantic classes (i.e., technologies, systems, language resources, language resources (specific product), models, measure and measurement, as well as a class label for residuals) are marked by two annotators.
- ACL RD-TEC 1.0, also created by Behrang Qasemizadeh. It is a data set of manually annotated terms for benchmarking and evaluating automatic term recognition algorithms on publications in the domain of computational linguistics. In its first release, the dataset consists of more than 80,000 manually annotated candidate terms; each candidate term is annotated as a valid, invalid or technology term. In short, "technology terms" are those items of computational linguistics jargon that denote processes, methods and algorithms; these terms signify practical solutions to the problems that are addressed in computational linguistics.
Here we list some related tools for bibliographic processing, and related sites for bibliographic research.
- The corpus description and ordering information at LDC. Note that the corpus is also available free of charge from this website.
- ACL Anthology Network: A parallel initiative at the University of Michigan to construct a social network graph of researchers in computational linguistics.
- CLAIRLIB: A set of modules for NLP, including a set of tools for bibliometric analysis.
- ACL Anthology: The current version of the ACL Anthology, from which the ACL ARC is derived.
- ParsCit: A tool to automatically perform reference string parsing as well as logical document structure recovery. Created by the folks at WING of National University of Singapore (NUS).
Our efforts have been supported by the grassroots initiative call made by the ACL Exec at the 2007 ACL annual meeting in Prague. We would like to acknowledge the support of the ACL Exec in encouraging this form of collaboration.
Thanks also go to Behrang Qasemizadeh, PhD student in the Unit for Natural Language Processing, Digital Enterprise Research Institute of the National University of Ireland, Galway (funded by Science Foundation Ireland), for his work on the SEPID ARC format, and to Martin Helmhout of Southampton for his work on proof-checking the files and schema of the XML files.