ACL Anthology Reference Corpus (ACL ARC)
This is the home page of the ACL Anthology Reference Corpus, a
corpus of scholarly publications about Computational Linguistics.
This corpus is a canonicalized subset of the ACL Anthology, up to
February 2007, consisting of 10,921 articles. We hope this frozen
corpus will be used for benchmarking applications for scholarly and
bibliometric data processing.
Download the corpus
- Version 20090501: This is the version distributed by the Linguistic Data Consortium (LDC). This version adds each page in both text and image form; the text was created by running OCR over the PDF files (Nuance Omnipage 15 or 16). For detailed information about the citation structure, see the related project, Anthology Author Network (see below).
[ DVD Disc 1 ] - Interlink data (clean), XML metadata (not clean, from Anthology), Image files in PNG format, text files from Omnipage in formatted and normal styles.
[ DVD Disc 2 ] - continuing PNG files.
[ DVD Disc 3 ] - remaining PNG files, PDFs from Anthology
[ DVD Disc 4 ] - remaining PDF files, text files from Omnipage in XML style.
If you have a copy of the ACL ARC from LDC, you may be missing some of the key files: those that give the textual dump of each page in three different formats. Here are a few quick links to the files:
- Version 20080325: This is the version described in the LREC paper. It contains the canonical 10,921 computational linguistics papers as PDF and plain text files, with the associated metadata. (You can also email me to request a DVD copy of the corpus.)
[ Complete tgz file from NUS ] [ Complete tgz file from Macquarie Univ. (courtesy Robert Dale) ] Warning: huge! (4621149669 bytes, ~4.6 GB) Expect retries; use a client with resume capability.
[ tgz file (without PDFs) ] (111001977 bytes, ~111 MB)
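For readers without a download client that can resume, the advice above can be sketched in a few lines of Python. This is a minimal illustration, not a tool distributed with the corpus: the function names are made up for this example, the URL in the usage comment is a placeholder rather than the real download location, and the sketch assumes the server honours HTTP `Range` requests. The only value taken from this page is the archive size in bytes, which is used to decide when the file is complete.

```python
import http.client
import os
import urllib.request

# Size of the complete tgz archive, as stated on this page.
EXPECTED_SIZE = 4621149669


def resume_headers(bytes_on_disk):
    """Build headers that ask the server to skip the bytes already saved."""
    if bytes_on_disk > 0:
        return {"Range": f"bytes={bytes_on_disk}-"}
    return {}


def is_complete(path, expected_size=EXPECTED_SIZE):
    """True when the local file has exactly the expected number of bytes."""
    return os.path.exists(path) and os.path.getsize(path) == expected_size


def download_with_resume(url, dest, expected_size=EXPECTED_SIZE,
                         max_tries=10, chunk=1 << 20):
    """Fetch url into dest, continuing a partial file after each failure."""
    for _ in range(max_tries):
        if is_complete(dest, expected_size):
            return True
        have = os.path.getsize(dest) if os.path.exists(dest) else 0
        request = urllib.request.Request(url, headers=resume_headers(have))
        try:
            with urllib.request.urlopen(request, timeout=60) as response:
                # 206 means the Range header was honoured, so append.
                # Any other success code means the full file is being sent
                # again, so overwrite what is on disk.
                mode = "ab" if response.status == 206 else "wb"
                with open(dest, mode) as out:
                    while True:
                        block = response.read(chunk)
                        if not block:
                            break
                        out.write(block)
        except (OSError, http.client.HTTPException):
            # Network or HTTP failure: try again from the current offset.
            continue
    return is_complete(dest, expected_size)


# Hypothetical usage (placeholder URL, not the real link):
# download_with_resume("https://example.org/acl-arc.tgz", "acl-arc.tgz")
```

Comparing the final size against the published byte count catches truncated downloads, but it is not an integrity check; if a checksum is available from the distributor, verify that as well.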
- Steven Bird, Robert Dale, Bonnie Dorr, Bryan Gibson, Mark Joseph, Min-Yen Kan, Dongwon Lee, Brett Powley, Dragomir Radev and Yee Fan Tan (2008) The ACL Anthology Reference Corpus: A Reference Dataset for Bibliographic Research in Computational Linguistics. In Proc. of Language Resources and Evaluation Conference (LREC 08). Marrakesh, Morocco, May.
[ .pdf pre-print ]
[ Slides (.htm) ]
- Min-Yen Kan - Project leader, National University of Singapore
- Steven Bird, University of Melbourne
- Robert Dale, Macquarie University
- Bonnie Dorr, University of Maryland
- Bryan Gibson, University of Michigan
- Mark Joseph, University of Michigan
- Dongwon Lee, Pennsylvania State University
- Brett Powley, Macquarie University
- Dragomir Radev, University of Michigan
- Yee Fan Tan, National University of Singapore
Tools and Related Links
Links to information about the corpus itself, alternative and related corpora, and specific tools.
- A DTD for the XML metadata, contributed by Martin Helmhout.
- An updated fileList.txt and accompanying notes, contributed by Martin Helmhout.
- A parser for text sectioning and segmentation of ACL ARC (~14 MB) into the SEPID ARC format (updated link, new as of 1 Dec 2014), contributed by Behrang Qasemizadeh.
The authors kindly request that, if you use this tool, you cite their work:
Behrang Qasemizadeh, Paul Buitelaar, Fergal Monaghan, Developing a Dataset for Technology Structure Mining, 4th IEEE ICSC, 2010.
- Looglefight.com: A concordance searcher / search engine for the ACL ARC. Thanks to Adam Kilgarriff, Jan Pomikálek and Girish for creating this resource.
- ACL Reference Dataset for Terminology Extraction and Classification (ACL RD-TEC) (updated link, new as of 1 Dec 2014), also created by Behrang Qasemizadeh. It is a dataset of manually annotated terms for benchmarking and evaluating automatic term recognition algorithms on publications in the domain of computational linguistics. In its first release, the dataset consists of more than 80,000 manually annotated candidate terms, each annotated as a valid term, an invalid term or a technology term. In short, "technology terms" are those items of computational linguistics jargon that signal processes, methods and algorithms; these terms signify practical solutions to the problems addressed in computational linguistics.
Here we list some related tools for bibliographic processing, and related sites for bibliographic research.
- The corpus description and ordering information at LDC. Note that the corpus is also available free of charge from this website.
- ACL Anthology Network: A parallel initiative at the University of Michigan to construct a social network graph of researchers in computational linguistics.
- CLAIRLIB: A set of modules for NLP, including a set of tools for bibliometric analysis.
- ACL Anthology: The current version of the ACL Anthology, from which the ACL ARC is derived.
- ParsCit: A tool to automatically perform reference string parsing as well as logical document structure recovery. Created by the folks at WING of National University of Singapore (NUS).
Our efforts have been supported by the call for grassroots
initiatives made by the ACL Exec at the 2007 ACL annual meeting in
Prague. We would like to acknowledge the support of the ACL Exec in
encouraging this form of collaboration.
Thanks also go to Behrang Qasemizadeh, PhD student in the Unit for
Natural Language Processing, Digital Enterprise Research Institute of
the National University of Ireland, Galway (funded by Science
Foundation Ireland) for his work on the SEPID ARC format, and to Martin
Helmhout of Southampton for his work on proof-checking the files and
the schema of the XML files.
Min-Yen Kan <firstname.lastname@example.org>
Created on: Wed May 5 16:07:15 2004 | Version: 1.0 | Last modified: Sat Mar 29 00:26:41 2008