ACL Anthology Reference Corpus (ACL ARC)
This is the home page of the ACL Anthology Reference Corpus, a
corpus of scholarly publications about Computational Linguistics.
This corpus is a canonicalized subset of the ACL Anthology, up to
February 2007, consisting of 10,921 articles. We hope this frozen
corpus will be used for benchmarking applications for scholarly and
bibliometric data processing.
Download the corpus
- Version 20090501: This is the version distributed by the Linguistic Data Consortium (LDC). This version adds each page in both image and text form; the text was created by running OCR over the PDF files (Nuance Omnipage 15 or 16). For detailed information about the citation structure, see the related project, the ACL Anthology Network (listed under Tools and Related Links below).
[ DVD Disc 1 ] - Interlink data (clean), XML metadata (not clean, from Anthology), image files in PNG format, text files from Omnipage in formatted and normal styles.
[ DVD Disc 2 ] - continuing PNG files.
[ DVD Disc 3 ] - remaining PNG files, PDFs from Anthology.
[ DVD Disc 4 ] - remaining PDF files, text files from Omnipage in XML style.
If you have a copy of the ACL ARC from LDC, you may be missing some of the key files: those that give the textual dump of each page in three different formats. Here are a few quick links to the files:
- Version 20080325: This is the version described in the LREC paper. It contains the canonical 10,921 computational linguistics papers as PDF and plain text files, with the associated metadata. (You can also email me to request a DVD copy of the corpus.)
[ Complete tgz file from NUS ] [ Complete tgz file from Macquarie Univ. (courtesy Robert Dale) ] Warning: huge! (4621149669 bytes, ~4.3 GiB). Expect retries; use a client with resume capability.
[ tgz file (without PDFs) ] (111001977 bytes, ~106 MiB)
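Since the complete archive is large and the download may need to be retried, the following is a minimal Python sketch of a resumable download plus a size check against the byte counts listed above. It is an illustration under stated assumptions, not a tool provided by this project: the URL and local file name in the usage note are hypothetical placeholders, the function names are made up for this example, and resuming only works if the server honours HTTP `Range` requests.

```python
import os
import urllib.request

# Byte counts as listed on this page for the 20080325 release.
EXPECTED_BYTES = {
    "full": 4621149669,    # complete tgz, with PDFs
    "no_pdf": 111001977,   # tgz without PDFs
}


def build_resume_request(url, partial_path):
    """Build a request for the bytes that are not yet on disk.

    Returns the request and the number of bytes already downloaded.
    """
    have = os.path.getsize(partial_path) if os.path.exists(partial_path) else 0
    request = urllib.request.Request(url)
    if have > 0:
        # "bytes=N-" asks for everything from offset N to the end of the file.
        request.add_header("Range", f"bytes={have}-")
    return request, have


def download_with_resume(url, partial_path, chunk_size=1 << 20):
    """Fetch the remaining bytes and append them to partial_path."""
    request, have = build_resume_request(url, partial_path)
    with urllib.request.urlopen(request) as response:
        # 206 means the server honoured the Range header, so append.
        # Any other success status means a full body, so overwrite.
        mode = "ab" if have > 0 and response.status == 206 else "wb"
        with open(partial_path, mode) as out:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)


def is_complete(path, expected_bytes):
    """True if the file exists and has exactly the expected size."""
    return os.path.exists(path) and os.path.getsize(path) == expected_bytes
```

A possible usage pattern, with a placeholder URL and file name, is to loop until the size matches: `while not is_complete("acl-arc.tgz", EXPECTED_BYTES["full"]): download_with_resume("http://example.invalid/acl-arc.tgz", "acl-arc.tgz")`, wrapped in whatever error handling and retry limit suits your network. A matching size only shows that the transfer finished; it does not prove the contents are intact, since no checksum is published on this page.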
Publications
Refereed:
- Steven Bird, Robert Dale, Bonnie Dorr, Bryan Gibson, Mark Joseph, Min-Yen Kan, Dongwon Lee, Brett Powley, Dragomir Radev and Yee Fan Tan (2008) The ACL Anthology Reference Corpus: A Reference Dataset for Bibliographic Research in Computational Linguistics. In Proc. of Language Resources and Evaluation Conference (LREC 08). Marrakesh, Morocco, May.
[ .pdf pre-print ]
[ Slides (.htm) ]
Group Members
- Min-Yen Kan - Project leader, National University of Singapore
- Steven Bird, University of Melbourne
- Robert Dale, Macquarie University
- Bonnie Dorr, University of Maryland
- Bryan Gibson, University of Michigan
- Mark Joseph, University of Michigan
- Dongwon Lee, Pennsylvania State University
- Brett Powley, Macquarie University
- Dragomir Radev, University of Michigan
- Yee Fan Tan, National University of Singapore
Tools and Related Links
Links to information about the corpus itself, tools for processing it,
and related tools and sites for bibliographic research.
- The corpus description and ordering information at LDC. Note that the corpus is also available free of charge from this website.
- ACL Anthology Network: A parallel initiative at the University of Michigan to construct a social network graph of researchers in computational linguistics.
- CLAIRLIB: A set of modules for NLP, including a set of tools for bibliometric analysis.
- ACL Anthology: The current version of the ACL Anthology, from which the ACL ARC is derived.
- ParsCit: A tool to automatically perform reference string parsing as well as logical document structure recovery. Created by the folks at WING of National University of Singapore (NUS).
Acknowledgments
Our efforts have been supported by the grassroots initiative call
made by the ACL Exec at the 2007 ACL annual meeting in Prague. We
would like to acknowledge the support of the ACL Exec in encouraging
this form of collaboration.
Thanks also go to Behrang Qasemizadeh, PhD student in the Unit for
Natural Language Processing, Digital Enterprise Research Institute of
the National University of Ireland, Galway (funded by Science
Foundation Ireland), for his work on the SEPID ARC format, and to Martin
Helmout of Southampton for his work on proof-checking the XML files and
their schema.
Min-Yen Kan <kanmy@comp.nus.edu.sg>
Created on: Wed May 5 16:07:15 2004 | Version: 1.0 | Last modified: Sat Mar 29 00:26:41 2008