ACL Anthology Reference Corpus


Introduction

ACL Anthology Reference Corpus, Linguistic Data Consortium (LDC) catalog number LDC2009T29 and ISBN 1-58563-531-6, is a digital archive of 10,291 research papers in computational linguistics sponsored by the Association for Computational Linguistics (ACL). Also available from the ACL, this release contains most of the papers that appear up to February 2007 in the web-based ACL Anthology, a dynamic repository that currently hosts over 16,500 articles drawn from a range of conferences and workshops as well as past issues of the Computational Linguistics journal. The ACL Anthology Reference Corpus is designed to be a standard, real-world digital collection testbed for experiments in bibliographic and bibliometric research.

The ACL is the international scientific and professional society for scholars working on problems involving natural language and computation. Membership includes the ACL quarterly journal, Computational Linguistics; reduced registration at most ACL-sponsored conferences; discounts on ACL-sponsored publications; and participation in ACL Special Interest Groups. Since 1988, Computational Linguistics has been the primary forum for research on computational linguistics and natural language processing.

Data

The material in the ACL Anthology Reference Corpus was scanned at 600 dpi grayscale for archival storage, down-sampled to 300 dpi black-and-white, assembled into articles and stored in the ‘PDF Image with Hidden Text’ format. Author and title metadata was extracted from the OCRed text and used to build HTML index pages. Older materials, such as conference proceedings from the 1960s and early volumes of Computational Linguistics, were manually digitized from microfiche slides.
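The down-sampling step described above (600 dpi grayscale to 300 dpi black-and-white) can be illustrated with a minimal sketch. This is not the tooling used to build the corpus, which this document does not describe. The 2x2 block averaging, the fixed threshold of 128, and the function names downsample_2x, binarize and archive_to_delivery are all assumptions made for illustration.

```python
from typing import List

Gray = List[List[int]]  # rows of grayscale values in the range 0-255


def downsample_2x(image: Gray) -> Gray:
    """Halve the resolution (e.g. 600 dpi -> 300 dpi) by averaging 2x2 blocks.

    Block averaging is an assumed method; a trailing odd row or column is dropped.
    """
    height = len(image) - len(image) % 2
    width = (len(image[0]) - len(image[0]) % 2) if image else 0
    result = []
    for y in range(0, height, 2):
        row = []
        for x in range(0, width, 2):
            total = (image[y][x] + image[y][x + 1]
                     + image[y + 1][x] + image[y + 1][x + 1])
            row.append(total // 4)
        result.append(row)
    return result


def binarize(image: Gray, threshold: int = 128) -> List[List[int]]:
    """Map each pixel to 0 (black) or 1 (white) using a fixed, assumed threshold."""
    return [[1 if pixel >= threshold else 0 for pixel in row] for row in image]


def archive_to_delivery(image: Gray, threshold: int = 128) -> List[List[int]]:
    """Hypothetical pipeline: grayscale archival scan -> half-resolution bitonal page."""
    return binarize(downsample_2x(image), threshold)
```

A real pipeline would more likely use an imaging library and an adaptive threshold; the pure-Python version only makes the two steps explicit.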

The ACL Anthology Reference Corpus includes:

Please see file.tbl for the directory structure of this publication, as well as a complete list of files.

Please go to data for a listing of data files.

Other documentation files are:

Updates

Additional information, updates, and bug fixes may be available in the LDC catalog entry for this corpus, LDC2009T29.

Licensing

This corpus is made available to all users (LDC members and nonmembers) under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 license.

Content Copyright

Portions © 1963-2006 Association for Computational Linguistics, © 2009 Trustees of the University of Pennsylvania


Contact: ldc@ldc.upenn.edu
© 2009 Linguistic Data Consortium, Trustees of the University of Pennsylvania. All Rights Reserved.