ACL Anthology Reference Corpus

This is the home page of the ACL Anthology Reference Corpus, a corpus of scholarly publications about Computational Linguistics. The corpus has two versions; both are canonicalized subsets of the ACL Anthology. The newer version includes all ACL Anthology files up to December 2015 whose copyright belongs to the ACL (excluding COLING, LREC, etc.), for a total of 22,878 articles. We hope this frozen corpus will be used to benchmark applications for scholarly and bibliometric data processing.


Our efforts have been supported by the call for grassroots initiatives made by the ACL Exec at the 2007 ACL annual meeting in Prague. We would like to acknowledge the support of the ACL Exec in encouraging this form of collaboration.

Thanks also go to Behrang Qasemizadeh, PhD student in the Unit for Natural Language Processing, Digital Enterprise Research Institute of the National University of Ireland, Galway (funded by Science Foundation Ireland), for his work on the SEPID ARC format, and to Martin Helmout of Southampton for his work on proof-checking the XML files and their schema.