Will Styler

Associate Teaching Professor of Linguistics at UC San Diego

Director of UCSD's Computational Social Science Program

The EnronSent Corpus

The EnronSent corpus is a special preparation of a portion of the Enron Email Dataset, designed specifically for use in corpus linguistics and language analysis. Divided across 45 plain text files, the corpus contains 2,205,910 lines and 13,810,266 words.

This preparation was created by cleaning up a portion of the original Enron Corpus. It contains 96,107 messages from the “Sent Mail” directories of all the users in the corpus. It has been cleaned specifically for use with conventional corpus linguistics tools (such as grep and Python), and an attempt has been made to remove as much non-human-generated text as possible from the raw messages in the original data. For more history on the original dataset, please see the homepage for the Enron Email Dataset.
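As a rough illustration of working with the plain text files in Python, the sketch below counts lines and whitespace-separated words and builds a simple word frequency table. The directory argument and the `enronsent*` file name pattern are assumptions for illustration, not something this page specifies, so adjust them to match your download. A whitespace-based word count may also differ slightly from the figures quoted above, depending on how those were computed.

```python
import re
from collections import Counter
from pathlib import Path


def iter_lines(corpus_dir, pattern="enronsent*"):
    """Yield every line of every corpus file, in sorted file order.

    The glob pattern is an assumption about how the files are named;
    adjust it to match the files you actually downloaded.
    """
    for path in sorted(Path(corpus_dir).glob(pattern)):
        if path.is_file():
            # errors="replace" keeps the loop going on stray non-UTF-8 bytes.
            with path.open(encoding="utf-8", errors="replace") as handle:
                yield from handle


def corpus_size(corpus_dir, pattern="enronsent*"):
    """Return (line_count, word_count), counting whitespace-separated tokens."""
    line_count = 0
    word_count = 0
    for line in iter_lines(corpus_dir, pattern):
        line_count += 1
        word_count += len(line.split())
    return line_count, word_count


def word_frequencies(corpus_dir, pattern="enronsent*"):
    """Return a Counter of lowercased word forms (letters and apostrophes)."""
    token = re.compile(r"[a-z']+")
    counts = Counter()
    for line in iter_lines(corpus_dir, pattern):
        counts.update(token.findall(line.lower()))
    return counts
```

Calling `corpus_size("path/to/enronsent")` on the unpacked corpus directory (a hypothetical path) returns the line and word totals, and `word_frequencies(...).most_common(20)` lists the twenty most frequent word forms. Reading line by line keeps memory use low even though the corpus runs to millions of lines.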

Please see the included README file for more information about this data. For a more detailed explanation of the preparation of the corpus, please read University of Colorado Institute of Cognitive Science Technical Report 01-2011.

Citing the EnronSent Corpus

Styler, Will (2011). The EnronSent Corpus. Technical Report 01-2011, University of Colorado at Boulder Institute of Cognitive Science, Boulder, CO.

You can also use Google Scholar to see who has used and cited EnronSent previously.

Download the EnronSent Corpus

Privacy Concerns

This preparation and all corpus data are in the public domain. All messages in the Enron corpus were placed in the public domain in 2003 by the United States Federal Energy Regulatory Commission during its investigation of Enron. The messages in the source data represent all of the email in the Enron Corporation’s database, and not just those of the investigated individuals. Although many of the concerned individuals have already had their messages removed from the source data, it is important to remember that the vast majority of the people whose messages are in this corpus were likely not directly involved in the investigation and wrongdoing at Enron. Please keep the privacy of these individuals in mind as you work with and publish data from this corpus.

Questions, comments or concerns?

The EnronSent Corpus does not include the identity of the sender or recipient of individual emails, nor any strong delineation between messages. The corpus is meant for linguists looking at patterns in language, for whom this information is not necessary, and the creator does not have a version that includes it. If you need individual emails or sender/recipient identity, you’ll want to re-process the Enron Email Dataset yourself according to your needs.
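For readers who do need per-message metadata, the following is a minimal sketch of one way to start re-processing the raw data with Python's standard `email` module. It is not the procedure used to build EnronSent. It assumes the raw dataset is unpacked as a directory tree with one message per file and that sent mail sits in folders named `sent`, `sent_items`, or `_sent_mail`; both the layout and the folder names are assumptions to check against your own copy.

```python
import email
from pathlib import Path

# Folder names assumed to hold sent mail in the raw dataset; verify against your copy.
SENT_FOLDERS = {"sent", "sent_items", "_sent_mail"}


def read_message(path):
    """Parse one raw message file into a dict of sender, recipients, date, and body."""
    with open(path, "rb") as handle:
        msg = email.message_from_binary_file(handle)
    # Multipart messages are skipped here for simplicity; only plain bodies are kept.
    body = "" if msg.is_multipart() else msg.get_payload()
    return {
        "sender": msg.get("From", ""),
        "recipients": msg.get("To", ""),
        "date": msg.get("Date", ""),
        "body": body,
    }


def iter_sent_messages(maildir_root):
    """Yield parsed messages from every assumed sent-mail folder under maildir_root."""
    root = Path(maildir_root)
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        # Only look at folder names below the root, not at the file name itself.
        folders = path.relative_to(root).parent.parts
        if SENT_FOLDERS.intersection(folders):
            yield read_message(path)
```

Each yielded dictionary keeps one message separate from the next, which restores the delineation that EnronSent deliberately drops. Quoted replies, forwarded text, and signatures are left in the body, so further cleaning would still be needed for linguistic work.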

Contact Will Styler, will (at) savethevowels (dot) org