
The Enron Corpus: Data Management’s Laboratory

You’re probably familiar with Enron, the energy services company that underwent a massive investigation in the early 2000s for wide-scale fraud. What you might not be familiar with is a curious byproduct of the case: a collection of hundreds of thousands of Enron emails and documents that entered the public domain because of their role in the investigation. This collection, often called the “Enron Corpus,” is immensely valuable both for academics and for enterprise software developers who want to simulate an enterprise-scale communication environment. You might be surprised to know that ZL not only uses the Enron Corpus to test our software, we also helped assemble it!

Origins of the Corpus

The Corpus was originally created in 2003, when FERC released a large data set of Enron documents into the public domain. MIT researcher Leslie Kaelbling purchased the raw files from a government contractor (the format in which FERC released the emails was unusable), deduplicated them, mapped their structure, and made them available to the public in 2004. After processing, about 200,000 emails remained.
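The deduplication step mentioned above can be pictured with a small sketch. The method used on the original data set is not described here, so everything in this example is an assumption: the function names (fingerprint, deduplicate) are hypothetical, and treating two emails as duplicates when their sender, date, subject, and body match is just one plausible rule.

```python
import hashlib
from email import message_from_string
from email.message import Message


def fingerprint(msg: Message) -> str:
    """Hash the fields assumed to identify an email's content.

    Assumption: sender, date, subject, and body together are enough to
    recognize copies of the same email stored in different folders.
    """
    payload = msg.get_payload()
    # Multipart messages return a list; this sketch only hashes plain bodies.
    body = payload if isinstance(payload, str) else ""
    parts = [
        str(msg.get("From") or "").strip().lower(),
        str(msg.get("Date") or "").strip(),
        str(msg.get("Subject") or "").strip(),
        body.strip(),
    ]
    joined = "\x00".join(parts)
    return hashlib.sha256(joined.encode("utf-8", errors="replace")).hexdigest()


def deduplicate(messages):
    """Return the messages in order, keeping the first copy of each one."""
    seen = set()
    unique = []
    for msg in messages:
        key = fingerprint(msg)
        if key not in seen:
            seen.add(key)
            unique.append(msg)
    return unique
```

Headers that differ between copies, such as a folder label, are ignored on purpose, so the same email filed in two places collapses to one entry. A production pipeline would likely also need to normalize addresses and handle attachments.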

In 2010, ZL collaborated with Duke Law’s EDRM project to release a new data set. The new version had more emails, preserved metadata, and included attachments. It was also made available in PST (Outlook/Exchange), MIME, and EDRM XML formats, which allowed organizations to use a testing environment that was similar to the one they might use internally.

An Enterprise-Sized Testing Ground

So what is the Enron Corpus good for? Broadly speaking, it has two uses. The first is that it provides insight into the email communication habits of an entire corporation, creating a huge sample that’s valuable for researchers studying human interaction. The second is that companies developing information governance software, like ZL, can use it as a testing ground for new software. It’s surprisingly hard to find something that works for these purposes: email databases of this size are all but impossible to acquire elsewhere, and creating a fictional one in-house would take too long and would not represent a realistic email environment.

Despite its benefits, the Enron Corpus does pose some issues for information governance software testers. Doug Austin at JDSupra notes that, because the captured Enron database dates from around 2002, its mailstore files are significantly smaller than those the average modern organization’s eDiscovery process looks through. As a fix, he suggests combining several of these files into one larger file to better simulate how the software might function in a real-life scenario.
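As a rough illustration of the idea of combining several mailstore files into one, here is a minimal sketch using Python's standard mailbox module. It assumes the mail has already been exported to mbox files, since the standard library cannot read PST files directly. The function name combine_mboxes and the file paths are hypothetical, and this is not a description of how any particular vendor builds its test data.

```python
import mailbox


def combine_mboxes(sources, destination):
    """Append every message from each source mbox into one destination mbox.

    Returns the number of messages copied. The destination file is created
    if it does not already exist; existing messages in it are kept.
    """
    copied = 0
    combined = mailbox.mbox(str(destination))
    combined.lock()  # guard against other programs writing at the same time
    try:
        for source in sources:
            # create=False makes a missing source file an error
            # instead of silently creating an empty mailbox.
            src = mailbox.mbox(str(source), create=False)
            try:
                for message in src:
                    combined.add(message)
                    copied += 1
            finally:
                src.close()
        combined.flush()  # write the appended messages to disk
    finally:
        combined.unlock()
        combined.close()
    return copied
```

Calling combine_mboxes(["part1.mbox", "part2.mbox"], "combined.mbox") would then produce a single, larger mailbox to point the software under test at. Repeating the same sources several times is one simple way to inflate the size further, at the cost of introducing duplicate messages.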

Conclusion

Perhaps the oddest byproduct of the Enron debacle, the Enron Corpus now provides software developers with a testing ground for eDiscovery and other information governance solutions. While a newer, more modern set of emails may become available in the future, the Corpus is, for the foreseeable future, the laboratory in which the next generation of enterprise-scale software solutions is put to the test.

I'm a Bay Area native who enjoys writing about the endlessly fascinating field of information governance. In my spare time, I enjoy making board games, baking, and attempting to convince everyone I know to watch The Genius.