Published by: ZL Tech
Leading eDiscovery Standards Organization and Enterprise Vendor Unite to Launch Standard Data Set for eDiscovery and Email Research
SAN JOSE, CA – Nov. 15, 2010 – The Electronic Discovery Reference Model (EDRM) and ZL Technologies, Inc. today announced the launch of Version 2 of the EDRM Enron Email Data Set. This version offers the largest and richest set of publicly available, general-purpose corporate email to date. Research into e-discovery and information retrieval requires such data sets to test, develop and refine new capabilities and approaches; however, their availability is limited due to the competitive and private nature of enterprise communications. To facilitate advancements in commercial e-discovery and academic research, the EDRM data sets are provided free of charge and build on industry best practices. For Version 2 of the data set, EDRM and ZL Technologies are pleased to announce collaboration with the Text REtrieval Conference (TREC) Legal Track project, which is using the data set in its research.
The new data set includes several improvements over the previous version, including direct input from various research communities. Some highlights include:
- Larger Data Set: Inclusion of 1,227,255 emails with 493,384 attachments covering 151 custodians.
- Rich Metadata: Threading information, tracking IDs, and general Internet headers are included.
- Multiple Email Formats: The new data set provides both full and de-duplicated email in PST, MIME and EDRM XML, which allows organizations to test and compare results across formats.
- Attached Files: Attachments to email messages are included in Version 2, as they were in Version 1.
Public data sets allow organizations to perform better research by comparing test results and reproducing third-party tests. This enables organizations to build on their own past work and that of others. Additionally, the use of a common data set facilitates the creation of benchmarks against which to judge competing applications and workflows. It is in this manner that Version 2 of the EDRM Enron Email Data Set will contribute to the information management and e-discovery community.
This data set was produced with input from the Text REtrieval Conference (TREC) Legal Track project, which uses the EDRM data set for its research. Doug Oard, Professor at the University of Maryland and TREC Legal Track Coordinator, remarked, “We are delighted to have had this opportunity to collaborate with EDRM. One of our goals in the TREC Legal Track is to produce test collections that are of lasting value to the e-discovery community, which is well in line with those of the EDRM Data Set Project. EDRM provides us with the opportunity to leverage current commercial best practice for collection processing, thus significantly facilitating the research of both commercial and academic research teams.”
The EDRM Data Set Project has taken on a growing role in the practice of e-discovery. Littler Mendelson, P.C., an AMLAW 100 law firm, uses the EDRM Data Sets to meet several needs. “We regularly evaluate new e-discovery technologies, and thus require a large, rich data set for our tests that does not include client information,” said Michael McGuire, Shareholder and eDiscovery counsel for Littler. “The EDRM Data Sets allow us to direct vendors to download the data directly from EDRM.net and set up the test systems with data that we are familiar with, understand, and do not have a responsibility to hold in confidence. We also use the EDRM Data Sets for training and demonstrations of our tools and processes. We look forward to using this new data set.”
“With delivery of this data set, the EDRM Data Set Project has achieved and surpassed its initial goals,” said John Wang, Project Lead for the EDRM Data Set Project and Product Manager at ZL Technologies, Inc. “We have been delighted with the adoption of the data sets already released as well as the interest in this data set. For this release, ZL Technologies worked closely with EDRM and TREC Legal Track to process the data using ZL Unified Archive®, which ensured the data was accurate, rich, and available in industry standard EDRM XML, PST, and MIME formats. We look forward to continuing to facilitate and refine the practice of e-discovery.”