There is no need for stage setting in this post. The problem presented here is experienced by everyone to some extent, especially those in information management at Fortune 500-sized organizations. There is no need to intimidate with consultant statistics, estimates of the associated fees, or theories about the (wo)man-hours sunk into it. The Three V's of big data (Volume, Variety, and Velocity) have outpaced human, manual ability to keep data in order (no, I will not include Veracity; stop trying to make V words fit). Organizations are being forced to automate the processes around retaining, categorizing, and searching for data just to keep up.
Two primary methods are emerging from the enterprise scramble for automation: machine learning and policy engines. These methods surfaced several years ago and have since made great strides in reducing the human workload. But let me be very clear: the day when these strategies need no human oversight or interaction may still be years away. Since we don't have time to wait for that, let's dive into what we have now.
Gaining popularity in the review stages of litigation, machine learning, or technology-assisted review (TAR), is designed to take a sample set of manually reviewed documents and use it to discover relevant information in the total set. This is greatly beneficial for sifting out the "garbage" documents to meet tight deadlines, or even for cleaning out file shares to some extent. Of course, being in the early stages of its life, machine learning does have some drawbacks.
- Sample set: The sample review set is critical to success, not only in its size, which must be large enough to guarantee a certain percentage of accuracy, but also in the caliber of the review work performed on it. A miscalculation in either could result in missing key information.
- Learning duration: The "lessons" learned by the machine are localized to a single use case. For each new use case, the machine must be retaught.
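To make the sample-size point above concrete, one common way to estimate how many documents must be manually reviewed for a target confidence level and margin of error is Cochran's formula with a finite-population correction. This is a general statistical sketch, not the method of any particular TAR product, and the corpus size and targets below are illustrative assumptions:

```python
import math

# Standard normal z-scores for common confidence levels
Z_SCORES = {0.90: 1.645, 0.95: 1.960, 0.99: 2.576}

def sample_size(population, confidence=0.95, margin=0.05, p=0.5):
    """Cochran's formula with finite-population correction.

    p=0.5 is the most conservative assumption about the proportion
    of relevant documents (it maximizes the required sample).
    """
    z = Z_SCORES[confidence]
    n0 = (z ** 2) * p * (1 - p) / margin ** 2  # infinite-population estimate
    n = n0 / (1 + (n0 - 1) / population)       # shrink for a finite corpus
    return math.ceil(n)

# e.g., a hypothetical 1,000,000-document corpus at 95% confidence, +/-2% margin
print(sample_size(1_000_000, confidence=0.95, margin=0.02))  # → 2396
```

Note the asymmetry this exposes: a few thousand well-reviewed documents can represent a million-document corpus, which is exactly why the caliber of that small sample matters so much.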
In contrast to machine learning, policy engines are gaining popularity for categorizing information at the moment of creation. A policy engine essentially applies a rule set to automatically organize information. This streamlines search and ensures proper retention, instead of relying on end users, who tend to be the bane of the information manager's work life. Policy-engine-based automation also comes with a few caveats.
- Up-front work: For a policy engine to work optimally, a policy taxonomy must be worked out beforehand. This will also be an iterative process: regulations, corporate structure, and data sources change, which may force the organization to add or remove rules from the engine.
- Metadata-only rule criteria: Some policy engines lack the processing power or scaling ability to categorize based on content, relying on metadata alone. Content-level granularity is key when dealing with PII and other sensitive information.
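A rule set of the kind described above can be sketched in a few lines. This is a minimal, hypothetical illustration, not any vendor's engine; the rule fields, category names, and retention periods are invented for the example. It also shows the metadata-versus-content distinction from the caveat list: one rule inspects document text, the other only metadata.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Rule:
    """One policy rule: match on metadata and/or content, assign a category."""
    name: str
    category: str
    retention_years: int
    metadata: dict = field(default_factory=dict)  # exact-match metadata criteria
    content_pattern: str = ""                     # optional regex over document text

    def matches(self, doc):
        if any(doc.get("meta", {}).get(k) != v for k, v in self.metadata.items()):
            return False
        if self.content_pattern and not re.search(self.content_pattern, doc.get("text", "")):
            return False
        return True

def categorize(doc, rules, default=("uncategorized", 1)):
    """Apply the first matching rule; rule order encodes priority."""
    for rule in rules:
        if rule.matches(doc):
            return rule.category, rule.retention_years
    return default

# Hypothetical rule set: the content-based PII rule is listed first so it
# outranks the broader metadata-only rule.
RULES = [
    Rule("ssn", "pii", 7, content_pattern=r"\b\d{3}-\d{2}-\d{4}\b"),
    Rule("hr-docs", "personnel", 10, metadata={"department": "HR"}),
]

doc = {"meta": {"department": "HR"}, "text": "Employee SSN: 123-45-6789"}
print(categorize(doc, RULES))  # → ('pii', 7)
```

The iteration caveat shows up naturally here: when a regulation changes, you add or reorder `Rule` entries rather than retraining anything, which is the flexibility being weighed against machine learning in the rest of this post.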
As both technologies mature, they will begin to encroach on each other's territory, and many systems will likely adopt hybrid strategies. But for those who can't wait, deciding which is better for your enterprise comes down to company culture and industry. Fast-changing, heavily regulated environments may need to change rules on the fly without throwing out the entire playbook, while more static environments may have the time to retrain their systems. Either way, be prepared for a good amount of human intervention for the foreseeable future.