The data lake: we’ve all heard of it, but do we know exactly what it is? In our experience, people recognize that it’s an exciting concept, and they can explain that it’s supposed to be a central repository for all enterprise data, one that companies can then analyze to gain valuable business insights. But when it comes to the details of the lake, what people know for sure gets shakier, and the line between truths, half-truths, and myths blurs quickly.
Hence, the need for a practitioner’s guide to the data lake.
The data lake as a concept is easy enough to understand in general terms, but things get more complicated when we talk about what makes a good one, or even how to build one. So first, and most importantly, let’s iron out what exactly a data lake should be.
- A single repository that can handle all enterprise data, both structured and unstructured
For a data lake to be effective, it needs to hold all organizational data: Word documents, email metadata, transaction data, and everything in between.
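To make that concrete, here’s a minimal sketch of how a single repository can accept wildly different payloads: each item is wrapped in a uniform envelope with a little metadata, so a document, an email header, and a transaction row can all land in the same store. The function and field names are illustrative assumptions, not any particular vendor’s API.

```python
import time

def ingest(raw_bytes: bytes, source: str, content_type: str) -> dict:
    """Wrap any payload (Word doc, email metadata, transaction row)
    in a uniform envelope so one repository can hold them all.

    All names here are hypothetical, for illustration only.
    """
    return {
        "source": source,                # originating system, e.g. "crm"
        "content_type": content_type,    # e.g. "docx", "email-meta", "txn"
        "ingested_at": time.time(),      # arrival timestamp
        "payload": raw_bytes.hex(),      # opaque bytes, stored uniformly
    }
```

The point of the envelope is that downstream tools can route and filter on the metadata without needing to understand every payload format up front.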
- A single governance platform where data is managed to address any risk concerns
Having a lake is the easy part. The hard part is managing it, and in particular mitigating risk and privacy issues. Making sure your governance platform can handle and protect the data lake, down to the individual document level, is therefore essential. Granular access control settings are needed to keep sensitive data from falling into the wrong hands.
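What “granular, document-level access control” might look like can be sketched in a few lines. This is a simplified model under assumed names (real governance platforms are far richer): each document carries its own set of allowed roles, and access is checked per document rather than per bucket or per table.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """A lake item with its own per-document access list (illustrative model)."""
    doc_id: str
    sensitivity: str                       # e.g. "public", "internal", "restricted"
    allowed_roles: set = field(default_factory=set)

def can_read(user_roles: set, doc: Document) -> bool:
    """Grant access only if the user holds at least one role listed on the document."""
    return bool(user_roles & doc.allowed_roles)
```

Because the access list lives on the document itself, a spreadsheet of salaries and a public press release can sit in the same lake with very different exposure.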
- A duplicate-free and bias-free foundation for analytics
Before you can glean actionable business insights through analytics, you need a clean data set: one that’s free of duplicate content and sampling bias. Without this, all the work that went into creating, maintaining, and managing the data lake is for naught.
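The duplicate-content half of that requirement is the more mechanical one. A minimal sketch (assuming records are plain strings; real pipelines would normalize and fingerprint more carefully) is to hash each record’s content and keep only the first occurrence of each digest:

```python
import hashlib

def dedupe_records(records: list) -> list:
    """Return the records with exact-content duplicates removed,
    keeping the first occurrence of each. Illustrative only: real
    deduplication would normalize content before hashing."""
    seen = set()
    unique = []
    for rec in records:
        digest = hashlib.sha256(rec.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(rec)
    return unique
```

Sampling bias is harder to automate away: it requires knowing how the data was collected, not just what it contains, which is exactly why dedup alone doesn’t make a data set analytics-ready.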
So in the end, the data lake should be an extremely useful data-handling approach for any company: an advanced method for storing data and turning it into a valuable resource. However, there’s a catch. A poorly constructed and managed data lake may produce “insights” that are incorrect and misleading. A good lake is purposely constructed, controlled, managed, and cleaned.
In reality, the ideal lake is more like a manmade data reservoir, with a monitored flow inwards and a dam for regulating outflow. But that doesn’t have the same fun ring to it, now does it?