Data Lakes in Cybersecurity - Balancing Promises with Reality

Read any recent analyst research or tech trends report and you are likely to see a discussion of how data lakes can bring a step change in extracting insights for various enterprise functions, including cybersecurity. An expanding enterprise footprint will continue to generate more and more data for security teams to store, monitor, and analyze, and technologies capable of moving the needle on these projects hold obvious appeal. There is growing hype around data lake technologies solving a number of cybersecurity challenges, but expectations need to be managed on how much of an impact they will actually have on your security strategy.

The Power of Data Lakes

Enterprises wanting to stay ahead of the fast-changing cybersecurity landscape have recognized they need to invest in data-centric technologies. Continuously evolving attack vectors require increased data collection, longer data retention from a growing number of sources, and faster processing to surface insights sooner. Then comes the hard part - putting those insights to work in your detection, prevention, and investigative workflows. The traditional process of extracting value from data requires heavy upfront investment - identifying all sources of data, understanding their structure, and storing it in a schema the eventual processing engine can work with. Enterprises get pulled into long requirements and architecture discussions and wait to move forward until all the unknowns are ironed out. The ability to reduce this upfront investment is the main appeal of data lakes - if done right (a big if, and we'll get into the pitfalls to look out for later) they offer a way to take that first step faster while minimizing the risk of a huge time and cost overhead. Let's look at the top reasons data lakes offer cybersecurity leaders a faster path to business value.

  1. Centralized repository with reduced data structure overhead. Purely from a capability perspective, data lakes enable cybersecurity teams to store structured and unstructured data as-is, at any scale and in any format. Now you can store relational and non-relational log data not only from your security tools, but also from infrastructure assets, productivity apps like video conferencing, and partner teams like fraud, physical security, and compliance - all in one place.
  2. Schema-on-read offers flexibility to deploy new use cases. Storing data with minimal upfront design allows schemas to be applied after insights are generated and use cases solidify. Connecting the dots between data pools from various teams to improve your cybersecurity posture doesn't require outlining all the use cases ahead of time. When a new analytic model needs to be tested and validated, such as detecting attacker tactics that exploit telework and remote collaboration tools, you are not forced to redesign your data schema. Data lake technologies allow you to deploy and execute new analytic workloads without significantly refactoring how data is ingested and stored (see the sketch after this list).
  3. Cloud native deployments with reduced operational overhead. Data lakes have come of age alongside the standardization of cloud native architectures and cloud computing capabilities. This is a big difference from the promise of the data revolutions of the past. Previously, enterprises had to hire teams to build enormous infrastructure that could collect and process large volumes of data, and then tear it down and rebuild it whenever they encountered new data schemas. That is a huge distraction for enterprises that want to focus on extracting business value from their data. Cloud native infrastructure-as-a-service models take away much of this complexity and give you the flexibility to respond as new business needs arise in an evolving cyberattack landscape.
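
To make the schema-on-read idea concrete, here is a minimal sketch using only the Python standard library. The file layout, field names, and the failed-login heuristic are all hypothetical, chosen purely for illustration: raw events from different tools land in the lake untouched, and a parsing step is defined only later, when a new detection use case is ready to run.

    import json
    from pathlib import Path

    # --- Ingest: land raw events in the lake as-is, no schema enforced up front ---
    lake = Path("security-lake/raw")   # hypothetical local stand-in for object storage
    lake.mkdir(parents=True, exist_ok=True)

    # Two sources with different shapes: a VPN appliance and a video-conferencing app.
    events = [
        {"src": "vpn",  "user": "alice", "ip": "203.0.113.7", "result": "failure"},
        {"src": "conf", "participant": "bob", "meeting_id": "m-42", "geo": "US"},
    ]
    with (lake / "events.jsonl").open("w") as f:
        for e in events:
            f.write(json.dumps(e) + "\n")

    # --- Read: apply a schema only when a new use case arrives ---
    def vpn_login_failures(path: Path):
        """Schema-on-read: interpret only the fields this detection needs."""
        for line in path.read_text().splitlines():
            record = json.loads(line)
            if record.get("src") == "vpn" and record.get("result") == "failure":
                yield {"user": record["user"], "ip": record["ip"]}

    for hit in vpn_login_failures(lake / "events.jsonl"):
        print("failed VPN login:", hit)

In practice the read-time step would be a query engine or analytics job running over object storage rather than a Python loop, but the principle is the same: the storage layer stays format-agnostic, and the schema lives with the workload that needs it.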

Managing Expectations

It’s important for security leaders to recognize that although the chasm between the promise and the practical impact of data lakes is closing fast, there are implementation and integration challenges that can prevent a data lake from becoming anything more than an innovation project.

First, building out and maintaining data collection infrastructure for a diverse set of data sources is complex. Forgoing the complexities of a schema during collection and storage doesn’t mean you can skip applying some level of organization and structure to the data. Without some upfront planning it can be difficult to figure out how to make use of the data and deliver insights to the right teams, and your data lake effectively becomes an unused data store.
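
One lightweight way to impose that structure is a partitioning convention agreed on at ingest time. The layout below is a hypothetical example rather than a standard: keying stored objects by source and date keeps raw data unschematized but still easy to locate, and lets later workloads prune what they scan.

    from datetime import datetime, timezone
    from pathlib import Path

    def partition_path(root: str, source: str, event_time: datetime) -> Path:
        # Hypothetical layout: <root>/source=<tool>/dt=<YYYY-MM-DD>
        return Path(root) / f"source={source}" / f"dt={event_time.date().isoformat()}"

    print(partition_path("security-lake/raw", "edr", datetime.now(timezone.utc)))
    # e.g. security-lake/raw/source=edr/dt=2024-05-01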

Second, it’s a non-trivial effort to integrate new data workloads that cut across a complex technology stack, numerous third-party applications, and existing processes. The current era of ecosystems requires data lakes to be easy to integrate into existing architectures, supporting data collection from a wide range of sources and insight delivery to a variety of applications.

Third, specialized engineering expertise is still required to design, build, and maintain the analytic workloads that extract insights from data. Ensuring access and governance policies are built into these analytic workloads is essential, yet most cybersecurity teams don’t have ready access to this talent.

Taking the First Step

Given the explosion of data that needs to be processed to keep up with the rapidly changing cyber landscape, it’s imperative that cybersecurity leaders strongly consider how to use the power of data lakes. That power can be harnessed successfully, although expectations need to be managed around all the hype. Whether enterprises want to build their own capabilities or invest in a product that accelerates the journey, they should consider the following:

  1. Ease of collecting data from various internal and external sources, and updating the collection framework with new requirements.
  2. Ability to organize and structure the collected data so that your lake doesn’t turn into a data swamp.
  3. Availability of technical expertise to build and monitor analytic workflows, while maintaining data governance controls.
  4. Integrations with applications where insights will be delivered and eventually utilized by your analysts.
  5. Ease of updating existing use cases and implementing new ones in production as new attack vectors and threat information become available.

Cybersecurity leaders need to optimize for flexibility not only within their technology stack but also within their business processes. Thinking holistically about business outcomes and charting a roadmap that allows you to take small steps with product vendors will be critical to realizing the potential of data lakes.