
Diving into the Benefits of Data Lakes

May 19, 2023
A well-designed data architecture can help turn data into actionable insights.

In the early 2000s, manufacturers asked, “How can we collect more data?” The industry resoundingly responded. Today, manufacturers collect unprecedented amounts of data, leading many to recall the old saying, “be careful what you wish for,” because they now ask a new question: “How can I make sense of all this data?” After all, data is valuable only insofar as it produces insights that drive productivity and profit.

IoT devices are driving this flood of data. According to a recent study by Juniper Research, there will be 83 billion connected IoT devices by 2024, and 70% of these will be in the industrial sector. Manufacturers have quickly realized that their legacy data solutions aren’t ready to handle the exponentially greater amounts of data coming off the factory floor and through various enterprise software solutions.

Manufacturers are challenged to efficiently integrate data from disparate sources, analyze that data for production insights and then store it for future use and compliance.

Examples of these challenges include:

1. Creating traceability reports for a given work order or batch number, including data such as cleaning logs, setup logs, raw materials used, all parameters captured during the job run, operators, lab data, etc.

2. Reconciling raw material inventory with estimated consumption in recipes. Even though manufacturers capture all raw-material-dispensing data, many continue to backflush raw-material inventory, resulting in a process that looks more like guesswork than science and leads to overspending on materials.

3. Meaningfully analyzing sensor data for predictive maintenance insights and running the machine learning algorithms needed to spot the data trends that lead to a malfunction before it happens.

The stakes are high, as many manufacturers have made significant investments in smart manufacturing and need to see a return on their investment. If they don’t, they risk being worse off than before they started their digital initiatives, as they have the same level of insight yet have spent more money and created more work.

Many manufacturers have brilliant data sources, but they are isolated from one another. A well-designed data architecture can help solve this by connecting disparate data sources, contextualizing them and using AI/ML-based analysis techniques to surface meaningful insights that were previously too difficult to see.

How Data Lakes Work

Data lakes are centralized systems that store, process and secure large amounts of structured, semi-structured, and unstructured data. They solve a core manufacturing challenge: how to make sense of data in real time and far into the future. They have two sides: the data-in side, where raw data is ingested, and the data-out side, where the processed data lives for analysis.

On the data-in side, data lakes gather unstructured data from across relevant IT and OT systems. Everything from factory floor sensor data to CRM data can come into a data lake in its original format.
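To make this concrete, here is a minimal sketch in Python of what the data-in side can look like: raw payloads are landed in their original format in a date-partitioned raw zone. The directory layout, source names and payload shapes are illustrative assumptions, not a reference to any specific product.

```python
# Minimal sketch of the data-in side: raw payloads are landed exactly as
# received, in a date-partitioned "raw zone". Paths, source names and the
# payload shapes below are illustrative assumptions.
import json
import pathlib
from datetime import datetime, timezone

RAW_ZONE = pathlib.Path("datalake/raw")  # hypothetical landing area

def land_raw_event(source: str, payload: dict) -> pathlib.Path:
    """Write one raw event in its original format, partitioned by source and date."""
    now = datetime.now(timezone.utc)
    target_dir = RAW_ZONE / source / now.strftime("%Y/%m/%d")
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{now.strftime('%H%M%S%f')}.json"
    target.write_text(json.dumps(payload))  # no schema applied on the way in
    return target

# Example: a factory-floor sensor reading and a CRM record land side by side.
land_raw_event("plc_line_3", {"sensor": "temp_01", "value_c": 81.4, "ts": "2023-05-19T10:00:00Z"})
land_raw_event("crm", {"account": "ACME", "last_order": "WO-1042"})
```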

The data-out side is the actionable side. Here, the data is stored after processing for analytics. Under the hood of the actionable side are multiple types of databases. For example, a SQL database is used for lower volumes of generally structured data (in the thousands of records, not millions or billions), such as work orders from ERP systems. A timeseries database is used to store the billions or even trillions of contextualized sensor data points that support traceability and compliance. However, timeseries databases are not fast enough for real-time visibility and alerts across the plant floor. For that, in-memory databases store near-present data for super-fast access. Finally, blob storage is used for unstructured/binary data, such as videos, images, and PDFs.
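A rough sketch of how that routing might look in code, with sqlite3 standing in for the SQL store and plain Python structures standing in for the timeseries and in-memory stores; all names here are illustrative.

```python
# Minimal sketch of the data-out side: processed records are routed to the
# store that fits their access pattern. sqlite3 stands in for the SQL database
# and plain Python structures for the timeseries and in-memory stores; a real
# deployment would use dedicated services for each.
import sqlite3

sql_store = sqlite3.connect(":memory:")      # low-volume, structured (e.g., work orders)
sql_store.execute("CREATE TABLE work_orders (id TEXT PRIMARY KEY, product TEXT, qty INTEGER)")

timeseries_store: list[tuple] = []           # stand-in for a timeseries database
hot_cache: dict[str, float] = {}             # stand-in for an in-memory store

def route(record: dict) -> None:
    kind = record["kind"]
    if kind == "work_order":                 # thousands of rows -> SQL
        sql_store.execute("INSERT OR REPLACE INTO work_orders VALUES (?, ?, ?)",
                          (record["id"], record["product"], record["qty"]))
    elif kind == "sensor":                   # billions of rows -> timeseries
        timeseries_store.append((record["ts"], record["tag"], record["value"]))
        hot_cache[record["tag"]] = record["value"]  # latest value for real-time views
    # binary payloads (videos, images, PDFs) would go to blob storage instead

route({"kind": "work_order", "id": "WO-1042", "product": "Widget-A", "qty": 500})
route({"kind": "sensor", "ts": "2023-05-19T10:00:01Z", "tag": "temp_01", "value": 81.7})
```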

All of this is hosted in the cloud, with the scalability to process real-time data and store data for compliance and traceability down the line.

Flexibility is Key

One of the defining features of a data lake, and what distinguishes it from a data warehouse, is that data lakes are not schema-in: they can take in raw data without changing any of the data attributes from the source. Schema is applied only on the actionable side of the data lake, once the data has been processed for analytics, making them schema-out. Data warehouses, on the other hand, are schema-in: the data must be structured before it comes in.

With the schema-out structure of data lakes, manufacturers organize the information they need, when they need it. With a schema-in structure, by contrast, manufacturers must define their data needs before loading the data, and in most cases they end up using only a fraction of it.

Data lakes are also notable for their flexibility. If you need additional information in the future to run a new analysis, you can simply reprocess the raw data in the data lake. With a data warehouse, that raw data is no longer available because the schema was applied on the data-in side; supporting new ways of processing the data means revisiting the schema structure, which can take significant time.
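The difference is easy to see in a sketch. Below, a schema is applied only at read time, so the same untouched raw files (using the hypothetical raw zone from the earlier sketch) can be reprocessed later with a different projection.

```python
# Minimal sketch of schema-on-read: the raw zone stays untouched, and a schema
# is applied only when data is pulled for a specific analysis. File layout and
# field names follow the hypothetical raw zone from the earlier sketch.
import json
import pathlib

RAW_ZONE = pathlib.Path("datalake/raw")

def read_with_schema(source: str, fields: list[str]) -> list[dict]:
    """Project only the fields this analysis needs; ignore everything else."""
    rows = []
    for path in sorted((RAW_ZONE / source).rglob("*.json")):
        record = json.loads(path.read_text())
        rows.append({f: record.get(f) for f in fields})
    return rows

# Today's analysis needs only timestamps and temperature values...
temps = read_with_schema("plc_line_3", ["ts", "value_c"])
# ...a future analysis can reprocess the same raw files with a wider schema.
full = read_with_schema("plc_line_3", ["ts", "sensor", "value_c"])
```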

Solving Data Challenges

Data lakes are helping manufacturers solve concrete problems.

For example, a quality problem, whether it is food contamination or a delivery that doesn’t meet specifications, can mean a dreaded slog through uncontextualized (or even paper) data to find out what happened. Moving to a data lake turns this process into a simple query that can be completed in minutes. Even better, proactively creating traceability reports means that manufacturers can quickly identify the underlying issue and pinpoint just the affected units, keeping product on the shelves and ultimately protecting consumers and the bottom line.
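As an illustration of what that simple query could look like, the sketch below pulls everything tied to a suspect batch in one pass. The tables, columns and batch numbers are hypothetical, and sqlite3 is used only to keep the example self-contained.

```python
# Minimal sketch of a traceability query: given a batch number, one query pulls
# together the records a recall investigation needs. Table and column names are
# hypothetical stand-ins for the lake's SQL layer.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE work_orders        (id TEXT, batch_number TEXT, product TEXT);
CREATE TABLE raw_material_usage (work_order_id TEXT, lot_number TEXT);
CREATE TABLE lab_results        (batch_number TEXT, result TEXT);

INSERT INTO work_orders        VALUES ('WO-1042', 'BATCH-0519', 'Widget-A');
INSERT INTO raw_material_usage VALUES ('WO-1042', 'LOT-77');
INSERT INTO lab_results        VALUES ('BATCH-0519', 'pass');
""")

TRACEABILITY_SQL = """
SELECT wo.id, wo.batch_number, rm.lot_number, lab.result
FROM   work_orders wo
JOIN   raw_material_usage rm ON rm.work_order_id = wo.id
LEFT JOIN lab_results lab    ON lab.batch_number = wo.batch_number
WHERE  wo.batch_number = ?
"""

for row in conn.execute(TRACEABILITY_SQL, ("BATCH-0519",)):
    print(row)  # every work order, lot and lab result tied to the suspect batch
```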

Similarly, with a few IIoT sensors, manufacturers can reconcile raw-material inventory with actual consumption, tracking how material is transformed during the production process. This data can then be correlated in the data lake and fed back into the ERP, minimizing guesswork and approximation.
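A minimal sketch of that reconciliation step, assuming dispensing events land in the lake with a work-order reference; the column names and the ERP hand-off are illustrative.

```python
# Minimal sketch of raw-material reconciliation: actual dispensing events from
# IIoT sensors are summed per work order and compared against the recipe's
# estimated consumption. The resulting variance is what would be pushed back to
# the ERP; column names and quantities are illustrative assumptions.
import pandas as pd

dispensed = pd.DataFrame({
    "work_order": ["WO-1042", "WO-1042", "WO-1043"],
    "material":   ["resin",   "resin",   "resin"],
    "kg":         [120.4,     119.8,     241.0],
})
recipe_estimate = pd.DataFrame({
    "work_order": ["WO-1042", "WO-1043"],
    "material":   ["resin",   "resin"],
    "kg_planned": [250.0,     240.0],
})

actual = dispensed.groupby(["work_order", "material"], as_index=False)["kg"].sum()
reconciled = actual.merge(recipe_estimate, on=["work_order", "material"])
reconciled["variance_kg"] = reconciled["kg"] - reconciled["kg_planned"]
print(reconciled)  # this frame is what would be fed back into the ERP
```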

Finally, the pinnacle of data-driven manufacturing for many is predictive maintenance. While predictive maintenance requires a combination of people, process, and technology expertise, data is the foundation. Without enough data and the right data architecture, it will remain elusive.

Data lakes have the processing power to analyze immense amounts of sensor data in real time and then visualize it to enable pattern recognition. This opens the door to training and testing machine learning models on historical sensor data to identify the precursors to machine failure. As manufacturers and their technology partners fine-tune these models, they become increasingly adept at predicting machine failure far enough in advance to perform the maintenance needed to prevent it.
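A simplified sketch of that modeling step, using synthetic data and scikit-learn's RandomForestClassifier as a stand-in; the features, labels and scoring flow are assumptions, not a prescribed approach.

```python
# Minimal sketch of the modeling step: historical sensor windows labeled with
# whether a failure followed are used to train a classifier that flags
# precursor patterns. Features and labels are synthetic; a production system
# would engineer features from the lake's timeseries store.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))                    # e.g., vibration, temp, current, pressure stats
y = (X[:, 0] + 0.5 * X[:, 1] > 1.2).astype(int)   # synthetic "failure within 48h" label

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))

# At runtime, the latest window from the in-memory store would be scored the
# same way; a high failure probability triggers a maintenance work order.
latest_window = rng.normal(size=(1, 4))
print("failure probability:", model.predict_proba(latest_window)[0, 1])
```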

Kausik “KD” Dasgupta is chief technology officer, FactoryEye.

