Confused about how to solve your data silo problems? Most organizations are.
The term “data silos” simply refers to data isolated in different systems. On its own, having data in different systems isn’t necessarily an issue. But it’s an immense problem when modern organizations want to combine data stored across multiple systems to answer business questions or gain insights from that data.
The struggle to break down siloed systems isn’t new. Businesses have grappled with this pervasive issue for decades, but now tearing down those walls is imperative.
Data silos directly impact an organization’s ability to grow revenue and decrease costs. Not having access to all your data impedes productivity, collaboration, and the ability to have a 360-degree view of your customer. A typical symptom of the problem is poor customer experience.
Look inside any company with siloed systems, and you’ll find siloed knowledge and processes. That results in duplication of effort, inconsistent data, and data integrity issues, including confusion about which data is actually accurate.
In this article, we’ll explain the most common solutions to the data silo problem and give you some context, along with the pros and cons of each.
Data Warehouse
A data warehouse is the original solution to the problem of data silos, and it’s still the most popular today. Data is copied out of each silo and loaded into a central database. From there, it can be cleaned and modeled (usually with a star schema).
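To make the star schema concrete, here’s a minimal sketch in Python, using an in-memory SQLite database as a stand-in for the warehouse; all table and column names are invented for illustration.

```python
import sqlite3

# In-memory SQLite stands in for the central warehouse database.
conn = sqlite3.connect(":memory:")

# Star schema: one central fact table joined to surrounding dimension tables.
conn.executescript("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,
    name         TEXT,
    region       TEXT
);
CREATE TABLE dim_date (
    date_key  INTEGER PRIMARY KEY,
    full_date TEXT,
    year      INTEGER
);
CREATE TABLE fact_sales (
    customer_key INTEGER REFERENCES dim_customer (customer_key),
    date_key     INTEGER REFERENCES dim_date (date_key),
    amount       REAL
);
""")

# Once everything lives in one place, reporting is a straightforward join.
report = conn.execute("""
    SELECT c.region, d.year, SUM(f.amount) AS total_sales
    FROM fact_sales f
    JOIN dim_customer c ON c.customer_key = f.customer_key
    JOIN dim_date d     ON d.date_key = f.date_key
    GROUP BY c.region, d.year
""").fetchall()
```

The payoff is the last query: because every fact row points at shared dimension tables, one join pattern serves most reports.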
The key advantage to a data warehouse is centralization. Data is stored in the same location so it can be easily joined. Business logic is also stored in one place to keep reporting consistent.
The problem with this approach is that data loading and data modeling are very expensive operations, both in terms of infrastructure and the need for highly skilled labor.
In the past, most data warehouses were on-premises. Now it’s become increasingly common for organizations to turn to cloud data warehouses (such as Snowflake) due to their flexibility and scalability.
Data Virtualization
Data virtualization (or a virtual data warehouse) sidesteps the heavy lifting of traditional ETL (extract, transform, and load). Instead of bringing data to a common database, it leaves the data in the source systems. The data virtualization software parses each query and pulls only the relevant data to a common server, where it is combined and served to the user.
You are still technically doing ETL, but it is more dynamic and happens on a per-query basis. Data virtualization is much faster to get up and running than a traditional data warehouse, and you are always working with fresh data.
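Here’s a minimal sketch of that per-query idea in plain Python, with in-memory lists standing in for two source systems. A real platform like Denodo handles the query parsing, pushdown, and optimization for you; every name below is invented for illustration.

```python
# Two "source systems" left in place; here they are just in-memory data.
crm_system = [
    {"customer_id": 1, "name": "Acme Corp", "segment": "Enterprise"},
    {"customer_id": 2, "name": "Globex", "segment": "SMB"},
]
billing_system = [
    {"customer_id": 1, "invoice_id": 101, "amount": 5000.0},
    {"customer_id": 2, "invoice_id": 102, "amount": 250.0},
]

def virtual_customer_invoices(segment):
    """Federate at query time: fetch only the relevant rows from each
    source system, then join them on a common server (here, locally)."""
    customers = [c for c in crm_system if c["segment"] == segment]
    wanted = {c["customer_id"] for c in customers}
    invoices = [i for i in billing_system if i["customer_id"] in wanted]
    by_id = {c["customer_id"]: c for c in customers}
    return [{**by_id[i["customer_id"]], **i} for i in invoices]

# Nothing is copied ahead of time, so every call reflects whatever the
# source systems hold right now.
print(virtual_customer_invoices("Enterprise"))
```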
The main advantage of a data virtualization platform such as Denodo is that it is technology-agnostic. Denodo has many integration connectors for systems like Oracle, SQL Server, Salesforce, and SharePoint.
In addition, you don’t need to invest as much money or time up front, as you would with a traditional data warehouse. Denodo is often used for proofs of concept where you let users see the data and the benefits before they commit.
The main drawback is source system performance. If your source systems are slow, then your data virtualization system will be slow. You can schedule and cache queries, but then you are essentially recreating a traditional data warehouse.
Event Streaming
The newest kid on the block, popularized by Apache Kafka, is event streaming. Streaming platforms are immensely complicated, but the basic operation is simple.
When an event is created (roughly analogous to a database row), it is sent to a topic (roughly analogous to a database table) in the streaming cluster. Any number of consumers can watch for new events.
Once the event lands in a topic, the consumer decides how to handle it. The event consumer has no knowledge of the event producer. This makes it easy to stream data between systems.
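As a sketch of that flow, here’s a minimal producer and consumer using the kafka-python client. It assumes a broker running at localhost:9092, and the "orders" topic and event shape are made up for illustration.

```python
import json
from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

# Producer side: send an event (roughly a row) to a topic (roughly a table).
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("orders", {"order_id": 42, "amount": 19.99})
producer.flush()

# Consumer side: watch the topic for new events. The consumer knows nothing
# about who produced them, which is what decouples the systems.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    # Each consumer decides what to do: load into a warehouse, update a
    # cache, trigger a workflow, and so on.
    print(message.value)
```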
Streaming systems are very good at tackling data silos operationally, but not so much analytically. Thus, one of the consumers of the event stream is usually a data warehouse. If the warehouse is the only consumer, though, the streaming platform ends up being a more complicated and expensive ETL option (unless you really need low latency).
Which One Do You Choose?
There is no single right answer as to which option is best. The explanations above hint at criteria that might rule out one option or another, but ultimately, breaking down silos comes down to looking at what data, systems, and processes you have and the specific outcomes you’re after. From there, you can work backwards to the right solution for your organization.
Note that these options are not mutually exclusive. Companies can and often do combine them, which can amplify the strengths and mitigate the weaknesses of the individual solutions. However, combined architectures are usually more expensive and much more complicated.
If you have any questions, reach out to us. Onebridge has been providing data analytics solutions for 17 years, and we’ve helped a wide range of enterprise organizations in different industries solve their problems with siloed data.
Other articles that might help:
Data Warehousing and Engineering
Data Modernization with Snowflake
Data Virtualization Usage Patterns for BI and Data Warehouse Architectures