Bayer is a life sciences company and global leader in health care and nutrition, active in more than 100 markets in 83 countries. Guided by its mission – health for all, hunger for none – Bayer plans to provide its 92,500 employees with secure, searchable access to massive amounts of data. Five years ago, fragmented systems made this nearly impossible, and teams in the consumer health division struggled to use data properly to make decisions. By adopting Databricks and Unity Catalog, Bayer Consumer Health created a single, governed data platform that enables self-service analytics without data silos.
“With Databricks, we are building reusable core assets, enabling self-service analytics and fostering a data-driven organization that delivers insights for everyone and data silos for no one.” – Andre Wutheno, Principal Cloud Platform Architect, Bayer
Global fragmentation and “data tourism”
As a globally distributed company, Bayer’s previous data analytics setup was fragmented across different markets, each of which used its own technology stack for different purposes. When data needed to be shared, it was often copied, sometimes multiple times, in what Bayer calls “data tourism”. Data tourism increased data management costs and slowed down the implementation of new solutions. This complexity, along with performance issues, led to low adoption of the solutions Bayer IT could provide and challenged the company’s ability to make data-driven decisions. Beyond cost and performance, data tourism made it difficult to understand who was using what data, to enforce consistent access controls, or to confidently reuse trusted assets across markets.
Additionally, Bayer faced significant challenges in leveraging newer data analysis techniques such as machine learning. “The systems needed to support machine learning added an additional cost and maintenance burden because we needed to move machine learning to a completely dedicated platform on a different technology stack, in a different data center, on a different hyperscaler – so we couldn’t really use machine learning properly at that time,” said Andre Wutheno, principal cloud platform architect at Bayer.
When seeking solutions to these challenges, the Bayer Consumer Health Data & Analytics organization knew it needed to build a global, scalable data platform. With more than 2,000 business users and 25 zones across three global regions, supported by more than 250 machine learning and data engineers, Bayer needed a cloud-based system that could leverage serverless technology where possible. “It was important to ensure that our solutions would adapt to any data volume and number of simultaneous users to ensure everyone gets the best performance and immediate results,” said Wutheno. The cloud-based solution would also be fiscally responsible, ensuring that Bayer only pays for what it uses, and would allow the company to try out new services on a smaller scale before rolling them out as a global standard.
Template-based environment in Databricks
Bayer Consumer Health chose Databricks as the foundation of its data platform, enhanced with Azure services for data ingestion, storage, and more. All data transformation and data cleaning are done in Databricks, ensuring that raw data is turned into reusable, quality-checked and trusted data assets. With this solution, Bayer can also expose Azure ML and other Azure AI services to its developers.
Databricks provides a unified, integrated platform to meet the needs of Bayer’s data engineers, whether they are creating BI reports, ML solutions, or analytical applications. With Databricks as its unified platform, Bayer can run multiple projects with multiple teams working in parallel without negatively impacting each other. Each team can independently manage the lifecycle of new data products. Knowing that its local markets would have unique data needs that would differ from global analytics, Bayer needed a system that would centralize all of its data to avoid multiple copies and “data tourism,” while also giving each team the flexibility to use the data best suited to its markets. “We leveraged Databricks to create a template-based environment with dedicated service instances that ensures proper resource isolation and lifecycle management,” Wutheno said.
Unity Catalog provides a centralized governance and metadata layer in these environments, allowing teams to control core data assets once deployed while enabling them to be securely consumed and reused across projects and regions.
Fast data product implementation and self-service reporting
With the introduction of Unity Catalog as a replacement for its Hive metastore, Bayer moved from a push-based to a pull-based data-sharing approach. Data consumers only need permission to access governed and trusted core data assets. This way, each data domain team can decide for itself what to share with whom, without having to copy data across environments. With the introduction of serverless compute in combination with Unity Catalog, Bayer Consumer Health enabled secure connectivity from its development environment to production core data assets. This allowed data engineers to build new solutions in their development environments with production-grade data, leading to faster time to market for new analytics solutions while still keeping safeguards against data intrusion in place. “Unity Catalog was a game changer for us,” Wutheno said. “The new model makes it easier for us to ensure the latest data is available in data products at all stages, which speeds up the creation and testing of new solutions because engineers can use production-grade data to test their solutions.”
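To make the pull-based model concrete, here is a minimal sketch of how a data domain team might express read access to one core data asset. It only builds the Unity Catalog SQL GRANT statements as strings. The catalog, schema, table, and group names are hypothetical and are not taken from Bayer's platform, and the helper functions are illustrative rather than part of any Databricks API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreAsset:
    """A governed table addressed by Unity Catalog's three-level namespace."""
    catalog: str
    schema: str
    table: str


def quote(identifier: str) -> str:
    """Backtick-quote an identifier, doubling any embedded backticks."""
    return "`" + identifier.replace("`", "``") + "`"


def read_grants(asset: CoreAsset, principal: str) -> list[str]:
    """Build the statements that let `principal` read `asset` in place.

    Reading a table requires USE CATALOG and USE SCHEMA on its parents
    in addition to SELECT on the table itself. No data is copied.
    """
    c, s, t = quote(asset.catalog), quote(asset.schema), quote(asset.table)
    p = quote(principal)
    return [
        f"GRANT USE CATALOG ON CATALOG {c} TO {p}",
        f"GRANT USE SCHEMA ON SCHEMA {c}.{s} TO {p}",
        f"GRANT SELECT ON TABLE {c}.{s}.{t} TO {p}",
    ]


if __name__ == "__main__":
    # Hypothetical asset and consumer group, for illustration only.
    asset = CoreAsset("core_prod", "sales", "orders")
    for statement in read_grants(asset, "emea-analysts"):
        print(statement)
```

In a Databricks notebook, each statement could then be run with `spark.sql(statement)` by a user who is allowed to grant on those objects. The consumer queries the table where it lives, which is what removes the need to push copies into each environment.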
Bayer Consumer Health also introduced a central reporting endpoint that links to all of their catalogs. Because global core data assets are managed in a single region, employees can easily search and combine data across domains through a single, governed entry point, ensuring self-service analytics scale without reproducing silos or inconsistent definitions.
With Databricks and Unity Catalog, Bayer Consumer Health established shared standards for data access, naming, and security while maintaining flexibility. Governance is built into the platform rather than applied after the fact, allowing self-service analytics to scale with confidence. As Wutheno says, “We are building reusable core assets, enabling self-service analytics and fostering a data-driven organization that delivers insights for everyone and data silos for no one.”
