Arctic Wolf’s liquid clustering architecture tuned for petabyte scale


Every day, Arctic Wolf processes over a trillion events, converting billions of rich records into security-relevant insights. That translates to more than 60TB of compressed telemetry per day, powering AI-driven threat detection and response 24×7, with no lag. To find threats in real time, we had to make this data available to customers and our security operations center as quickly as possible, with the goal of returning most queries within 15 seconds.

Historically, we had to rely on other, faster datastores to provide access to recent data, because partitioning plus Z-ordering could not keep up. When we detect suspicious activity, our team needs to look back quickly at three months of historical context to understand attack patterns, lateral movement, and the full scope of a compromise. This real-time historical analysis against 3.8+ PB of compressed data is critical in the face of modern threats: the difference between containing a breach in hours versus days can mean millions in damage prevented.

When every second counts, speed and freshness matter. Arctic Wolf needed to accelerate access to large-scale datasets without increasing ingestion costs or adding complexity. The challenge? Investigations were slowed by heavy file I/O and stale data. By rethinking how data is organized, our architecture efficiently manages multi-tenant data skew, where a small portion of customers generate the majority of events, while also accommodating late-arriving data that may appear weeks after initial ingestion. Measurable benefits include cutting file count from 4M+ to 2M, reducing query times by ~50% across percentiles, and bringing 90-day query time down from 51 seconds to just 6.6 seconds. Data freshness improved from hours to minutes, making access to security telemetry almost instantaneous.

Read on to learn how Liquid Clustering and Unity Catalog managed tables made this possible, delivering consistent performance and real-time insights at scale.

Legacy constraints: why Arctic Wolf rearchitected

Our legacy table, partitioned by event date-hour and Z-ordered by tenant identifier, could not be queried in real time because the partitioning scheme produced a large number of small files. Worse, data from the most recent 24 hours was effectively unavailable, because we had to run OPTIMIZE with Z-ordering before the data could be queried efficiently.
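As an illustration of the legacy layout (table and column names here are hypothetical, not our production schema), the pattern looked roughly like this:

```sql
-- Hypothetical names; the legacy table was partitioned by date-hour
-- and Z-ordered by tenant identifier.
CREATE TABLE security_events_legacy (
  tenant_id  STRING,
  event_type STRING,
  payload    STRING
)
USING DELTA
PARTITIONED BY (event_date DATE, event_hour INT);

-- Recent data was not efficiently queryable until this maintenance
-- pass ran, which is why the latest ~24 hours stayed out of reach.
OPTIMIZE security_events_legacy
WHERE event_date >= current_date() - INTERVAL 1 DAY
ZORDER BY (tenant_id);
```

Every new date-hour partition started life as a pile of small, unordered files, and only a scheduled OPTIMIZE pass made it fast to query.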

Even then, performance issues persisted due to late-arriving data. This happens when a system goes offline before its data can be transmitted; when that data finally lands, it is written into older partitions, degrading their layout and hurting query performance.

Stale data blinds us. That delay is the difference between stopping an adversary and letting them advance.

To mitigate these performance challenges and keep data fresh, we had to duplicate our hot data in a data accelerator and blend its results with queries against our data lake to meet our business requirements. That system was expensive to run and required significant engineering effort to maintain.

To eliminate the need for a separate data accelerator, we redesigned our data layout to distribute data more evenly and handle late-arriving data gracefully. The new layout optimizes query performance and enables real-time access for current and emerging agentic AI use cases.

Building a Streaming Data Foundation with Liquid Clustering

The main objective of our new architecture is to make the most recent data immediately queryable, with consistent performance across clients of very different sizes and queries returned in seconds.

The reengineered pipeline follows a Medallion architecture, starting with continuous Kafka ingestion into the Bronze layer for raw event data. Hourly structured streaming jobs then flatten the nested JSON payload and write to the Silver table with Liquid Clustering, forming the primary analytical base. Here, bronze-to-silver transformations handle schema evolution, generate derived temporal columns, and prepare data for downstream analytical workloads with strict latency SLAs.
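To make the bronze-to-silver step concrete, here is a minimal sketch of the flattening logic for a single event, in plain Python. The event shape and field names are hypothetical, not Arctic Wolf's actual schema; in production this runs as a Spark transformation over the full stream.

```python
# Sketch under assumptions: event shape and field names are illustrative.
from datetime import datetime, timezone

def flatten_event(event: dict) -> dict:
    """Flatten one nested JSON event and derive temporal columns."""
    ts = datetime.fromtimestamp(event["ts"], tz=timezone.utc)
    return {
        "tenant_id": event["tenant"]["id"],
        "event_type": event["detail"]["type"],
        "source_ip": event["detail"].get("src_ip"),
        # Derived temporal columns, later used as clustering/query keys.
        "event_date": ts.date().isoformat(),
        "event_hour": ts.hour,
    }

sample = {
    "ts": 1700000000,
    "tenant": {"id": "t-042"},
    "detail": {"type": "auth_failure", "src_ip": "10.0.0.5"},
}
row = flatten_event(sample)
```

The derived `event_date` and `event_hour` columns are what let downstream time-window queries prune effectively.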

Liquid Clustering replaced rigid partition schemes with workload-aware, multi-dimensional clustering keys tied to our query patterns, specifically tenant identifier and date granularity, chosen based on table size and data-arrival characteristics. Distributing the data more evenly and raising the average file size to more than 1GB dramatically reduced the number of files scanned during a typical time-window query against our table.
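Declaring such a table is a one-line change from the legacy pattern. As a sketch (names are illustrative, not our production schema):

```sql
-- Hypothetical names; CLUSTER BY replaces PARTITIONED BY + ZORDER.
CREATE TABLE silver.security_events (
  tenant_id   STRING,
  event_date  DATE,
  event_hour  INT,
  event_type  STRING,
  payload     STRING
)
CLUSTER BY (tenant_id, event_date);
```

Unlike partition columns, clustering keys can be changed later with `ALTER TABLE ... CLUSTER BY` without rewriting existing data, which is what makes the layout workload-aware rather than fixed at design time.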

Deep Dive: Clustering on Write

Additionally, our structured streaming jobs leverage clustering-on-write to maintain the file layout as new data arrives. It acts like a localized OPTIMIZE, applying clustering only to newly ingested data, so incoming data lands already well clustered. However, if ingestion batches are too small, they generate many small but well-clustered files that still need a global OPTIMIZE to reach an ideal layout. Conversely, if the batch size at ingestion approaches the batch size the global OPTIMIZE works on, additional optimization is often unnecessary.

For workloads that ingest very large amounts of data (for example, terabytes), we recommend batching at the source, for example with foreachBatch and maxBytesPerTrigger, to ensure efficient clustering and file layout. With maxBytesPerTrigger we can control the batch size, eliminating the many small cluster islands that would otherwise need reconciliation through the OPTIMIZE operation. By choosing a size close to what OPTIMIZE works on, we create batches that minimize the work left for OPTIMIZE.
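The byte-budget idea behind maxBytesPerTrigger can be sketched in plain Python: accumulate incoming file sizes until a cap is reached, then emit one micro-batch. The sizes and cap below are illustrative, not our production settings; in Spark this batching happens inside the Delta streaming source itself.

```python
# Minimal stdlib sketch of byte-budget batching (the idea behind
# maxBytesPerTrigger). Sizes are in MB and purely illustrative.
def batch_by_bytes(file_sizes, max_bytes):
    """Group incoming file sizes into micro-batches capped at max_bytes."""
    batches, current, current_bytes = [], [], 0
    for size in file_sizes:
        # Flush the current batch before it would exceed the byte cap.
        if current and current_bytes + size > max_bytes:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        batches.append(current)
    return batches

# Ten 300MB arrivals with a ~1GB cap yield a few near-1GB batches,
# close to the file size a global OPTIMIZE would target anyway.
batches = batch_by_bytes([300] * 10, max_bytes=1024)
```

Sizing the cap near the layout OPTIMIZE targets is exactly why clustering-on-write leaves so little reconciliation work for the global OPTIMIZE pass.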

Impact on Arctic Wolf's security analysis

Arctic Wolf’s migration to Liquid Clustering resulted in substantial, quantifiable improvements in performance, data freshness, and operational efficiency. Unity Catalog managed tables with predictive optimization also reduce the need for scheduled maintenance.

File count dropped from 4M+ to 2M, reducing file I/O during queries while maintaining good clustering quality. As a result, query performance improved dramatically, letting security analysts investigate incidents faster: ~50% faster across percentiles, with many of our largest customers seeing ~90% improvements, and 90-day queries falling from 51 seconds to 6.6 seconds.

By implementing clustering-on-write, we improved data freshness from hours to minutes, cutting time-to-insight by approximately 90%. This enables near-real-time threat detection directly in Arctic Wolf’s data lake.

The transition to Liquid Clustering and Unity Catalog managed tables eliminated legacy partitioning, reduced technical debt, and unlocked advanced administration and performance features. With an architecture capable of processing and querying 260+ billion rows per day, we provide faster, more efficient access to critical security data from all these sources. Combined with our 24×7 Concierge Security® team and real-time threat detection, this enables quicker, more accurate threat response and mitigation. These differentiators help our customers gain a stronger, more agile security posture and greater confidence in Arctic Wolf’s ability to protect their environments and support ongoing business success.
