Databricks operates at a scale where our internal logs and datasets are constantly changing – schemas evolve, new columns appear, and data semantics drift. This blog discusses how we use Databricks internally at Databricks so that, even as our platform continues to change, PII and other sensitive data are labeled correctly.
To do this, we built LogSentinel, an LLM-powered data classification system on Databricks, which tracks schema evolution, detects labeling drift, and feeds high-quality labels into our governance and security controls. We use MLflow to track experiments and monitor performance over time, and we’re integrating the best ideas from LogSentinel back into Databricks data classification products so that customers can benefit from the same approach.
Why This System Matters
This system is designed to move three concrete business levers for platform, data, and security teams:
- Shorter compliance cycles: Recurring review tasks that previously took weeks of analyst time are now completed in hours, because columns are pre-labeled and pre-triaged before a human even sees them.
- Lower operational risk: The system continuously detects labeling drift and schema changes, so sensitive fields are less likely to silently slip through with incorrect or missing tags.
- Stronger policy enforcement: Trusted labels now directly drive masking, access control, retention, and residency rules, turning “best-effort governance” into enforceable policy.
In practice, teams can plug new tables into a standard pipeline, monitor drift metrics and exceptions, and trust the system to enforce PII and residency constraints without having to build a custom classifier for each domain.
System Architecture at a Glance
We built an LLM-powered column classification system on Databricks that continuously annotates tables using our internal data classification taxonomy, detects labeling drift, and opens remediation tickets when something looks wrong. The components involved in the system are outlined below (tracked and evaluated using MLflow):
- Data ingestion: Ingesting various data sources, including Unity Catalog column data, label classification data, and ground truth data.
- Data enrichment: Augmenting data using Databricks Vector Search and AI-generated comments.
- LLM orchestration: Managing LLM calls at production scale.
- Multi-level labeling: Predicting granular, hierarchical, and residency labels per column.
- Model versioning: Running multiple model configurations in parallel.
- Label prediction: Predicting final labels using a mixture-of-experts (MoE) approach.
- Ticket creation: Finding violations and creating JIRA tickets.
The end-to-end workflow is shown in the figure below.
Data Ingestion
For each log type or dataset to be annotated, we randomly sample values from each column and send the following metadata to the system: table name, column name, type, existing comment, and a small sample of values. To reduce LLM costs and improve throughput, multiple columns from the same table are batched together in a single request.
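As a rough sketch of the batching step, the snippet below groups per-column metadata from the same table into a small number of request payloads. The `ColumnSample` fields mirror the metadata listed above; the class and payload shape are illustrative, not the production schema.

```python
from dataclasses import dataclass, asdict, field
import json

@dataclass
class ColumnSample:
    """Metadata sent per column: table, name, type, comment, sampled values."""
    table: str
    column: str
    dtype: str
    comment: str
    samples: list = field(default_factory=list)

def batch_requests(columns, max_per_request=8):
    """Batch multiple columns into one request payload to cut
    per-call LLM overhead and improve throughput."""
    for i in range(0, len(columns), max_per_request):
        batch = columns[i:i + max_per_request]
        yield json.dumps({"columns": [asdict(c) for c in batch]})

cols = [ColumnSample("billing.events", f"col_{i}", "string", "", ["x"]) for i in range(10)]
payloads = list(batch_requests(cols, max_per_request=8))
# 10 columns with a batch size of 8 -> 2 request payloads
```

In practice the batch size would be tuned against the model's context window so that sampled values are not truncated mid-batch.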
Our taxonomy is defined using protocol buffers and currently includes over 100 hierarchical data labels, with room for custom extensions when teams need additional categories. This gives governance and platform stakeholders a shared agreement on the meaning of “PII” and “sensitive” beyond a handful of regexes.
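The real taxonomy is protobuf-defined; as a hypothetical illustration of the hierarchy, a small slice might map broad categories to fine-grained labels (all category and label names below are invented for the example):

```python
# Hypothetical slice of a hierarchical taxonomy: broad categories
# mapping to fine-grained labels (the real 100+ label taxonomy is
# defined in protocol buffers).
TAXONOMY = {
    "PII": ["EMAIL_ADDRESS", "PHONE_NUMBER", "FULL_NAME"],
    "WORKSPACE": ["WORKSPACE_ID", "CLUSTER_ID"],
    "FINANCIAL": ["CREDIT_CARD_NUMBER", "INVOICE_ID"],
}

def parent_category(label):
    """Resolve a granular label back to its broad category."""
    for category, labels in TAXONOMY.items():
        if label in labels:
            return category
    return None  # unknown label: candidate for a custom extension

parent_category("EMAIL_ADDRESS")  # "PII"
```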
Data Enrichment
Two enhancement strategies significantly improve classification quality:
- AI column comment generation: When comments are missing, we use Databricks AI-Generated Comments to synthesize concise, human-readable descriptions that help both LLMs and future table consumers.
- Few-shot example generation: We maintain a ground truth dataset and use both static examples and dynamic examples obtained through vector search. For each column, we create an embedding from its name, type, comment, and context, then retrieve the most similar labeled columns to include in the prompt.
During early stages, or when labeled data is limited, static prompting works best, providing consistency and reproducibility. Dynamic prompting is more effective in mature systems, using vector search to pull similar examples and adapt to new schemas and data domains across large, diverse datasets.
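A minimal sketch of the dynamic few-shot retrieval idea, using plain cosine similarity as a stand-in for Vector Search (the embedding vectors, corpus, and field names here are toy values):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def dynamic_few_shots(query_vec, labeled_corpus, k=2):
    """Retrieve the k most similar labeled columns to include as
    few-shot examples in the prompt (stand-in for vector search)."""
    ranked = sorted(labeled_corpus, key=lambda ex: cosine(query_vec, ex["vec"]),
                    reverse=True)
    return ranked[:k]

# Toy ground-truth corpus with precomputed (fake) embeddings.
corpus = [
    {"name": "user_email", "label": "EMAIL_ADDRESS", "vec": [0.9, 0.1, 0.0]},
    {"name": "cluster_id", "label": "CLUSTER_ID", "vec": [0.0, 0.2, 0.9]},
    {"name": "contact_mail", "label": "EMAIL_ADDRESS", "vec": [0.8, 0.2, 0.1]},
]
# Query column embedded near the email-like columns.
shots = dynamic_few_shots([0.85, 0.15, 0.05], corpus, k=2)
# both retrieved examples carry the EMAIL_ADDRESS label
```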
LLM Orchestration
At the core of the system is a lightweight orchestration layer that manages LLM calls at production scale.
Key capabilities include:
- Multi-model routing across internally hosted LLMs (for example, Llama-, Claude-, and GPT-based models) with automatic fallback if a model is unavailable.
- Retry logic for rate limits and transient failures, with exponential backoff.
- Validation hooks that detect empty, invalid, or malformed labels and re-run those cases with a fallback model.
- Batch processing that annotates multiple columns at once to optimize token usage without losing context.
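The retry-and-fallback behavior above can be sketched as follows. `call_llm` is a hypothetical client function standing in for the real model endpoints, and the model names are placeholders:

```python
import time

def classify_with_fallback(payload, models, call_llm, max_retries=3, base_delay=1.0):
    """Try each model in order; retry transient failures with
    exponential backoff before moving on to the next fallback model.
    `call_llm(model, payload)` stands in for a real LLM client."""
    for model in models:
        delay = base_delay
        for _ in range(max_retries):
            try:
                labels = call_llm(model, payload)
                if labels:  # validation hook: reject empty responses
                    return model, labels
            except TimeoutError:
                time.sleep(delay)
                delay *= 2  # exponential backoff
        # all retries exhausted: fall through to the next model
    raise RuntimeError("all models exhausted")

# Simulated client: the primary model keeps timing out, the fallback answers.
def flaky_client(model, payload):
    if model == "primary-llm":
        raise TimeoutError
    return ["EMAIL_ADDRESS"]

model, labels = classify_with_fallback(
    {"column": "email"}, ["primary-llm", "fallback-llm"], flaky_client, base_delay=0.01
)
# model == "fallback-llm", labels == ["EMAIL_ADDRESS"]
```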
Multi-Level Labeling System
We predict three types of labels per column:
- Granular labels, drawn from a set of 100+ fine-grained options, which power masking, redaction, and strict access control.
- Hierarchical labels, which aggregate related granular labels into broad categories suitable for monitoring and reporting.
- Residency labels, which indicate whether data must remain in-region or can be transferred cross-region, and directly feed data movement policies.
To keep predictions consistent and minimize hallucinations, we use a two-step flow: a broad classification step picks a high-level category, then a refinement step chooses the precise label within that category. This mirrors how a human reviewer would first decide “this is workspace data” and then choose the specific workspace identifier label.
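A minimal sketch of the two-step flow, assuming a callable `llm` that returns the model's choice (the category map and the fake model are invented for illustration):

```python
# Two-step flow: broad category first, then a refined label within it.
CATEGORIES = {
    "PII": ["EMAIL_ADDRESS", "PHONE_NUMBER"],
    "WORKSPACE": ["WORKSPACE_ID", "CLUSTER_ID"],
}

def classify_column(metadata, llm):
    """Step 1 picks a high-level category; step 2 refines it.
    `llm(prompt)` is a hypothetical callable returning the choice."""
    category = llm(f"Pick one category {list(CATEGORIES)} for: {metadata}")
    # The refinement prompt is constrained to labels within the chosen
    # category, which shrinks the space for hallucinated labels.
    label = llm(f"Pick one label {CATEGORIES[category]} for: {metadata}")
    return category, label

# Fake model: answers the category question first, then the label question.
def fake_llm(prompt):
    return "WORKSPACE" if "category" in prompt else "WORKSPACE_ID"

result = classify_column({"column": "workspace_id"}, fake_llm)
# ("WORKSPACE", "WORKSPACE_ID")
```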
Model Versioning and Label Prediction
Instead of relying on a single “best” configuration, each model setup is treated as an expert competing to label a column.
Multiple model versions run in parallel, differing in:
- Primary and fallback LLM choices.
- Use of generated comments versus raw metadata.
- Prompting strategy (static vs. dynamic few-shot examples).
- Label granularity and taxonomy subsets.
Each expert produces a label and a confidence score between 0 and 100. The system then selects the label from the expert with the highest confidence, a mixture-of-experts-style approach that improves accuracy and reduces the impact of occasional poor predictions from any one configuration.
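The selection step described above reduces to picking the most confident expert. A minimal sketch, with invented expert names and scores:

```python
def pick_label(expert_predictions):
    """Mixture-of-experts selection: each expert (a model/prompt
    configuration) returns a label with a 0-100 confidence score;
    take the label from the most confident expert."""
    best = max(expert_predictions, key=lambda p: p["confidence"])
    return best["label"], best["confidence"]

# Hypothetical predictions from three parallel configurations.
predictions = [
    {"expert": "llama-static-fewshot", "label": "EMAIL_ADDRESS", "confidence": 72},
    {"expert": "gpt-dynamic-fewshot", "label": "EMAIL_ADDRESS", "confidence": 91},
    {"expert": "llama-raw-metadata", "label": "FREE_TEXT", "confidence": 40},
]
best_label, best_conf = pick_label(predictions)
# ("EMAIL_ADDRESS", 91)
```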
This design makes experimentation safe: new models or prompting strategies can be introduced, run alongside existing configurations, and evaluated on both metrics and downstream ticket volume before becoming the default.
Ticket Generation
The pipeline continuously compares the current schema annotations with the LLM predictions to uncover meaningful deviations.
Typical cases include:
- New columns added without any annotation.
- Existing annotations that no longer match the contents of the column.
- Columns with sensitive values that are labeled as eligible for cross-region movement.
When the system detects a violation, it creates a policy entry and files a JIRA ticket to the owning team with the table, column, proposed label, and confidence score. This turns data classification issues into an ongoing workflow that teams can track and resolve the same way they track other production issues.
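The drift check reduces to comparing current annotations against predictions and emitting ticket payloads for the mismatches. A sketch under assumed data shapes (the field names and confidence threshold below are illustrative, not the production schema):

```python
def find_violations(current_annotations, predictions, min_confidence=80):
    """Compare existing column annotations with LLM predictions and
    emit ticket payloads for meaningful deviations: missing
    annotations and annotations that no longer match predictions."""
    tickets = []
    for col, predicted in predictions.items():
        if predicted["confidence"] < min_confidence:
            continue  # low-confidence predictions go to review, not tickets
        existing = current_annotations.get(col)
        if existing != predicted["label"]:
            tickets.append({
                "column": col,
                "current_label": existing,  # None -> missing annotation
                "proposed_label": predicted["label"],
                "confidence": predicted["confidence"],
            })
    return tickets

tickets = find_violations(
    {"users.email": "EMAIL_ADDRESS", "users.note": None},
    {
        "users.email": {"label": "EMAIL_ADDRESS", "confidence": 95},
        "users.note": {"label": "FREE_TEXT_PII", "confidence": 88},
    },
)
# one ticket: users.note is missing its annotation
```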
Impact and Evaluation
The system was evaluated on 2,258 labeled samples, of which 1,010 contained PII and 1,248 were non-PII. On this dataset, it reached 92% accuracy and 95% recall for PII detection.
More importantly, the deployment produced the operational results stakeholders needed:
- Manual review effort for each large-scale audit cycle dropped from weeks to hours, as reviewers start with high-quality suggested labels rather than raw schemas.
- Labeling drift is now detected continuously as schemas evolve, instead of being discovered during an annual review.
- Alerts about sensitive data mislabeled as safe are more targeted, so security teams can act immediately instead of triaging noisy rule-based scanners.
- Masking and residency policies are now largely enforced using the same label taxonomy that powers analysis and reporting.
Precision and recall act as guardrails, but the system is optimized for outcomes such as review time, drift-detection latency, and the volume of actionable tickets produced per week.
Conclusion
By combining taxonomy-driven labeling with an MoE-style evaluation framework, and by managing experiments and deployments with MLflow, we plugged the system into existing engineering and governance workflows at Databricks. This keeps labels fresh as schemas change, makes compliance reviews faster and more focused, and provides the hooks needed to consistently enforce masking and residency rules across the platform.
The most exciting part of this work is bringing our internal learnings directly into the data classification product. As we pilot and validate these techniques inside LogSentinel, we are incorporating them directly into Databricks data classification.
The same patterns – ingesting metadata and samples, enhancing context, orchestrating multiple LLMs, and feeding predictions into policy and ticketing systems – can be reused wherever reliable, evolving understanding of the data is needed. By incorporating these insights into our core product offering, we are enabling every organization to leverage our data intelligence for compliance and governance with the same accuracy and scale as we do at Databricks.
Acknowledgments
This project was made possible through the collaboration of multiple engineering teams. Thanks to Anirudh Kondaveeti, Sitichai Giampojmarn, Zefan Zoo, Li Yang, Xiaohui Sun, Divyendu Karmakar, Chennen Liang, Vishwesh Periyasamy, Changzhou Ou, Avion Kim, Matthew Hayes, Benjamin Ebanks, and Sudeep Srivastava for their support and contributions.
