Building a File Semantic Analyzer: Guarding Outbound Data at Scale with AI
By Deepak Sharma (Engineering Manager II), Aditya Kumar (Staff Software Engineer), and Shubham Sonkar (Sr. Security Technologist)
Introduction
In today’s data-rich environments, organizations handle vast volumes of data across endpoints, cloud storage, and user devices. This data often exists or transits in the form of files, which can range from highly sensitive business documents and customer data to non-critical content. However, the absence of an efficient and accurate method to analyze file contents and determine their relevance—whether business-critical, personally identifiable, or non-essential—presents a significant challenge.
At Uber, data is the lifeblood of our operations. From ride details and payment information to internal project documents and strategic plans, the sheer volume and diversity of data are immense. The challenge lies in accurately identifying and classifying this data, distinguishing business-critical information from personal or non-essential files. Traditional human-in-the-loop oversight can’t scale to the massive flow of data egress, which creates a significant, unmanageable risk of sensitive organizational data, including intellectual property, leaving the environment unchecked. That jeopardizes security, regulatory compliance, and business integrity.
This fundamental challenge drove my team to innovate, culminating in the development of the File Semantic Analyzer (FSA). This AI-powered system semantically classifies files and provides a characterization and a summary of the information leaving our environment. The FSA dramatically reduces the need for manual oversight while significantly boosting the accuracy of our data protection efforts.
The Problem: A Needle in a Haystack of Bytes
Imagine a large enterprise with thousands of employees, each generating and sharing countless files daily. Think about the types of files: personal vacation photos, internal strategy documents, confidential customer lists, source code, design specifications, marketing collateral, and more. When these files egress—through email attachments, cloud uploads, or even USB drives—how do you, at scale, determine if a file contains business-critical information or is simply a personal document?
You might think DLP (Data Loss Prevention) systems can solve this problem easily. However, the fundamental shortcoming of traditional DLP systems is their inability to discern semantic meaning. They operate at the level of keywords, tokens, and pattern matching, not meaning. While this approach is useful in certain narrow scenarios, it’s prone to high rates of false positives and false negatives. For example, a document might contain the word “confidential” and still be a personal note. Conversely, a highly sensitive business plan could be written without any obvious keywords, slipping right through the cracks. This imprecision burdens security teams with endless, irrelevant alerts, leading to alert fatigue and the risk of overlooking genuine, high-stakes threats.
So, the need was clear: we had to develop an innovative solution that could truly read and understand a file, not just scan it.
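To make the keyword-matching limitation concrete, here is a minimal sketch of a pattern-based check. The patterns, sample texts, and the function name `keyword_dlp_flags` are all hypothetical and chosen only for illustration; they are not taken from any real DLP product.

```python
import re

# Hypothetical keyword rules of the kind a pattern-based DLP check might use.
SENSITIVE_PATTERNS = [
    re.compile(r"\bconfidential\b", re.IGNORECASE),
    re.compile(r"\binternal use only\b", re.IGNORECASE),
]


def keyword_dlp_flags(text: str) -> bool:
    """Return True if any keyword pattern matches the text."""
    return any(pattern.search(text) for pattern in SENSITIVE_PATTERNS)


# A personal note that happens to contain a trigger word.
personal_note = "Keep this confidential: I'm planning a surprise party for Sam."
# A sensitive plan that contains no trigger words at all.
business_plan = (
    "We will enter two new markets next quarter and undercut rival pricing."
)

flagged_personal = keyword_dlp_flags(personal_note)  # a false positive
flagged_plan = keyword_dlp_flags(business_plan)      # a false negative
```

The personal note is flagged and the business plan is not, which is exactly the inversion described above: the matcher sees words, not meaning.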
The Solution: Embracing Semantic Understanding with GenAI
Our answer to this problem was to move beyond mere pattern matching and embrace semantic understanding. We wanted to understand the meaning and context of the file content, not just its superficial characteristics. We asked ourselves: Can we not just classify, but also interpret and summarize the files, providing immediate, human-readable insights for our security analysts? This is where AI, particularly Generative AI, became our indispensable ally.
The core of our File Semantic Analyzer is an intelligent classification engine that learns to differentiate between various categories of files based on their intrinsic content. Here’s a high-level overview of our engineering journey.
1. Data Samples and Labeling: The Foundation of Intelligence
No AI model can thrive without high-quality, labeled data. This was the most labor-intensive but critical phase. We continue to collect and anonymize a diverse dataset of files, each meticulously labeled by subject matter experts as “Business Critical,” “Personal,” or “Neutral.” This human-curated dataset serves as the ground truth for the AI.
2. Pre-Processing: Fueling the LLMs
For GenAI solutions, the pre-processing steps are crucial to ensure optimal input for the LLMs (Large Language Models).
Pre-processing starts by ingesting files from various enterprise data sources (like Google Drive and internal file shares from collaboration platforms). For each file, the system performs content extraction, converting diverse file formats (PDFs, docs, spreadsheets, presentations, HTML) into plain text.
For image-based files, we integrated OCR (Optical Character Recognition) to convert visual text into machine-readable text before applying Generative AI for summarization and semantic analysis. Converting a file into plain readable text is crucial for Generative AI to process it effectively.
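As an illustration of the extraction step, the sketch below dispatches on file extension and converts HTML to plain text using only Python’s standard library. It is a simplified assumption of how such a step could look, not our production extractor: the names `extract_text` and `html_to_text` are hypothetical, and formats such as PDF, spreadsheets, and images (OCR) would need dedicated libraries that are left out here.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def extract_text(filename: str, raw: bytes) -> str:
    """Dispatch on file extension; only two formats are sketched here."""
    name = filename.lower()
    if name.endswith((".html", ".htm")):
        return html_to_text(raw.decode("utf-8", errors="replace"))
    if name.endswith(".txt"):
        return raw.decode("utf-8", errors="replace")
    # Other formats (PDF, spreadsheets, images via OCR) are out of scope.
    raise ValueError(f"no extractor registered for {filename!r}")
```

Raising an error for unknown formats, rather than returning an empty string, keeps unsupported files from silently passing through as "nothing to analyze."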
Large files can’t be fed directly into most LLMs due to token limits. We developed intelligent chunking strategies that maintain contextual coherence within each segment (like splitting by paragraphs, sections, or even semantic topics identified by initial analysis). This ensures the LLM receives meaningful blocks of text.
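A minimal sketch of paragraph-based chunking is shown below. It makes two simplifying assumptions that are ours for illustration only: token counts are approximated by whitespace-separated words (a real system would use the model’s tokenizer), and paragraphs are separated by blank lines. The function name `chunk_by_paragraph` is hypothetical.

```python
def chunk_by_paragraph(text: str, max_tokens: int = 200) -> list[str]:
    """Greedily pack whole paragraphs into chunks of at most max_tokens.

    Tokens are approximated by whitespace-separated words. A single
    paragraph longer than the budget is split on word boundaries.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")

    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0

    for para in paragraphs:
        words = para.split()
        if len(words) > max_tokens:
            # Flush what we have, then hard-split the oversized paragraph.
            if current:
                chunks.append("\n\n".join(current))
                current, current_len = [], 0
            for i in range(0, len(words), max_tokens):
                chunks.append(" ".join(words[i:i + max_tokens]))
            continue
        if current and current_len + len(words) > max_tokens:
            # Adding this paragraph would exceed the budget: start a new chunk.
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += len(words)

    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Packing whole paragraphs keeps related sentences together, so each chunk stays readable on its own; splitting by sections or semantic topics would follow the same shape with a different boundary detector.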
3. The Generative AI Interpretation Engine: The Brain of FSA
Each pre-processed text chunk is fed into a fine-tuned LLM. We use models capable of summarization and entity extraction.
Summarization models generate concise summaries of the file’s content. This is invaluable for analysts to quickly grasp the essence of a potentially problematic file without reading through hundreds of pages.
Models also automatically identify and extract named entities (like person names, organization names, dates, financial figures, project codes) and key phrases that indicate sensitivity (like “confidential project plan,” “customer database,” or “unreleased product specifications”).
The LLM’s true power lies in its ability to reason over the extracted information. It can identify relationships. For example, it can understand that “Project X” mentioned in one section is related to “Budget Approval” in another.
It can also infer semantic intent (classification). Based on the combination of content, entities, and context, the LLM can provide a probabilistic assessment of the file’s likely intent and its criticality for the organization. For example: “This appears to be a draft of a highly sensitive merger agreement and is critical for the organization.” Or: “This looks like a personal travel itinerary containing sensitive personally identifiable information (PII), but it isn’t critical for the organization.”
A critical advantage of GenAI is its ability to explain its reasoning, providing analysts with a transparent justification for its classification or flag. For instance, “This file is classified as ‘Business Critical - Financial’ because it discusses Q3 earnings, references unreleased revenue figures, and lists key investor contacts.”
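The outputs described in this section (summary, entities, classification, and explanation) are easier to act on when the model returns them in a structured form. The sketch below assumes the model is asked to reply with a JSON object; the field names, the `FileAssessment` type, and the `parse_assessment` function are hypothetical and are not a description of the FSA’s actual schema. The three labels come from the labeling step described earlier.

```python
import json
from dataclasses import dataclass, field

# Labels from the human-labeled dataset described earlier.
ALLOWED_LABELS = {"Business Critical", "Personal", "Neutral"}


@dataclass
class FileAssessment:
    label: str
    confidence: float  # probabilistic assessment in [0, 1]
    summary: str
    explanation: str
    entities: list[str] = field(default_factory=list)


def parse_assessment(raw_response: str) -> FileAssessment:
    """Validate a JSON reply from the model (hypothetical schema)."""
    data = json.loads(raw_response)

    label = data["label"]
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label: {label!r}")

    confidence = float(data["confidence"])
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")

    # Reject replies without a justification, so every classification
    # stays reviewable by an analyst.
    explanation = str(data.get("explanation", "")).strip()
    if not explanation:
        raise ValueError("an explanation is required")

    return FileAssessment(
        label=label,
        confidence=confidence,
        summary=str(data.get("summary", "")),
        explanation=explanation,
        entities=[str(e) for e in data.get("entities", [])],
    )
```

Validating the reply before it reaches any downstream rule means a malformed or unexplained answer fails loudly instead of being treated as a finding.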
4. Classification and Policy Enforcement Layer: Acting on Insights
The output from the GenAI Interpretation Engine (summaries, extracted entities, inferred intent, and explanations) is then fed into our rule-based engine to classify and enforce policies, which now benefits from this richer, higher-level understanding. For example:
- Alert if any file summarized as ‘Unreleased Product Specification’ is sent to external domains.
- Alert if a file containing ‘Customer PII’ and summarized as ‘Marketing Campaign List’ is emailed to a personal address.
- Don’t alert if a file summarized as ‘Personal Vacation Photos’ is sent to an external email address, even if it contains a name.
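The three example policies above can be sketched as a small first-match rule engine. Everything in this sketch is an illustrative assumption: the `EgressEvent` fields, the destination values, the rule names, and the default action are hypothetical and do not describe our production policy engine.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EgressEvent:
    summary_category: str  # category inferred by the GenAI engine
    tags: frozenset        # e.g. frozenset({"Customer PII"})
    destination: str       # "external_domain", "personal_email", "internal"


@dataclass(frozen=True)
class Rule:
    name: str
    matches: Callable[[EgressEvent], bool]
    action: str  # "alert" or "no_alert"


RULES = [
    Rule(
        "unreleased-spec-to-external",
        lambda e: e.summary_category == "Unreleased Product Specification"
        and e.destination == "external_domain",
        "alert",
    ),
    Rule(
        "pii-marketing-list-to-personal",
        lambda e: "Customer PII" in e.tags
        and e.summary_category == "Marketing Campaign List"
        and e.destination == "personal_email",
        "alert",
    ),
    Rule(
        "personal-vacation-photos",
        lambda e: e.summary_category == "Personal Vacation Photos",
        "no_alert",
    ),
]


def decide(event: EgressEvent, rules=RULES, default: str = "no_alert"):
    """Return (action, rule_name) for the first matching rule."""
    for rule in rules:
        if rule.matches(event):
            return rule.action, rule.name
    return default, "default"
```

Returning the rule name alongside the action keeps each decision traceable, which pairs naturally with the explanation the GenAI engine already provides.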
5. Human-in-the-Loop and Continuous Learning: The Feedback Loop Reinforced
The human analyst’s role is transformed from a tedious reviewer to a strategic validator. They’re freed from the noise of content review and are instead empowered by AI-generated findings to instantly validate critical threats and drive rapid decision-making.
Analysts are empowered to provide feedback on the system’s classifications and on the quality of summaries and interpretations. This input is used to refine the LLM’s internal prompts and guide fine-tuning, ensuring the AI’s current outputs are as accurate and relevant as possible.
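One simple way to make analyst feedback actionable is to record each verdict next to the model’s prediction and track where they disagree. The sketch below is an assumption about how such bookkeeping could look; the `Feedback` record and both helper functions are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Feedback:
    file_id: str
    predicted_label: str  # what the system said
    analyst_label: str    # what the analyst decided


def agreement_rate(items: list[Feedback]) -> float:
    """Fraction of reviewed files where analyst and system agree."""
    if not items:
        raise ValueError("no feedback recorded")
    agreed = sum(f.predicted_label == f.analyst_label for f in items)
    return agreed / len(items)


def disagreement_counts(items: list[Feedback]) -> Counter:
    """Count (predicted, corrected) pairs to show where the model errs."""
    return Counter(
        (f.predicted_label, f.analyst_label)
        for f in items
        if f.predicted_label != f.analyst_label
    )
```

The most frequent disagreement pairs point to the categories where prompt changes or additional fine-tuning examples are likely to help most.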
The Architecture
Figure 1: End-to-end architecture for analyzing data semantically and making decisions.
- File Connector: Ingests files from various egress points.
- File Processing: Handles pre-processing tasks and executes the intelligent text chunking process.
- Prompt Building: Constructs the optimized, context-rich prompt fed to the Generative AI engine.
- Decision (action on insights): Rule-based engine that makes decisions and takes actions based on the output from the GenAI Engine (summaries, extracted entities, inferred intent, and explanations).
- Manual Review (human-in-the-loop): High-fidelity AI findings are reviewed by humans before any action is taken. The reviewer’s decision goes into a feedback loop.
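To illustrate the Prompt Building component, here is a minimal sketch that wraps one text chunk in instructions and clear delimiters. The wording of the prompt, the `<file_chunk>` markers, the requested JSON keys, and the function name `build_prompt` are all assumptions made for this example; the actual prompts used by the FSA are not shown in this post.

```python
# Labels from the human-labeled dataset described earlier.
LABELS = ("Business Critical", "Personal", "Neutral")


def build_prompt(
    chunk: str,
    chunk_index: int,
    total_chunks: int,
    source: str = "unknown",
) -> str:
    """Build a prompt for one chunk (hypothetical wording and format)."""
    if not 1 <= chunk_index <= total_chunks:
        raise ValueError("chunk_index must be between 1 and total_chunks")

    lines = [
        "You are assisting a security analyst reviewing outbound files.",
        f"Source: {source}. This is chunk {chunk_index} of {total_chunks}.",
        "Classify the text as one of: " + ", ".join(LABELS) + ".",
        "Reply with a JSON object containing the keys: "
        "label, confidence, summary, entities, explanation.",
        "The text to analyze is between the <file_chunk> markers.",
        "<file_chunk>",
        chunk.strip(),
        "</file_chunk>",
    ]
    return "\n".join(lines)
```

Telling the model which chunk it is looking at, and how many there are, gives it a hint that the text may be a fragment of a larger document.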
Managing Risks in AI-Driven Analysis
While Generative AI brings a new level of intelligence to data security, we recognize that relying on Large Language Models (LLMs) introduces specific technical and operational risks. Despite its sophisticated reasoning, the File Semantic Analyzer (FSA) is subject to the inherent limitations of modern AI:
- Addressing LLM Hallucinations: A primary concern is the risk of "hallucinations," where a model might confidently misinterpret a file or invent non-existent sensitive details. We mitigate this by requiring the engine to provide a transparent "explanation" for every classification, allowing analysts to verify the logic against the actual file content.
- Preventing Context Loss: Large files can exceed token limits, potentially causing the AI to miss critical information at the end of a document. We overcome this through intelligent chunking strategies that maintain semantic coherence across segments, ensuring the "brain" of the FSA sees the full picture.
- Human-in-the-Loop Validation: To prevent autonomous errors from leading to business disruption, the system does not act in a total vacuum. High-fidelity findings are routed to human analysts who serve as strategic validators, ensuring that automated decisions are accurate and contextually sound.
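Verifying a model’s explanation against the file can be partly supported by tooling. As one possible aid (our assumption for illustration, not a described FSA feature), the sketch below checks whether each entity the model reports actually appears in the extracted text, so that invented details stand out to the analyst. The function name `ungrounded_entities` is hypothetical, and the check is a plain substring match after whitespace and case normalization.

```python
import re


def _normalize(text: str) -> str:
    """Collapse whitespace and ignore case for a tolerant comparison."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def ungrounded_entities(entities: list[str], source_text: str) -> list[str]:
    """Return reported entities that do not occur in the source text."""
    haystack = _normalize(source_text)
    return [e for e in entities if _normalize(e) not in haystack]


source = "Project  Falcon budget was approved\nby Dana Lee."
reported = ["project falcon", "Dana Lee", "Acme Corp"]
suspicious = ungrounded_entities(reported, source)  # entities to double-check
```

A substring match will not catch paraphrased or inferred details, so a check like this narrows the analyst’s attention rather than replacing their review.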
Impact and The Road Ahead
The File Semantic Analyzer has transformed our approach to data security. The shift from mere pattern matching to deep, contextual understanding via Generative AI has yielded significant benefits.
We now have far richer contextual insight, which has accelerated incident response. Security analysts get an immediate, high-level understanding of file content, including key takeaways and sensitive elements, without needing to open every document. This is a massive time-saver: incident response time for data exfiltration cases has dropped from multiple hours to a few minutes per case. On average, we’re saving 5 minutes per file and expect to analyze 150,000 files annually. That means saving roughly 4 person-years for the team.
The LLM’s power lies in its ability to infer intent and reason semantically, moving beyond simple pattern matching. This capability cuts false positives by 97% while, critically, missing fewer true positives, especially with novel data or evasive techniques. The result is a far more accurate defense that cuts through noise and reliably surfaces genuine threats.
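The time-savings arithmetic can be reproduced in a few lines. The per-file saving and annual file count come from the figures above; the hours-per-person-year conversions are assumptions, since the conversion behind the person-year figure isn’t stated here.

```python
MINUTES_SAVED_PER_FILE = 5
FILES_PER_YEAR = 150_000

# 5 min x 150,000 files = 750,000 minutes = 12,500 hours per year.
hours_saved = MINUTES_SAVED_PER_FILE * FILES_PER_YEAR / 60


def person_years(hours: float, hours_per_person_year: float) -> float:
    return hours / hours_per_person_year


# Assumption A: 8 working hours a day counted over all 365 days.
estimate_a = person_years(hours_saved, 8 * 365)  # about 4.3
# Assumption B: a 2,080-hour work year (40 hours x 52 weeks).
estimate_b = person_years(hours_saved, 2080)     # about 6.0
```

Under assumption A the result is close to the 4 person-years quoted above; a stricter work-year definition such as assumption B would put the saving nearer 6 person-years.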
Our journey with FSA is just beginning. We’re continuously exploring things like:
- Multimodal GenAI. Extending the interpretation capabilities beyond text extraction to analyze images, video, and even audio directly for embedded sensitive information. For example, identifying screenshots of internal tools.
- DLP transformation. The future of DLP is semantic enforcement. We’re testing the integration of FSA’s high-fidelity AI classification to supersede keyword matching, enabling our security systems to block data exfiltration based on a file’s true contextual intent. This is a vital step toward an autonomous defense.
Conclusion
By integrating Generative AI into our File Semantic Analyzer, we’re not just building a security tool. We’re building an intelligent guardian that truly understands the digital heartbeat of our organization, protecting our most valuable assets with unprecedented precision and insight.
Acknowledgments
Cover Photo Attribution: “Business analysis” by learn_tek is covered by CC0 1.0.
Deepak Sharma
Engineering Manager II
Deepak Sharma is an Engineering Manager II on the CyberDefense team at Uber and leads the global Security Response & Investigation team.
Aditya Kumar
Staff Software Engineer
Aditya Kumar is a Staff Software Engineer on the Security Response & Investigation team, leading the technology transformation.
Shubham Sonkar
Sr. Security Technologist
Shubham Sonkar is a Sr. Security Technologist on the Security Response & Investigation team.