Data Labeling

Manage labeling workflows, datasets, and content classification for governance compliance

Overview

Data labeling in OpenRails provides tools for classifying content within your data lakes. The governance pipeline uses these labels to enforce security policies, apply de-identification rules, and control access based on content sensitivity.

Labeling Workflow

Navigate to Governance

From the sidebar, go to Governance > Data Labeling.

Select a Dataset

Choose the dataset (data lake or document collection) you want to label. The labeling interface shows documents with their current labels and classification status.

Define Label Categories

Create or select label categories for classification:

Sensitivity Level — Public, Internal, Confidential, Restricted, Top Secret
Content Type — Financial, Medical, Legal, Technical, Personal
PII Presence — Contains PII, No PII, Requires Review
Custom Categories — Define organization-specific labels
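To make the category structure concrete, the built-in categories above can be pictured as a mapping from category name to allowed values. This is a minimal sketch, not the OpenRails storage format: the key names (`sensitivity_level`, `content_type`, `pii_presence`) and the `validate_labels` helper are hypothetical.

```python
# Hypothetical schema mirroring the categories listed above.
# The key names are assumptions; only the values come from this page.
LABEL_CATEGORIES = {
    "sensitivity_level": ["Public", "Internal", "Confidential", "Restricted", "Top Secret"],
    "content_type": ["Financial", "Medical", "Legal", "Technical", "Personal"],
    "pii_presence": ["Contains PII", "No PII", "Requires Review"],
}


def validate_labels(labels):
    """Return a list of problems; an empty list means every label is valid."""
    problems = []
    for category, value in labels.items():
        allowed = LABEL_CATEGORIES.get(category)
        if allowed is None:
            problems.append(f"unknown category: {category}")
        elif value not in allowed:
            problems.append(f"invalid value {value!r} for {category}")
    return problems
```

A custom category would be one more key in the mapping with its own list of allowed values.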

Apply Labels

Label documents individually or in bulk:

Manual Labeling — Review documents and assign labels one by one
Bulk Labeling — Select multiple documents and apply the same label
Auto-Labeling — Use pattern-based rules to automatically classify content

Review and Approve

Review auto-labeled documents for accuracy. Approve or correct labels before they are used by the governance pipeline.
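The review step can be thought of as a gate: only approved labels reach the governance pipeline. The sketch below illustrates that idea under stated assumptions. `LabelRecord`, its fields, and both helper functions are hypothetical and do not reflect an OpenRails API.

```python
from dataclasses import dataclass


@dataclass
class LabelRecord:
    # Hypothetical record shape, for illustration only.
    document_id: str
    label: str
    source: str          # "auto" or "manual"
    approved: bool = False


def approve(record, corrected_label=None):
    """Approve a label, optionally replacing it with a reviewer's correction."""
    if corrected_label is not None:
        record.label = corrected_label
        record.source = "manual"  # a corrected label is no longer purely automatic
    record.approved = True
    return record


def labels_for_pipeline(records):
    """Return only the labels that have passed review."""
    return [r for r in records if r.approved]
```

In this sketch an unreviewed auto-label is simply withheld, which matches the rule that labels are approved or corrected before the pipeline uses them.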

Dataset Management

Manage labeled datasets from the governance dashboard:

Action	Description
Create Dataset	Define a new dataset from a data lake or document subset
Export Labels	Export labels as CSV for external analysis or compliance reporting
Label Statistics	View distribution of labels across the dataset
Re-label	Re-run auto-labeling rules after updating patterns
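For readers who post-process exported labels, the following sketch shows roughly what the Export Labels and Label Statistics actions produce. The column names (`document_id`, `category`, `label`) and the sample records are assumptions; the actual CSV layout exported by OpenRails may differ.

```python
import csv
import io
from collections import Counter

# Hypothetical label records; real exports may use different columns.
RECORDS = [
    {"document_id": "doc-1", "category": "sensitivity_level", "label": "Internal"},
    {"document_id": "doc-2", "category": "sensitivity_level", "label": "Restricted"},
    {"document_id": "doc-3", "category": "sensitivity_level", "label": "Internal"},
    {"document_id": "doc-1", "category": "pii_presence", "label": "No PII"},
]


def export_labels_csv(records):
    """Serialize label records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["document_id", "category", "label"])
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def label_distribution(records, category):
    """Count how often each label value occurs within one category."""
    return Counter(r["label"] for r in records if r["category"] == category)
```

Calling `label_distribution(RECORDS, "sensitivity_level")` counts two `Internal` documents and one `Restricted` document.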

Auto-Labeling Rules

Configure automatic classification rules based on content patterns:

Keyword Rules — Classify documents containing specific keywords or phrases
Pattern Rules — Use regex patterns to detect content types (e.g., SSN patterns for PII detection)
Metadata Rules — Classify based on file type, source, or upload date
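The three rule types above can be sketched as small functions that each inspect a document and propose a label. This is an illustration under stated assumptions, not the OpenRails rule engine: the `Document` class, the rule functions, the example keywords, the `hr-uploads` source name, and the first-match-wins policy are all hypothetical. The SSN regex is a simple format check and will also match other nine-digit values written in the same shape.

```python
import re
from dataclasses import dataclass, field

# Matches the common NNN-NN-NNNN layout; a format check only, not validation.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


@dataclass
class Document:
    # Hypothetical document shape, for illustration only.
    text: str
    metadata: dict = field(default_factory=dict)


def keyword_rule(doc):
    """Keyword rule: flag financial content by example keywords."""
    text = doc.text.lower()
    if any(k in text for k in ("invoice", "balance sheet")):
        return ("content_type", "Financial")
    return None


def pattern_rule(doc):
    """Pattern rule: flag PII when an SSN-shaped string appears."""
    if SSN_PATTERN.search(doc.text):
        return ("pii_presence", "Contains PII")
    return None


def metadata_rule(doc):
    """Metadata rule: classify by an assumed 'source' metadata field."""
    if doc.metadata.get("source") == "hr-uploads":
        return ("content_type", "Personal")
    return None


RULES = [keyword_rule, pattern_rule, metadata_rule]


def auto_label(doc):
    """Apply every rule in order; the first rule to set a category wins."""
    labels = {}
    for rule in RULES:
        result = rule(doc)
        if result is not None:
            category, value = result
            labels.setdefault(category, value)
    return labels
```

Rule order matters in this sketch: when two rules propose different values for the same category, the earlier rule's value is kept, so conflicts like this are good candidates for manual review.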

Tip: Start with auto-labeling rules for clear-cut cases (e.g., PII detection via regex), then manually review edge cases. This balances speed with accuracy.

Important: Labels drive governance decisions, including de-identification and access control. Incorrect labels can expose sensitive data or block legitimate access. Audit labels regularly, for example by checking Label Statistics and spot-checking auto-labeled documents after each rule change.

Next Steps

De-identification — Apply PII masking based on labels
Governance Pipeline — Understand the end-to-end governance flow
Managing Data Lakes — Manage the data lakes being labeled