Data Labeling

Manage labeling workflows, datasets, and content classification for governance compliance

Overview

Data labeling in OpenRails provides tools for classifying and categorizing content within your data lakes. The governance pipeline uses these labels to enforce security policies, apply de-identification rules, and control access based on content sensitivity.

Labeling Workflow

Navigate to Governance

From the sidebar, go to Governance > Data Labeling.

Select a Dataset

Choose the dataset (data lake or document collection) you want to label. The labeling interface shows documents with their current labels and classification status.

Define Label Categories

Create or select label categories for classification:

  • Sensitivity Level — Public, Internal, Confidential, Restricted, Top Secret
  • Content Type — Financial, Medical, Legal, Technical, Personal
  • PII Presence — Contains PII, No PII, Requires Review
  • Custom Categories — Define organization-specific labels
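
The categories above can be modeled as a small set of enumerations. The following is a minimal sketch for illustration only: the class names (Sensitivity, ContentType, PIIPresence, DocumentLabels) and the custom "Retention" category are hypothetical and are not part of any documented OpenRails API.

```python
from dataclasses import dataclass, field
from enum import Enum


class Sensitivity(Enum):
    PUBLIC = "Public"
    INTERNAL = "Internal"
    CONFIDENTIAL = "Confidential"
    RESTRICTED = "Restricted"
    TOP_SECRET = "Top Secret"


class ContentType(Enum):
    FINANCIAL = "Financial"
    MEDICAL = "Medical"
    LEGAL = "Legal"
    TECHNICAL = "Technical"
    PERSONAL = "Personal"


class PIIPresence(Enum):
    CONTAINS_PII = "Contains PII"
    NO_PII = "No PII"
    REQUIRES_REVIEW = "Requires Review"


@dataclass
class DocumentLabels:
    """Labels attached to one document (hypothetical structure)."""
    sensitivity: Sensitivity
    content_type: ContentType
    pii: PIIPresence
    # Organization-specific categories, stored as category name -> label value.
    custom: dict[str, str] = field(default_factory=dict)


labels = DocumentLabels(
    sensitivity=Sensitivity.CONFIDENTIAL,
    content_type=ContentType.MEDICAL,
    pii=PIIPresence.CONTAINS_PII,
    custom={"Retention": "7 years"},  # example custom category (assumed)
)
```

Using enumerations for the built-in categories rejects misspelled values early, while the free-form dictionary leaves room for custom categories.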

Apply Labels

Label documents individually or in bulk:

  • Manual Labeling — Review documents and assign labels one by one
  • Bulk Labeling — Select multiple documents and apply the same label
  • Auto-Labeling — Use pattern-based rules to automatically classify content
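
Conceptually, bulk labeling assigns one label value to every selected document. The sketch below illustrates the idea with a plain in-memory dictionary. The function name bulk_label and the document IDs are hypothetical, and this does not represent how OpenRails stores labels internally.

```python
def bulk_label(labels_by_doc, doc_ids, category, value):
    """Apply the same label value to every selected document.

    labels_by_doc maps a document ID to a dict of category -> label value.
    Labels in other categories are left untouched.
    """
    for doc_id in doc_ids:
        labels_by_doc.setdefault(doc_id, {})[category] = value
    return labels_by_doc


# Hypothetical label store: doc-1 already has a label, doc-2 has none yet.
store = {"doc-1": {"Content Type": "Legal"}}
bulk_label(store, ["doc-1", "doc-2"], "Sensitivity Level", "Internal")
```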

Review and Approve

Review auto-labeled documents for accuracy. Approve or correct labels before they are used by the governance pipeline.
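
One way to picture this step is as a review queue in which auto-assigned labels stay pending until a reviewer approves or corrects them. The sketch below is illustrative only: LabelProposal, the status values, and the rule that a correction counts as approval are all assumptions, not documented OpenRails behavior.

```python
from dataclasses import dataclass


@dataclass
class LabelProposal:
    """An auto-assigned label awaiting human review (hypothetical structure)."""
    doc_id: str
    category: str
    value: str
    status: str = "pending"


def approve(proposal):
    """Accept the auto-assigned label as it stands."""
    proposal.status = "approved"


def correct(proposal, new_value):
    """Replace the label value. Assumption: a correction counts as approval."""
    proposal.value = new_value
    proposal.status = "approved"


def released_labels(proposals):
    """Return only the labels that have passed review."""
    return [(p.doc_id, p.category, p.value) for p in proposals if p.status == "approved"]


queue = [
    LabelProposal("doc-1", "PII Presence", "Contains PII"),
    LabelProposal("doc-2", "PII Presence", "No PII"),
    LabelProposal("doc-3", "Sensitivity Level", "Internal"),
]
approve(queue[0])                  # label was correct
correct(queue[1], "Contains PII")  # reviewer found PII the rule missed
# queue[2] is still pending, so released_labels(queue) leaves it out
```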

Dataset Management

Manage labeled datasets from the governance dashboard:

Action            Description
Create Dataset    Define a new dataset from a data lake or document subset
Export Labels     Export labels as CSV for external analysis or compliance reporting
Label Statistics  View distribution of labels across the dataset
Re-label          Re-run auto-labeling rules after updating patterns
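
To illustrate what a label export and a label distribution might look like, the sketch below uses only the Python standard library. The column names (doc_id, category, label) and the sample records are assumptions; the actual layout of an OpenRails CSV export is not specified here.

```python
import csv
import io
from collections import Counter

# Hypothetical label records, one row per (document, category) pair.
records = [
    {"doc_id": "doc-1", "category": "PII Presence", "label": "Contains PII"},
    {"doc_id": "doc-2", "category": "PII Presence", "label": "Contains PII"},
    {"doc_id": "doc-3", "category": "PII Presence", "label": "No PII"},
    {"doc_id": "doc-1", "category": "Content Type", "label": "Medical"},
]


def export_labels_csv(rows):
    """Serialize label records to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["doc_id", "category", "label"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def label_distribution(rows, category):
    """Count how often each label value occurs within one category."""
    return Counter(r["label"] for r in rows if r["category"] == category)


csv_text = export_labels_csv(records)
pii_counts = label_distribution(records, "PII Presence")
```

A distribution that is heavily skewed toward one value, or that changes sharply after a Re-label run, is a useful signal that a rule needs review.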

Auto-Labeling Rules

Configure automatic classification rules based on content patterns.

Tip: Start with auto-labeling rules for clear-cut cases (e.g., PII detection via regex), then manually review edge cases. This balances speed with accuracy.

Important: Labels drive governance decisions including de-identification and access control. Incorrect labels can result in sensitive data being exposed or legitimate access being blocked. Audit labels regularly.
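
As a rough illustration of pattern-based rules, the sketch below classifies text with a few regular expressions. The rule format, the patterns, and the auto_label function are hypothetical and do not reflect the OpenRails rule syntax. The patterns are deliberately simple and will miss many real-world cases.

```python
import re

# Each rule: (category, label value, compiled pattern). All rules are illustrative.
RULES = [
    # US Social Security number shape, e.g. 123-45-6789
    ("PII Presence", "Contains PII", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    # Simplified email address shape
    ("PII Presence", "Contains PII", re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")),
    # A few financial keywords
    ("Content Type", "Financial", re.compile(r"\b(invoice|IBAN|wire transfer)\b", re.IGNORECASE)),
]


def auto_label(text):
    """Return category -> label value for the first matching rule per category."""
    labels = {}
    for category, value, pattern in RULES:
        if category not in labels and pattern.search(text):
            labels[category] = value
    # Assumption: a missing regex match does not prove the absence of PII,
    # so unmatched documents are routed to manual review, not marked "No PII".
    labels.setdefault("PII Presence", "Requires Review")
    return labels


clear_cut = auto_label("Contact jane@example.com about the invoice")
edge_case = auto_label("Quarterly roadmap")
```

Defaulting unmatched documents to Requires Review is a conservative choice: it costs reviewer time, but it avoids treating a document as free of PII merely because no pattern fired.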

Next Steps