Redacting PII Before Using Images for AI Training
AI teams often inherit image data from the rest of the business: support uploads, dashcam clips, inspection photos, facility footage, public submissions, or archive media. That data may be useful for training or evaluation, but it usually contains people who never agreed to be part of a model dataset.
Before images move into annotation tools, model-training buckets, or vendor environments, remove visual PII. Redaction is easier to do at the boundary than after a dataset has been copied into five downstream systems.
What counts as visual PII in an AI dataset?
Faces and license plates are the obvious categories. They are not the only ones that matter.
Training and evaluation datasets often contain:
- Faces and heads in background scenes
- Vehicle license plates
- Name badges and employee IDs
- Passports, ID cards, and credit cards
- Screens showing dashboards, tickets, email, or patient records
- Whiteboards, documents, labels, and handwritten notes
- QR codes and barcodes that encode account, shipment, or contact data
- Tattoos or distinctive markings
- Street signs and location markers
A dataset can be privacy-sensitive even when the model target is not a person. A road-damage model still captures plates. A retail shelf model still captures shoppers. A home-inspection model still captures family photos, mail, and documents on a desk.
Put redaction before annotation
The safest sequence is:
- Ingest raw media into restricted storage.
- Run automated redaction.
- Store redacted derivatives in a separate dataset bucket.
- Send only redacted files to annotation, training, and evaluation.
- Keep the raw originals in restricted storage, under a shorter retention period than the redacted derivatives.
Do not wait until after annotation. Labeling vendors, contractors, and internal reviewers may see everything in the frame. If an image contains a face, plate, badge, or document, the privacy exposure has already happened by the time someone draws the first bounding box.
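The ordering above can be enforced in code rather than by convention. The following is a minimal sketch, assuming each media file has a tracking record with a processing stage; `Stage`, `MediaRecord`, and `exportable` are hypothetical names for illustration, not part of any particular tool.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    RAW = "raw"            # ingested into restricted storage, not yet processed
    REDACTED = "redacted"  # automated redaction finished, derivative stored


@dataclass(frozen=True)
class MediaRecord:
    media_id: str
    stage: Stage


def exportable(records):
    """Return only the records that may be sent to annotation, training, or evaluation."""
    return [r for r in records if r.stage is Stage.REDACTED]
```

An export job that builds its file list through a filter like this cannot hand a raw file to a labeling vendor, even if the raw record sits in the same tracking table.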
Keep originals and training data separate
Use different buckets, prefixes, or storage accounts for raw and redacted media:
s3://restricted-raw-media/fleet/2026/06/08/clip-001.mp4
s3://ml-redacted-datasets/fleet/2026/06/08/clip-001.mp4
Give annotation tools and model-training jobs access only to the redacted location. If the training job's credentials cannot read originals, a mistyped dataset path or an accidental config change in that job cannot pull raw media into model artifacts.
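One way to keep the two locations aligned is to derive the redacted path from the raw path and to scope the training role to the redacted bucket only. This is a sketch that assumes AWS S3, since the example paths use `s3://` URIs. The bucket names come from the example above; the function names are hypothetical, and the policy document follows the standard IAM JSON layout.

```python
RAW_BUCKET = "restricted-raw-media"
REDACTED_BUCKET = "ml-redacted-datasets"


def redacted_uri(raw_uri: str) -> str:
    """Map a raw object URI to the same key in the redacted bucket."""
    prefix = f"s3://{RAW_BUCKET}/"
    if not raw_uri.startswith(prefix):
        raise ValueError(f"not a raw-media URI: {raw_uri}")
    key = raw_uri[len(prefix):]
    return f"s3://{REDACTED_BUCKET}/{key}"


def training_read_policy() -> dict:
    """IAM-style policy granting read access to the redacted bucket only."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{REDACTED_BUCKET}",
                    f"arn:aws:s3:::{REDACTED_BUCKET}/*",
                ],
            }
        ],
    }
```

Because the policy never names the raw bucket, a training job pointed at a raw URI fails with an access error instead of silently reading originals.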
Choose categories based on the model task
Redact the PII that is unrelated to the model objective.
Street or mapping models. Redact faces and license plates by default. Decide whether street signs should remain based on the model's purpose. A navigation model may need sign text; a pavement-condition model probably does not.
Insurance and claims models. Redact faces, plates, documents, ID cards, screens, and credit cards. Damage photos frequently include unrelated property, mail, and vehicle information.
Retail and facility models. Redact faces, name badges, screens, documents, and visible writing. Cameras in stores and clinics catch more internal information than teams expect.
Real estate and home-imagery models. Redact faces, license plates, documents, screens, and street signs. Interior shots often include mail, diplomas, family photos, and device screens.
If the PII category is part of the model target, use a privacy review before deciding. For example, a license-plate recognition model cannot train on fully blurred plates, but that project has a very different consent and governance burden than a generic image classifier.
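The per-task defaults above can be captured as presets so that every job for a given model type starts from the same category set. The sketch below is illustrative only: the category identifiers and task names are hypothetical labels, not the names any specific redaction API expects.

```python
# Default redaction categories per model task, mirroring the guidance above.
CATEGORY_PRESETS = {
    "street_mapping": {"face", "license_plate"},
    "insurance_claims": {"face", "license_plate", "document", "id_card",
                         "screen", "credit_card"},
    "retail_facility": {"face", "name_badge", "screen", "document",
                        "visible_writing"},
    "real_estate": {"face", "license_plate", "document", "screen",
                    "street_sign"},
}


def categories_for(task: str, model_targets: frozenset = frozenset()) -> set:
    """Return the preset for a task, refusing if it overlaps the model target."""
    preset = CATEGORY_PRESETS[task]
    conflict = preset & model_targets
    if conflict:
        # The model needs a PII category to be visible: stop and escalate.
        raise ValueError(f"privacy review needed before training on: {sorted(conflict)}")
    return set(preset)
```

Raising an error on overlap, rather than quietly dropping the category, forces the privacy review to happen before anyone trains on unredacted plates or faces.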
Preserve dataset usefulness
Redaction changes pixels. That is the point, but the change can affect model performance if you redact too broadly.
A practical approach:
- Redact only selected PII categories, not entire images.
- Keep category choices stable within a dataset version.
- Save redaction settings with the dataset metadata.
- Run a small evaluation before and after redaction.
- Compare model metrics on the task you actually care about.
If redaction hurts performance, inspect examples. The issue may be that a category is too broad for the use case, not that redaction is wrong. For instance, redacting street signs may hurt a street-sign recognition model but have no measurable effect on a road-surface defect model.
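The before-and-after comparison can be a few lines of code. This sketch assumes each evaluation run produces a dictionary of metric names to scores where higher is better; the function names and the 0.01 tolerance are illustrative choices, not recommendations.

```python
def metric_deltas(before: dict, after: dict) -> dict:
    """Per-metric change (after minus before) for metrics present in both runs."""
    return {name: after[name] - before[name] for name in before if name in after}


def regressions(before: dict, after: dict, tolerance: float = 0.01) -> list:
    """Names of metrics that dropped by more than `tolerance` after redaction."""
    deltas = metric_deltas(before, after)
    return sorted(name for name, delta in deltas.items() if delta < -tolerance)


# Example: scores from the same model trained on raw vs. redacted data.
raw_scores = {"mAP": 0.80, "recall": 0.90}
redacted_scores = {"mAP": 0.75, "recall": 0.895}
flagged = regressions(raw_scores, redacted_scores)  # only "mAP" exceeds the tolerance
```

A flagged metric is a prompt to inspect examples for that task, not proof that a category must be dropped.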
Version your redacted datasets
Treat redaction settings as part of dataset versioning.
Record:
- Source dataset version
- Redaction date
- PII categories selected
- Redaction method
- Processing job IDs
- Sampling review results
- Known limitations
This matters later. If a model's behavior changes, you need to know whether the training data changed because of new labels, new images, new redaction settings, or all three.
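The fields listed above fit naturally into a small manifest stored next to each dataset version. The following is a minimal sketch using only the Python standard library; the class name, field names, and JSON layout are assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RedactionManifest:
    source_dataset_version: str
    redaction_date: str      # ISO 8601 date, e.g. "2026-06-08"
    pii_categories: list     # categories selected for this version
    redaction_method: str    # e.g. "blur" or "solid_fill"
    processing_job_ids: list = field(default_factory=list)
    sampling_review: dict = field(default_factory=dict)
    known_limitations: list = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize with sorted keys so manifests diff cleanly between versions."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)
```

Writing the manifest with sorted keys means a plain text diff between two dataset versions shows exactly which redaction settings changed.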
Use sampling review, not blind trust
Automated redaction should reduce exposure, not remove accountability. Build a review step into dataset creation.
For a small dataset, review every file. For a large dataset, sample enough files to cover the conditions where automated detection is most likely to miss:
- Low-light footage
- Motion blur
- Wide-angle or fisheye images
- Small distant faces
- Reflective screens
- Dense street scenes
- Scans or photos of documents
Track misses by category. If plates are consistently missed in night footage, adjust the workflow before the dataset moves downstream.
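Sampling by condition and tracking misses by category can be sketched as follows. The code assumes each file is already tagged with a capture condition (for example `"low_light"`) and that reviewers record one result per PII instance they inspect; the function names and tuple layouts are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(files, per_condition, seed=0):
    """Pick up to `per_condition` files from each capture condition.

    `files` is an iterable of (file_id, condition) pairs.
    """
    by_condition = defaultdict(list)
    for file_id, condition in files:
        by_condition[condition].append(file_id)
    rng = random.Random(seed)  # fixed seed keeps the review sample reproducible
    return {
        condition: rng.sample(ids, min(per_condition, len(ids)))
        for condition, ids in sorted(by_condition.items())
    }


def miss_rates(review_results):
    """Fraction of reviewed instances that were missed, keyed by (category, condition).

    `review_results` is an iterable of (category, condition, was_missed) tuples.
    """
    total, missed = Counter(), Counter()
    for category, condition, was_missed in review_results:
        total[(category, condition)] += 1
        if was_missed:
            missed[(category, condition)] += 1
    return {key: missed[key] / total[key] for key in total}
```

Keying the miss rate by both category and condition is what surfaces patterns such as plates missed specifically in night footage, which an overall miss rate would hide.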
Retention matters
Redaction does not answer every privacy question. You still need a retention policy for the raw source media.
Ask:
- Why do we need to keep the original?
- Who can access it?
- When is it deleted?
- Can downstream teams do their work from the redacted derivative?
- Are backups and replicas covered by the same policy?
Many teams keep raw media forever because nobody owns deletion. That is a process bug, not a technical requirement.
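Giving deletion an owner usually starts with a scheduled job that lists what is overdue. This is a minimal sketch, assuming you can list raw objects with their ingestion timestamps; the function names and the 30-day figure in the usage lines are illustrative, not a recommended retention period.

```python
from datetime import datetime, timedelta, timezone


def is_expired(ingested_at: datetime, retention_days: int, now: datetime) -> bool:
    """True once an object has been held for the full retention period."""
    return now >= ingested_at + timedelta(days=retention_days)


def overdue(objects, retention_days: int, now: datetime) -> list:
    """URIs of raw objects past retention.

    `objects` is an iterable of (uri, ingested_at) pairs with timezone-aware timestamps.
    """
    return [uri for uri, ts in objects if is_expired(ts, retention_days, now)]


# Usage: run on a schedule and hand the result to whoever owns deletion.
now = datetime(2026, 8, 1, tzinfo=timezone.utc)
raw_objects = [
    ("s3://restricted-raw-media/fleet/2026/06/08/clip-001.mp4",
     datetime(2026, 6, 8, tzinfo=timezone.utc)),
    ("s3://restricted-raw-media/fleet/2026/07/20/clip-002.mp4",
     datetime(2026, 7, 20, tzinfo=timezone.utc)),
]
to_delete = overdue(raw_objects, retention_days=30, now=now)
```

The same check should run against backups and replicas; otherwise the policy covers only the primary copy.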
Automating the workflow
The PiiBlur API is designed for this boundary step. Upload source files, select categories, receive a webhook when processing completes, and write redacted outputs into the dataset bucket.
For image datasets, start with the Image Redaction API. For video clips, use the Video Redaction API. If your dataset mostly contains people or vehicles, the Face Blur API and License Plate Blur API examples cover the common request shape.
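The last step of that flow, turning a completion webhook into a write to the dataset bucket, might look like the sketch below. Every field name here (`status`, `job_id`, `source_key`, `output_url`) is a hypothetical placeholder and not PiiBlur's documented webhook schema; check the API reference for the real payload before adapting it.

```python
def handle_redaction_event(event: dict, dataset_prefix: str) -> dict:
    """Decide what to do with one processing-complete event.

    The event shape is an assumption for illustration, not a documented schema.
    """
    if event.get("status") != "completed":
        # Failed or partial jobs must never reach the dataset bucket.
        return {"action": "retry_or_alert", "job_id": event.get("job_id")}
    filename = event["source_key"].rsplit("/", 1)[-1]
    return {
        "action": "copy_to_dataset",
        "job_id": event["job_id"],
        "download_from": event["output_url"],
        "write_to": f"{dataset_prefix.rstrip('/')}/{filename}",
    }
```

Keeping the handler a pure function that returns a decision makes it easy to test without network access; the code that actually downloads and uploads files can live in a separate layer.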