Version: V11

Understanding PII Detection and Redaction using VIDIZMO Indexer

The VIDIZMO Indexer App uses AI models to detect Personally Identifiable Information (PII) entities present in your Portal content. If Redaction is part of your VIDIZMO package, you can also use the application to redact these PII entities. It provides advanced processing options to make redaction more accurate. The AI models used by the application support multiple languages and a set of predefined PII entities for detection.

This functionality also benefits on-premises customers, as it allows them to process their content using the VIDIZMO application within their own systems. This keeps sensitive data secure by avoiding the need to store it on the public clouds or external storage required by services like AWS and Azure.

Concept

Transcription-based PII Detection

To redact PII from your audio or video, you need to have transcriptions generated for them. The VIDIZMO Indexer app will automatically generate the transcriptions for your audio or video if any PII entity is added to its Insights. You can also use the application itself to generate transcriptions separately or opt for other Indexing applications, such as Azure Video Analyzer ARM or AWS Indexer, provided by VIDIZMO. You can even upload your own closed caption or transcription file for the content you want to process for PII; visit How to Add Closed Captions for more information.

Once you have the transcriptions, you can then redact the PII entities that you have added to the VIDIZMO Indexer's insights. You can also detect these PII entities in the supported languages mentioned below.

During the detection process, the application will also factor in the rest of your configurations, such as the minimum confidence score for the PII detections, context keywords, excluded words, time interval threshold and original file handling. See Configuring VIDIZMO Indexer for PII Detection and Redaction for more information.

Note: After the processing is done, the VIDIZMO Indexer creates a Media Culture attribute for your audio or video file. The media culture attribute indicates the Language or Languages that the Media or Evidence consists of (or has content relating to).

OCR-based PII Detection

VIDIZMO Indexer also includes the ability to detect PII in videos and documents using Optical Character Recognition (OCR). This feature serves as an alternative for videos that contain PII but do not have audio for transcription. With OCR support, the Indexer can process text from on-screen content and detect PII entities from visible text. This ensures comprehensive PII detection regardless of whether the content includes audio.

When PII is detected within an OCR object, the OCR object is converted to a PII object according to its associated PII class (such as Person Name). In Studio Space, these objects are classified as PII objects instead of OCR objects. You can filter and view these PII objects in Studio Space as well.

Note: OCR-based PII detection and redaction is only supported in English.

OCR-based Processing

For videos with no audio: The VIDIZMO Indexer generates OCR for all videos, so if a video lacks audio, PII detection is performed on the extracted OCR data. The system extracts visible text from the video’s frames (such as on-screen text, captions, or any other textual content) and then performs PII detection on the extracted text. This allows the Indexer to identify and redact PII from videos even when there is no audio for transcription.
For documents: OCR processing is applied directly to the document to extract text. Once the OCR has extracted the text, the Indexer performs PII detection on the recognized content. This ensures that PII is detected from any text-based content, whether it’s embedded in a video or contained within a document.

Since OCR works line by line, PII redaction in documents is performed on a line-by-line basis. The system detects PII entities within each individual line of text and redacts the entire line accordingly. This keeps redaction confined to the affected lines while preserving the structure and format of the original document.
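The line-by-line behavior described above can be illustrated with a minimal Python sketch. It is not the Indexer's implementation: the email regex stands in for the real PII models, and the names EMAIL_PATTERN and redact_lines are hypothetical.

```python
import re

# Hypothetical stand-in for the Indexer's PII models: a simple email regex.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact_lines(ocr_lines, mask="[REDACTED]"):
    """Replace every OCR line that contains a PII match with the mask.

    The whole line is masked, mirroring the line-level redaction
    described for documents; other lines are returned unchanged.
    """
    return [mask if EMAIL_PATTERN.search(line) else line for line in ocr_lines]


lines = [
    "Invoice 2041",
    "Contact: jane.doe@example.com",
    "Total due: 120.00",
]
redacted = redact_lines(lines)
```

Only the second line contains a match, so only that line is masked and the line count and order of the document stay the same.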

Consumption

PII detection and redaction by the VIDIZMO Indexer app utilizes AI processing as a consumption metric for your VIDIZMO Account. To learn how you can view consumption reports, refer to Consumption Reports for SaaS Deployment Overview.

Supported Languages

Here is a list of languages supported by the VIDIZMO Indexer for PII detection and redaction.

English
German
Spanish
French
Italian
Japanese
Korean
Russian
Swedish
Chinese

Note: If support for a language is unavailable, contact VIDIZMO support.

PII Entities

Here is the list of predefined PII Entities available by default.

Note: To add additional or custom PIIs for detection, you can contact VIDIZMO support.

Address
Age
Australian Business Number
Australian Company Number
Australian Medicare
Australian Tax File Number
Credit Card Number
Crypto Wallet Number
Date Time
Email Address
IBAN Code
Indian AADHAAR
Indian Permanent Account Number
IP Address
Italian Driver License
Italian Fiscal Code
Italian Identity Card
Italian Passport Number
Italian VAT Code
Medical License
NRP (Nationality, Religious or Political Group)
Organization
Person’s Name
Phone Number
Polish National Identification Number
Profession
Singaporean Unique Registered Entity Number
Spanish Personal Tax ID (Número de Identificación Fiscal)
UK National Health Service Number
Unique Identifier
URL
US Bank Number
US Driver License
US Individual Taxpayer Identification Number
US Passport Number
US Social Security Number
User Name
Zip Code

Confidence Score

The confidence score threshold is a value that the AI model uses to determine whether a detected object or word is PII. When the input text is analyzed for PII detection, the model breaks the text down into individual components called tokens and assigns a confidence score to each of them. The model then compares the score of each token with the score threshold to determine which of the detected objects are PII entities. A word or token is classified as a PII entity if its confidence score is higher than the score threshold.

Increasing the score threshold means that the model classifies fewer detections as PII, but the ones it keeps are more likely to be correct. You can keep the score threshold at a high value if you want the model to pick out only the tokens that it is very confident are PII. On the other hand, lowering the score threshold means that the model is likely to classify more tokens as PII; this is useful when you want to minimize the chance of missing a token that might be a PII entity.

A high confidence score threshold for PII is suitable for text that may contain fewer instances of PII. In comparison, a low confidence threshold is ideal for text that may contain more instances of PII. The confidence score can have a value from 0 to 100, but it is highly recommended that you use 45 for the best results.
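As a rough illustration of the trade-off described above, the following Python sketch filters a list of detections by a threshold. The Detection class, the function name, and the sample scores are assumptions made for the example; only the 0 to 100 range, the recommended value of 45, and the "higher than the threshold" rule come from the text.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    text: str
    entity: str
    score: float  # confidence on the 0-100 scale described above


def filter_by_threshold(detections, threshold=45):
    """Keep detections whose score is higher than the threshold."""
    return [d for d in detections if d.score > threshold]


# Sample scores are invented for illustration only.
detections = [
    Detection("555-555-5555", "Phone Number", 85),
    Detection("Jordan", "Person's Name", 50),
    Detection("May", "Date Time", 30),
]

strict = filter_by_threshold(detections, threshold=80)   # fewer, more certain
default = filter_by_threshold(detections)                # recommended 45
lenient = filter_by_threshold(detections, threshold=10)  # more, less certain
```

Raising the threshold shrinks the result to the most confident detection, while lowering it lets the low-scoring "May" through as well.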

Excluded Words

The VIDIZMO Indexer also provides a field where you can enter a list of words that will not be classified as PII entities. For example, if you have configured the application to detect 'Organization' as PII, then the word 'VIDIZMO' will be detected and redacted. However, if 'VIDIZMO' is present in the excluded words list, the application skips this word and does not identify it as PII.

Please ensure that you capitalize words correctly. This field is case-sensitive, and an exact match is required for a word to be excluded from PII detection, even when words are spelled the same but capitalized differently. For example, if you want 'MARCORP' to not be identified as PII, then you need to add 'MARCORP' in the excluded words field. It should not be 'marcorp' or 'Marcorp', as the AI model identifies these as separate entities.
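The exact-match rule can be sketched in a few lines of Python. This is an illustrative sketch, not the Indexer's code; the function name apply_excluded_words is hypothetical.

```python
def apply_excluded_words(detected_words, excluded_words):
    """Drop detections that exactly match an excluded word.

    The comparison is case-sensitive on purpose: 'MARCORP' and
    'Marcorp' are treated as different entries.
    """
    excluded = set(excluded_words)
    return [word for word in detected_words if word not in excluded]


detected = ["MARCORP", "Marcorp", "marcorp"]
remaining = apply_excluded_words(detected, ["MARCORP"])
```

With only 'MARCORP' in the excluded list, the other two capitalizations are still reported as PII, which is why each variant you want to skip has to be listed separately.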

Context Keywords

You can add context keywords that enhance the confidence score of PII entities found near them. Leveraging context words to increase the confidence score makes PII detection more accurate. For the score enhancement to happen, the context keywords need to be present within a range of approximately 5 to 10 words (both before and after) of the PII entity. You can provide a list of the relevant context words in the Speech & Text Analyzer application, and they will be utilized to enhance the relevant PII entities.

It is recommended that you enter words relevant to the PII entities you have configured for detection for the most precision and efficiency. For instance, if you want to detect 'PHONE_NUMBER' as a PII entity, you need relevant context words such as phone, number, or contact. Take a look at the sentence below:

"Can you write down my contact? It is 555-555-5555."

In this sentence, the context word to enhance the PII detection is contact. The context word is relevant within the context of the PII.

Note: The relevance of the context keywords is essential. It is recommended that you do not overload the list with unrelated words, as this can interfere with the detection process and prevent any confidence score enhancement.
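The following Python sketch shows one way a context-keyword check of this kind could work on the example sentence above. It is a simplified illustration under stated assumptions: the window of 5 words, the boost of 35 points, the base score of 40, the phone regex, and the function names are all hypothetical, and the Indexer's actual scoring is not documented here.

```python
import re


def has_context(text, match_start, match_end, context_words, window=5):
    """Return True if a context word appears within `window` words
    before or after the matched span (case-insensitive)."""
    before = re.findall(r"\w+", text[:match_start].lower())[-window:]
    after = re.findall(r"\w+", text[match_end:].lower())[:window]
    nearby = set(before) | set(after)
    return any(word.lower() in nearby for word in context_words)


def boosted_score(base_score, in_context, boost=35, max_score=100):
    """Raise the score by an assumed fixed boost when context is present."""
    return min(base_score + boost, max_score) if in_context else base_score


text = "Can you write down my contact? It is 555-555-5555."
phone = re.search(r"\d{3}-\d{3}-\d{4}", text)  # illustrative pattern only
in_context = has_context(
    text, phone.start(), phone.end(), ["phone", "number", "contact"]
)
score = boosted_score(40, in_context)
```

Here "contact" sits three words before the number, so the check succeeds and the assumed base score of 40 rises above the recommended threshold of 45. Without a nearby context word, the same match would stay at 40 and be dropped.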

To see how you can perform PII detection and redaction on your Portal, visit How to Perform PII Redaction using VIDIZMO Indexer.

Creating and Managing Custom PII

Custom PII are user-defined entities that extend VIDIZMO’s built-in PII set. They work across transcript-based detection and OCR, and appear as a selectable insight called Custom PII in the VIDIZMO Indexer. When you select Custom PII, the VIDIZMO Indexer includes your configured custom entities in detection and redaction, along with any predefined PII types you select.

Concept

Custom PII are created using custom patterns that use one of three matching methods. Each produces detections with a score that is evaluated against the confidence score threshold in the VIDIZMO Indexer settings, similar to preconfigured PII.

Word list

Exact words/phrases you supply. Useful for proper nouns, codenames, and sensitive vocabulary.

Regex pattern

Custom PII created using regular expressions. Useful for structured formats (alphanumeric IDs, dates, codes).

Context Words with Regex

Use this method when a regex pattern alone isn’t reliable for finding matches. The VIDIZMO Indexer first detects text that matches your regex pattern, then checks for nearby context words to strengthen the confidence score.

If any of the words appearing before the regex match (up to five words) are found in the defined context words, the regex score is boosted. This makes detection more accurate, especially when the same pattern could appear in different contexts.

Note: For best results, set the regex score lower than the confidence threshold (for example, a regex score of 0.01 with a confidence threshold of 0.3). If the regex score is equal to or higher than the confidence threshold, the text will always be detected even without context words. Keeping the regex score low ensures the pattern triggers only when a relevant context word is present.

Example

Custom PII: Date of birth using Context words with regex
Text: “My date of birth is 8/01/2025.”

If the regex score is low (for example, 0.01), context words such as date and birth increase the detection score and confirm the match.
If the regex score is high (for example, 0.8), the regex match is strong enough that context words have little or no impact.

This approach helps you fine-tune how much influence context words have over detection confidence.
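The date-of-birth example can be sketched in Python as follows. This is a minimal illustration, not the Indexer's implementation: the date regex, the boost of 0.4, and the names DOB_PATTERN, CONTEXT_WORDS, and detect are assumptions. The five-word look-behind, the 0.01 and 0.8 regex scores, the 0.3 threshold, and the "equal to or higher than" rule come from the text above.

```python
import re

# Illustrative date pattern; the real custom pattern is user-defined.
DOB_PATTERN = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")
CONTEXT_WORDS = {"date", "birth"}


def detect(text, regex_score, threshold, boost=0.4):
    """Return (match, score) pairs whose final score meets the threshold.

    The regex score is raised by an assumed fixed boost when any of the
    five words before the match is a context word.
    """
    results = []
    for match in DOB_PATTERN.finditer(text):
        preceding = re.findall(r"\w+", text[:match.start()].lower())[-5:]
        score = regex_score
        if CONTEXT_WORDS & set(preceding):
            score = min(score + boost, 1.0)
        if score >= threshold:
            results.append((match.group(), score))
    return results


with_context = detect("My date of birth is 8/01/2025.", 0.01, 0.3)
without_context = detect("The invoice is due 8/01/2025.", 0.01, 0.3)
high_regex = detect("The invoice is due 8/01/2025.", 0.8, 0.3)
```

With a regex score of 0.01, the date is reported only in the sentence that contains "date" or "birth". With a regex score of 0.8, the date in the invoice sentence is reported even though no context word is present, which is the behavior the note above warns about.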

Note: Context words only support individual words, not phrases. For example, “SSN” is treated as one word, but “Social Security Number” is treated as three separate words.

All custom patterns (Word list, Regex pattern, and Context words with Regex) are converted to regex under the hood. That means they obey regex flags consistently.

For supported flags and usage, see How to Create Custom Patterns.
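To give a sense of what converting a word list to a regex can look like, here is a minimal Python sketch. It is an assumption about the general technique, not VIDIZMO's conversion logic: the function name word_list_to_regex is hypothetical, and re.IGNORECASE is Python's own flag, used only to show how a flag changes the behavior of a converted list.

```python
import re


def word_list_to_regex(words, flags=0):
    """Build one alternation regex from a list of literal words or phrases.

    Each entry is escaped so it matches literally, and longer entries are
    tried first so a phrase is not cut short by a shorter entry.
    """
    escaped = sorted((re.escape(w) for w in words), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", flags)


words = ["Project Falcon", "MARCORP"]
sample = "MARCORP launched Project Falcon; marcorp approved."

case_sensitive = word_list_to_regex(words).findall(sample)
case_insensitive = word_list_to_regex(words, re.IGNORECASE).findall(sample)
```

Without a flag, only the exact spellings match; with the case-insensitive flag, the lowercase "marcorp" is matched too. Because the list becomes a regex, the same flag affects a word list exactly as it would a hand-written pattern.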

Overlapping detections and exclusions

When detections overlap (for example, a predefined Phone Number and a custom regex match), the Indexer keeps the result with the higher confidence.

Excluded words always take priority: if any token is listed in the Excluded Words field, the Indexer does not treat it as PII even if a predefined or custom entity would match.
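The two rules above can be combined in a short Python sketch. It is a simplified illustration rather than the Indexer's logic: the Detection class, the resolve function, the character offsets, and the sample scores are hypothetical, and exclusion is checked against the whole detected text instead of individual tokens.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    start: int   # character offset where the match begins
    end: int     # character offset just past the match
    entity: str
    text: str
    score: float


def resolve(detections, excluded_words=()):
    """Drop excluded words first, then keep the highest-scoring
    detection among any that overlap."""
    excluded = set(excluded_words)
    candidates = [d for d in detections if d.text not in excluded]
    kept = []
    for d in sorted(candidates, key=lambda d: d.score, reverse=True):
        # Keep d only if it does not overlap anything already kept.
        if all(d.end <= k.start or d.start >= k.end for k in kept):
            kept.append(d)
    return sorted(kept, key=lambda d: d.start)


detections = [
    Detection(10, 22, "Phone Number", "555-555-5555", 0.6),
    Detection(10, 22, "Custom PII", "555-555-5555", 0.9),
    Detection(30, 37, "Organization", "VIDIZMO", 0.95),
]
result = resolve(detections, excluded_words=["VIDIZMO"])
```

The custom match wins over the predefined Phone Number because its score is higher, and 'VIDIZMO' is dropped despite having the highest score because it appears in the excluded words.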
