Version: V11

Understanding PII Detection and Redaction using VIDIZMO Indexer

The VIDIZMO Indexer App uses AI models to detect Personally Identifiable Information (PII) entities in your Portal content. If Redaction is part of your VIDIZMO package, you can also use the application to redact these PII entities, and advanced processing options are available to make the redaction more accurate. The AI models used by the application support multiple languages and a set of predefined PII entities for detection.

This functionality also benefits on-premises customers: they can process their content using the VIDIZMO application within their own system, which keeps sensitive data secure by avoiding the public clouds or external storage that services like AWS and Azure require.

Concept

Transcription-based PII Detection

To redact PII from your audio or video, you need to have transcriptions generated for them. The VIDIZMO Indexer app will automatically generate the transcriptions for your audio or video if any PII entity is added to its Insights. You can also use the application itself to generate transcriptions separately or opt for other Indexing applications, such as Azure Video Analyzer ARM or AWS Indexer, provided by VIDIZMO. You can even upload your own closed caption or transcription file for the content you want to process for PII; visit How to Add Closed Captions for more information.

Once you have the transcriptions, you can then redact the PII entities that you have added to the VIDIZMO Indexer's insights. You can also detect these PII entities in the supported languages mentioned below.

During the detection process, the application will also factor in the rest of your configurations, such as the minimum confidence score for the PII detections, context keywords, excluded words, time interval threshold and original file handling. See Configuring VIDIZMO Indexer for PII Detection and Redaction for more information.

Note: After the processing is done, the VIDIZMO Indexer creates a Media Culture attribute for your audio or video file. The media culture attribute indicates the Language or Languages that the Media or Evidence consists of (or has content relating to).

OCR-based PII Detection

VIDIZMO Indexer also includes the ability to detect PII in videos and documents using Optical Character Recognition (OCR). This feature serves as an alternative for videos that contain PII but do not have audio for transcription. With OCR support, the Indexer can process text from on-screen content and detect PII entities from visible text. This ensures comprehensive PII detection regardless of whether the content includes audio.

When PII is detected within an OCR object, the OCR object is converted to a PII object according to its associated PII class (such as Person Name). In Studio Space, these objects are classified as PII objects instead of OCR objects. You can filter and view these PII objects in Studio Space as well.

Note: OCR-based PII detection and redaction is only supported in English.

OCR-based Processing

  • For videos with no audio: The VIDIZMO Indexer generates OCR for all videos, so if a video lacks audio, PII detection is performed on the extracted OCR data. The system extracts visible text from the video’s frames (such as on-screen text, captions, or any other textual content) and then performs PII detection on the extracted text. This allows the Indexer to identify and redact PII from videos even when there is no audio for transcription.

  • For documents: OCR processing is applied directly to the document to extract text. Once the OCR has extracted the text, the Indexer performs PII detection on the recognized content. This ensures that PII is detected from any text-based content, whether it’s embedded in a video or contained within a document.

Since OCR works line by line, PII redaction in documents is performed on a line-by-line basis. The system detects PII entities within each individual line of text and redacts the entire line accordingly. This ensures accurate and specific redaction for each portion of the document’s text, preserving the structure and format of the original document.
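The line-by-line behavior described above can be sketched as follows. This is a minimal illustration, not the Indexer's implementation: the phone-number pattern stands in for the AI model, and the names PHONE, redact_lines, and the "[REDACTED]" mask are hypothetical.

```python
import re

# Hypothetical stand-in for the Indexer's AI model: a simple phone pattern.
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

def redact_lines(ocr_lines, mask="[REDACTED]"):
    """Replace every OCR line that contains a detection with the mask.

    The whole line is masked, mirroring the line-level redaction
    described for documents; lines without detections are kept as-is.
    """
    return [mask if PHONE.search(line) else line for line in ocr_lines]

page = ["Customer record", "Phone: 555-555-5555", "Status: active"]
redacted = redact_lines(page)
print(redacted)
```

Only the second line is masked, so the number of lines and their order are unchanged.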

Consumption

PII detection and redaction by the VIDIZMO Indexer app utilizes AI processing as a consumption metric for your VIDIZMO Account. To learn how you can view consumption reports, refer to Consumption Reports for SaaS Deployment Overview.

Supported Languages

Here is a list of languages supported by the VIDIZMO Indexer for PII detection and redaction.

  • English
  • German
  • Spanish
  • French
  • Italian
  • Japanese
  • Korean
  • Russian
  • Swedish
  • Chinese

Note: If support for a language is unavailable, contact VIDIZMO support.

PII Entities

Here is the list of predefined PII Entities available by default.

Note: To add additional or custom PIIs for detection, you can contact VIDIZMO support.

  • Address
  • Age
  • Australian Business Number
  • Australian Company Number
  • Australian Medicare
  • Australian Tax File Number
  • Credit Card Number
  • Crypto Wallet Number
  • Date Time
  • Email Address
  • IBAN Code
  • Indian AADHAAR
  • Indian Permanent Account Number
  • IP Address
  • Italian Driver License
  • Italian Fiscal Code
  • Italian Identity Card
  • Italian Passport Number
  • Italian VAT Code
  • Medical License
  • NRP (Nationality, Religious or Political Group)
  • Organization
  • Person’s Name
  • Phone Number
  • Polish National Identification Number
  • Profession
  • Singaporean Unique Registered Entity Number
  • Spanish Personal Tax ID (Número de Identificación Fiscal)
  • UK National Health Service Number
  • Unique Identifier
  • URL
  • US Bank Number
  • US Driver License
  • US Individual Taxpayer Identification Number
  • US Passport Number
  • US Social Security Number
  • User Name
  • Zip Code

Confidence Score

The confidence score threshold is a value that the AI model uses to determine whether a detected object or word is PII. When the input text is analyzed for PII detection, the model breaks the text down into individual components called tokens and assigns a confidence score to each of them. The model then compares the score of each token with the score threshold to determine which of them is a PII entity. A word or token is classified as a PII entity if its confidence score is higher than the score threshold.

Increasing the score threshold means that the model classifies fewer detections as PII, but the ones it keeps are more likely to be accurate. You can keep the score threshold at a high value if you want to ensure that the model only picks out the tokens that it is very confident are PII. On the other hand, lowering the score threshold means that the model is likely to classify more tokens as PII; this is useful when you want to minimize the chance of the model leaving out a token that might be a PII entity.

A high confidence score threshold for PII is suitable for text that may contain fewer instances of PII. In comparison, a low confidence threshold is ideal for text that may contain more instances of PII. The confidence score can have a value from 0 to 100, but it is highly recommended that you use 45 for the best results.
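As a rough illustration of the thresholding described above, the sketch below keeps only detections whose score is higher than the threshold. The Detection class, the filter_by_threshold function, and the sample scores are hypothetical; only the 0 to 100 range, the "higher than" rule, and the recommended value of 45 come from this article.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    text: str
    entity: str
    score: float  # confidence on the 0 to 100 scale

def filter_by_threshold(detections, threshold=45):
    """Keep detections whose score is strictly higher than the threshold."""
    return [d for d in detections if d.score > threshold]

# Made-up scores, for illustration only.
candidates = [
    Detection("555-555-5555", "Phone Number", 85),
    Detection("Jordan", "Person's Name", 50),
    Detection("May", "Date Time", 30),
]

kept_default = filter_by_threshold(candidates)               # threshold 45
kept_strict = filter_by_threshold(candidates, threshold=80)  # fewer results
print([d.text for d in kept_default])
print([d.text for d in kept_strict])
```

Raising the threshold from 45 to 80 drops the borderline name detection, which is the trade-off between coverage and precision described above.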

Excluded Words

The VIDIZMO Indexer also provides you with a field where you can enter a list of words that will not be classified as PII entities. For example, if you have configured the application to detect 'Organization' as PII, then the word 'VIDIZMO' will be detected and redacted. However, if 'VIDIZMO' is present in the excluded words list, the application skips this word and does not identify it as PII.

Please ensure that you capitalize words correctly: this field is case-sensitive, and an exact match is required for a word to be excluded from PII detection. This applies even when words are spelled the same but capitalized differently. For example, if you want 'MARCORP' to not be identified as PII, then you need to add 'MARCORP' in the excluded words field. It should not be 'marcorp' or 'Marcorp', as the AI model identifies these as separate entities.
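The exact-match rule can be pictured with a small sketch. The function name apply_exclusions is hypothetical, and the sketch assumes exclusion is a plain case-sensitive string comparison, as the paragraph above describes.

```python
def apply_exclusions(detected_words, excluded_words):
    """Drop detected words that exactly match an excluded word.

    The comparison is case-sensitive: 'MARCORP' does not exclude 'Marcorp'.
    """
    excluded = set(excluded_words)
    return [word for word in detected_words if word not in excluded]

detected = ["MARCORP", "Marcorp", "marcorp"]
remaining = apply_exclusions(detected, ["MARCORP"])
print(remaining)
```

Only the exact spelling 'MARCORP' is removed; the other two capitalizations would still be treated as PII unless they are added to the list as well.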

Context Keywords

You can add context keywords that enhance the confidence score of PII entities found near them. Leveraging context words to increase the confidence score makes the PII detection more accurate. For the score enhancement to happen, a context keyword needs to be present within a range of approximately 5 to 10 words (before or after) of the PII entity. You can provide a list of the relevant context words in the Speech & Text Analyzer application, and they will be utilized to enhance the relevant PII entities.

It is recommended that you enter words relevant to the PII entities you have configured for detection for the most precision and efficiency. For instance, if you want to detect 'PHONE_NUMBER' as a PII entity, you need relevant context words such as phone, number, or contact. Take a look at the sentence below:

"Can you write down my contact? It is 555-555-5555."

In this sentence, the context word to enhance the PII detection is contact. The context word is relevant within the context of the PII.

Note: The relevancy of the context keywords is essential. It is recommended that you do not overload the list with words, as this can interfere with the detection process and prevent any confidence score enhancement.
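The following sketch shows one way a context-keyword boost could work on the example sentence above. It is an assumption-laden illustration: the function boost_with_context, the 10-word window, the boost of 35 points, and the base score of 40 are all hypothetical, and the article does not state how much the Indexer actually raises a score.

```python
import re

def boost_with_context(text, match_text, base_score, context_words,
                       window=10, boost=35):
    """Raise base_score by `boost` when a context word is near match_text.

    `window` is the number of words inspected on each side of the match;
    both `window` and `boost` are illustrative values, not product settings.
    """
    words = re.findall(r"[\w-]+", text.lower())
    target = match_text.lower()
    if target not in words:
        return base_score
    i = words.index(target)
    nearby = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
    if any(c.lower() in nearby for c in context_words):
        return min(100, base_score + boost)  # cap at the top of the scale
    return base_score

sentence = "Can you write down my contact? It is 555-555-5555."
boosted = boost_with_context(sentence, "555-555-5555", 40,
                             ["phone", "number", "contact"])
unchanged = boost_with_context(sentence, "555-555-5555", 40, ["passport"])
print(boosted, unchanged)
```

With a relevant keyword such as contact in range, the score rises above the recommended threshold of 45; with an unrelated keyword, it stays where it was.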

To see how you can perform PII detection and redaction on your Portal, visit How to Perform PII Redaction using VIDIZMO Indexer.

Creating and Managing Custom PII

Custom PII are user-defined entities that extend VIDIZMO’s built-in PII set. They work across transcript-based detection and OCR, and appear as a selectable insight called Custom PII in the VIDIZMO Indexer. When you select Custom PII, the VIDIZMO Indexer includes your configured custom entities in detection and redaction, along with any predefined PII types you select.

Concept

Custom PII are created using custom patterns that use one of three matching methods. Each produces detections with a score that is evaluated against the confidence score threshold in the VIDIZMO Indexer settings, similar to preconfigured PII.

Word list

Exact words/phrases you supply. Useful for proper nouns, codenames, and sensitive vocabulary.

Regex pattern

Custom PII created using regular expressions. Useful for structured formats (alphanumeric IDs, dates, codes).

Context Words with Regex

Useful when a pattern alone isn’t reliable enough to find matches. In this method, the regex finds matches, and then the VIDIZMO Indexer looks for nearby context words that strengthen or weaken each match. It weighs the regex match against the nearby context words and decides which to prioritize based on the regex confidence score you set for the pattern:

  • Higher values (close to 1.0): The regex pattern takes precedence. The Indexer gives more weight to the pattern match and less to context words.
  • Lower values (for example, 0.5 or 0.4): Context words take precedence. The Indexer relies more on nearby terms to confirm the match.

Example

Custom PII: Date of birth, using Context Words with Regex

Text: “My date of birth is 8/01/2025.”

  • If the regex confidence score is low, context words such as date and birth carry more weight and help confirm the detection.
  • If the regex confidence score is high, the date-format regex itself carries more weight, even if context words are weak or missing.

This setting helps you tune how strongly context words influence detection.
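To make the trade-off concrete, here is a purely hypothetical scoring rule. The article does not give the Indexer's actual formula, so the function combined_score and its linear blend are assumptions chosen only to show how a regex confidence score on a 0 to 1 scale could shift weight between the pattern and the context words.

```python
def combined_score(regex_confidence, context_found):
    """Blend pattern evidence with context evidence (hypothetical formula).

    The regex match contributes `regex_confidence`; the remaining weight,
    1 - regex_confidence, is granted only when a context word is nearby.
    """
    context_signal = 1.0 if context_found else 0.0
    return regex_confidence + (1.0 - regex_confidence) * context_signal

# High regex confidence: the pattern alone already yields a strong score.
print(combined_score(0.9, context_found=False))
# Low regex confidence: the score depends heavily on finding context words.
print(combined_score(0.4, context_found=False))
print(combined_score(0.4, context_found=True))
```

Under this assumed rule, a date-format match with a low regex confidence score needs words such as date or birth nearby to score well, while a high regex confidence score lets the match stand on its own.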

Note: All custom patterns (Word list, Regex pattern, and Context Words with Regex) are converted to regex under the hood. That means they obey regex flags consistently. For supported flags and usage, see How to Create Custom Patterns.
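A word list can be turned into a single regular expression along these lines. This is a sketch using Python's re module, not the Indexer's own conversion; the helper word_list_to_regex and the sample phrases are hypothetical, and the flags the Indexer supports are documented separately.

```python
import re

def word_list_to_regex(words, flags=0):
    """Compile a list of exact words or phrases into one regex.

    Each entry is escaped so it matches literally, and longer entries are
    tried first so a phrase wins over a shorter word it contains.
    """
    ordered = sorted(words, key=len, reverse=True)
    alternation = "|".join(re.escape(word) for word in ordered)
    return re.compile(rf"\b(?:{alternation})\b", flags)

pattern = word_list_to_regex(["Project Falcon", "Falcon"], flags=re.IGNORECASE)
matches = pattern.findall("project falcon ships before Falcon-2 is announced.")
print(matches)

# Without the IGNORECASE flag, matching is case-sensitive.
strict = word_list_to_regex(["Falcon"]).findall("falcon and Falcon")
print(strict)
```

Because the list ends up as an ordinary regex, a flag such as case-insensitivity applies to word lists in the same way it applies to hand-written patterns.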

Overlapping detections and exclusions

When detections overlap (for example, a predefined Phone Number and a custom regex match), the Indexer keeps the result with the higher confidence.

Excluded words always take priority: if any token is listed in the Excluded Words field, the Indexer does not treat it as PII, even if a predefined or custom entity would match.
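The two rules in this section can be sketched together as follows. The Span class, the resolve function, and the sample spans and scores are hypothetical; the sketch assumes that detections are character ranges and that exclusion is an exact match on the detected text.

```python
from dataclasses import dataclass

@dataclass
class Span:
    start: int   # start offset in the text
    end: int     # end offset (exclusive)
    entity: str
    score: float
    text: str

def resolve(spans, excluded_words=()):
    """Drop excluded words, then keep the highest-scoring span per overlap."""
    excluded = set(excluded_words)
    kept = []
    # Visit the highest scores first so the best candidate claims its region.
    for span in sorted(spans, key=lambda s: s.score, reverse=True):
        if span.text in excluded:
            continue  # excluded words are never treated as PII
        if all(span.end <= k.start or span.start >= k.end for k in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s.start)

spans = [
    Span(10, 22, "Phone Number", 70, "555-555-5555"),
    Span(10, 22, "Custom ID", 90, "555-555-5555"),
    Span(0, 7, "Organization", 95, "VIDIZMO"),
]
result = resolve(spans, excluded_words=["VIDIZMO"])
print([(s.entity, s.score) for s in result])
```

The excluded organization name is dropped despite having the highest score, and of the two overlapping detections only the higher-scoring custom match remains.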