Version: V11

Data Extractor Node

The Data Extractor Node extracts specific fields from structured data objects and converts them into flat text or a structured format for downstream processing. It supports dot notation for accessing nested fields and controls formatting through separators, field-name labels (metadata), and record identifiers. Extracting only the fields you need reduces data volume and keeps downstream processing focused on relevant information.

How It Works

When the node executes, it reads objects from the input variable and extracts the specified fields from each object using dot notation paths. For nested data like arrays within objects, the node automatically handles iteration and flattening based on configuration.

Two output modes are available: flat text format that concatenates all extracted values into a single string suitable for LLM prompts, or structured format that preserves the original object hierarchy for programmatic processing. The node applies filtering and formatting rules during extraction, removing empty values, truncating long fields, and adding separators or metadata labels as configured.

Record identifiers like [Document #1] help track which extracted content came from which source object, making it easier to reference specific documents in LLM responses.

Configuration Parameters

Input field

Input Field (Text, Required): Workflow variable containing structured data.

The node expects objects or arrays of objects with extractable fields. Primitive types like strings or numbers are not supported.

Output field

Output Field (Text, Required): Workflow variable where extracted data is stored.

The output is either flat text (when Preserve Structure is disabled) or structured data (when Preserve Structure is enabled).

Common naming patterns: extracted_text, extracted_data, formatted_content.

Fields to extract

Fields to Extract (Array, Required): Field names to extract using dot notation for nested fields.

Each field path specifies the exact location of data within the object structure. Example: Title extracts a top-level field, while Content.ContentFiles.url navigates through nested objects. The node supports complex nested structures and handles arrays at any level of the path.
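To make the path semantics concrete, here is a minimal Python sketch of dot-notation lookup that maps over arrays met along the path. It is an illustration only, not the node's implementation: the function name extract_path and the sample record are hypothetical, and returning None for a missing path is an assumption that mirrors the null behavior described under Limitations.

```python
from typing import Any


def extract_path(data: Any, path: str) -> Any:
    """Resolve a dot-notation path, mapping over any lists met along the way."""
    keys = path.split(".")

    def walk(node: Any, remaining: list[str]) -> Any:
        if not remaining:
            return node
        if isinstance(node, list):
            # Apply the rest of the path to every element of the array.
            return [walk(item, remaining) for item in node]
        if isinstance(node, dict):
            return walk(node.get(remaining[0]), remaining[1:])
        return None  # The path does not exist at this level.

    return walk(data, keys)


# Hypothetical sample record, shaped like the example paths above.
record = {
    "Title": "Quarterly report",
    "Content": {"ContentFiles": [{"url": "a.pdf"}, {"url": "b.pdf"}]},
}

title = extract_path(record, "Title")                     # "Quarterly report"
urls = extract_path(record, "Content.ContentFiles.url")   # ["a.pdf", "b.pdf"]
missing = extract_path(record, "Content.Author.name")     # None
```

Running a check like this against a small sample of your own data is a quick way to confirm that a path reaches the level you expect before wiring it into a workflow.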

Preserve structure

Preserve Structure (Toggle, Default: false): Maintain original object structure or flatten to text.

Mode      Output format                                        Use when
Disabled  Flat text with configurable separators and metadata  Preparing data for LLM prompts, creating readable summaries
Enabled   Structured objects/arrays preserving hierarchy       Passing data to nodes for programmatic processing, maintaining relationships

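When Preserve Structure is disabled, additional formatting options become available: Include Metadata, Field Separator, Object Separator, Max Field Length, and Include Record IDs.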

Keep parent structure

Keep Parent Structure (Toggle, Optional): Keep full nested path or use only last key.

Only applicable when Preserve Structure is enabled. When enabled, preserves complete hierarchy like {"content": {"contentDetails": {...}}}. When disabled, uses only the last key like {"contentFiles": [...]}.
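The difference between the two settings can be sketched in a few lines of Python. This is a hedged illustration under assumed semantics, not the node's code: extract_structured, its keep_parent argument, and the sample record are hypothetical names chosen for the example.

```python
from typing import Any


def extract_structured(obj: dict, path: str, keep_parent: bool) -> dict:
    """Return the value at `path`, wrapped in the full path or only the last key."""
    keys = path.split(".")
    value: Any = obj
    for key in keys:
        value = value.get(key) if isinstance(value, dict) else None

    if not keep_parent:
        # Only the last key of the path is kept as the wrapper.
        return {keys[-1]: value}

    # Rebuild the full hierarchy from the innermost key outward.
    nested: Any = value
    for key in reversed(keys):
        nested = {key: nested}
    return nested


# Hypothetical sample record.
record = {"title": "Report", "content": {"contentFiles": [{"url": "a.pdf"}]}}

with_parent = extract_structured(record, "content.contentFiles", keep_parent=True)
# {"content": {"contentFiles": [{"url": "a.pdf"}]}}

last_key_only = extract_structured(record, "content.contentFiles", keep_parent=False)
# {"contentFiles": [{"url": "a.pdf"}]}
```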

Include metadata

Include Metadata (Toggle, Optional): Include field names as labels in extracted text.

Only applicable when Preserve Structure is disabled. When enabled, output includes labels like Name: John, Age: 30. When disabled, output contains only values like John, 30. Labels help LLMs understand field context but increase token usage.
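As a rough sketch of the two label settings, the following Python reproduces the example outputs quoted above. The function format_fields is hypothetical, and the "Name: value" label format with a comma separator is taken from the example text rather than from the node's source.

```python
def format_fields(fields: dict[str, object], include_metadata: bool) -> str:
    """Join field values with ", ", optionally prefixing each with its field name."""
    if include_metadata:
        parts = [f"{name}: {value}" for name, value in fields.items()]
    else:
        parts = [str(value) for value in fields.values()]
    return ", ".join(parts)


person = {"Name": "John", "Age": 30}

labeled = format_fields(person, include_metadata=True)   # "Name: John, Age: 30"
plain = format_fields(person, include_metadata=False)    # "John, 30"
```

The labeled form is longer by the length of each field name plus the ": " delimiter, which is where the extra token cost comes from.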

Field separator

Field Separator (Dropdown, Default: Comma): Separator between fields within a single object.

Only applicable when Preserve Structure is disabled.

Separator  Output example               Use when
Space      John 30 Engineer             Creating compact output
Comma      John, 30, Engineer           Standard CSV-style formatting
New Line   Each field on separate line  Vertical layout for readability
Pipe       John | 30 | Engineer         Clear visual separation

Object separator

Object Separator (Dropdown, Default: New Line): Separator between multiple objects.

Only applicable when Preserve Structure is disabled.

Separator        Output example                  Use when
New Line         Each object on separate line    Standard line-by-line output
Double New Line  Blank line between objects      Better readability with spacing
Triple New Line  Two blank lines between objects Document-style separation
Space            Inline format                   Compact output
Comma            CSV-style                       List format
Pipe             Clear visual boundaries         Strong field separation
Section (---)    Strong visual separation        Document sections
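The way the field separator and object separator combine can be sketched as two nested joins. The Python below is illustrative only. The exact strings behind each dropdown option are assumptions (for example, ", " for Comma and "\n---\n" for Section); only the Space, Comma, and Pipe field examples are taken from the tables above. The names FIELD_SEPARATORS, OBJECT_SEPARATORS, and join_records are hypothetical.

```python
# Assumed mapping from dropdown labels to literal separator strings.
FIELD_SEPARATORS = {"Space": " ", "Comma": ", ", "New Line": "\n", "Pipe": " | "}
OBJECT_SEPARATORS = {
    "New Line": "\n",
    "Double New Line": "\n\n",
    "Triple New Line": "\n\n\n",
    "Space": " ",
    "Comma": ", ",
    "Pipe": " | ",
    "Section": "\n---\n",
}


def join_records(records, field_sep="Comma", object_sep="New Line"):
    """Join values within each record, then join the records themselves."""
    inner = FIELD_SEPARATORS[field_sep]
    outer = OBJECT_SEPARATORS[object_sep]
    return outer.join(inner.join(str(v) for v in record) for record in records)


records = [["John", 30, "Engineer"], ["Ana", 41, "Designer"]]

default_text = join_records(records)
# "John, 30, Engineer\nAna, 41, Designer"

piped = join_records(records, field_sep="Pipe", object_sep="Double New Line")
# "John | 30 | Engineer\n\nAna | 41 | Designer"
```

Choosing an object separator that is visually stronger than the field separator keeps record boundaries unambiguous in the flat text.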

Max field length

Max Field Length (Number, Optional): Maximum characters per field.

Only applicable when Preserve Structure is disabled. Fields exceeding this length are truncated with ... appended. Leave empty for no limit. Setting a limit prevents individual fields from consuming excessive tokens in LLM prompts.
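A minimal sketch of this truncation rule in Python follows. It assumes the limit counts characters kept before the ... marker is appended, which the description above does not state explicitly; truncate_field is a hypothetical name.

```python
from typing import Optional


def truncate_field(value: str, max_length: Optional[int]) -> str:
    """Cut `value` to `max_length` characters and append "..." if it was longer."""
    if max_length is None or len(value) <= max_length:
        return value  # No limit set, or the field already fits.
    return value[:max_length] + "..."


short = truncate_field("Engineer", 20)              # "Engineer"
cut = truncate_field("The quick brown fox", 7)      # "The qui..."
unlimited = truncate_field("The quick brown fox", None)
```

The second call shows that a cut made purely by character count can land in the middle of a word.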

Include record IDs

Include Record IDs (Toggle, Optional): Include identifiers like [Document #1] with each object.

Only applicable when Preserve Structure is disabled. Record IDs help track and reference specific documents in LLM responses.

Record ID prefix

Record ID Prefix (Text, Optional): Prefix text for record identifiers.

Only applicable when Include Record IDs is enabled. Results in identifiers like [Document #1], [Profile #1], or [Resume #1]. Leave empty to use just the number like [#1].

Record ID field

Record ID Field (Text, Optional): Field name to use as identifier instead of sequential numbers.

Only applicable when Include Record IDs is enabled. If specified and the field exists, the node uses that value (e.g., CaseId, DocumentId). Falls back to sequential numbering if the field doesn't exist.
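The prefix and fallback rules can be sketched together. This Python is an illustration under assumptions: the label format when a Record ID Field value is used (for example [Document #C-17]) is a guess modeled on the sequential format, and record_label is a hypothetical helper.

```python
from typing import Optional


def record_label(obj: dict, index: int, prefix: str = "",
                 id_field: Optional[str] = None) -> str:
    """Build a label like [Document #1], preferring a field value over the index."""
    if id_field and obj.get(id_field) is not None:
        identifier = obj[id_field]
    else:
        identifier = index  # Fall back to the 1-based sequential number.
    body = f"{prefix} #{identifier}" if prefix else f"#{identifier}"
    return f"[{body}]"


from_field = record_label({"CaseId": "C-17"}, 1, "Document", "CaseId")  # "[Document #C-17]"
fallback = record_label({"Title": "x"}, 2, "Document", "CaseId")        # "[Document #2]"
no_prefix = record_label({"Title": "x"}, 3)                             # "[#3]"
```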

Filter empty values

Filter Empty Values (Toggle, Default: false): Exclude null or empty fields from output.

Skips fields with null values, empty strings, or empty collections. Creates cleaner output and prevents LLMs from processing irrelevant null values.
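One plausible reading of "empty" is sketched below in Python. It treats None, empty strings, and empty collections as empty, while keeping values such as 0 and False; whether the node treats those falsy values the same way is an assumption. The helpers is_empty and filter_empty are hypothetical.

```python
def is_empty(value: object) -> bool:
    """True for None, "", and empty lists, dicts, tuples, or sets."""
    if value is None or value == "":
        return True
    return isinstance(value, (list, dict, tuple, set)) and len(value) == 0


def filter_empty(fields: dict) -> dict:
    """Drop fields whose value counts as empty; keep 0 and False."""
    return {name: value for name, value in fields.items() if not is_empty(value)}


raw = {"Name": "John", "Nickname": "", "Manager": None, "Tags": [], "Age": 0}
cleaned = filter_empty(raw)  # {"Name": "John", "Age": 0}
```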

Common parameters

This node supports common parameters shared across workflow nodes, including Stream Output Response, Streaming Messages, Logging Mode, and Wait For All Edges. For detailed information, see Common Parameters.

Best practices

  • Extract only needed fields rather than entire objects to reduce token usage and improve LLM performance
  • Test field paths with a small dataset to ensure correct extraction level in nested arrays
  • Enable Include Metadata when sending data to LLMs that need field context; disable it when the values are self-explanatory to save tokens
  • Enable Filter Empty Values to create cleaner output and prevent LLMs from processing null values
  • Apply Max Field Length limits for fields like descriptions that may vary significantly in length
  • Enable Include Record IDs with meaningful prefixes so LLMs can cite specific sources in responses

Limitations

  • Input type restriction: Input must be an object (dictionary) or an array of objects with extractable fields. Primitive types like strings, numbers, or booleans are not supported.
  • Field path validation: Field paths that don't exist in an object return null values. Verify that paths match your data structure.
  • Array flattening behavior: When extracting a single field from nested arrays with Preserve Structure disabled, all values are completely flattened into a single array.
  • Max field length truncation: Length limits cut fields at a fixed character count without word boundary awareness, so a word may be split mid-way.
  • Record ID fallback: When Record ID Field is specified but doesn't exist in objects, the node falls back to sequential numbering.