Data Extractor Node
The Data Extractor Node extracts specific fields from structured data objects and converts them into either flat text or a structured format for downstream processing. It supports dot notation for accessing nested fields and provides formatting control through separators, metadata labels, and record identifiers. Extracting only the selected fields reduces data volume and focuses downstream processing on relevant information.
How It Works
When the node executes, it reads objects from the input variable and extracts the specified fields from each object using dot notation paths. For nested data like arrays within objects, the node automatically handles iteration and flattening based on configuration.
Two output modes are available: a flat text format that concatenates all extracted values into a single string suitable for LLM prompts, and a structured format that preserves the original object hierarchy for programmatic processing. During extraction, the node applies the configured filtering and formatting rules: removing empty values, truncating long fields, and adding separators or metadata labels.
Record identifiers like [Document #1] help track which extracted content came from which source object, making it easier to reference specific documents in LLM responses.
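To make the two output modes concrete, the following Python sketch produces both forms from the same input. It is an illustration only, not the node's implementation: the function names, the sample records, and the exact position of the record identifier in the flat text are assumptions.

```python
from typing import Any

def extract_flat(records: list[dict[str, Any]], fields: list[str],
                 prefix: str = "Document") -> str:
    """Flat text mode: one line per object, prefixed with a record identifier.

    Assumption: the identifier precedes the values on the same line.
    """
    lines = []
    for index, record in enumerate(records, start=1):
        values = ", ".join(str(record.get(name)) for name in fields)
        lines.append(f"[{prefix} #{index}] {values}")
    return "\n".join(lines)

def extract_structured(records: list[dict[str, Any]],
                       fields: list[str]) -> list[dict[str, Any]]:
    """Structured mode: keep only the requested fields, still as objects."""
    return [{name: record.get(name) for name in fields} for record in records]

# Hypothetical sample data
docs = [
    {"Title": "Q1 report", "Author": "Ana", "Pages": 12},
    {"Title": "Q2 report", "Author": "Ben", "Pages": 9},
]

flat = extract_flat(docs, ["Title", "Author"])
structured = extract_structured(docs, ["Title", "Author"])
print(flat)
print(structured)
```

The flat string is what would typically be placed into an LLM prompt, while the structured list can be passed on to nodes that process data programmatically.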
Configuration Parameters
Input field
Input Field (Text, Required): Workflow variable containing structured data.
The node expects objects or arrays of objects with extractable fields. Primitive types like strings or numbers are not supported.
Output field
Output Field (Text, Required): Workflow variable where extracted data is stored.
The output is either flat text (when Preserve Structure is disabled) or structured data (when Preserve Structure is enabled).
Common naming patterns: extracted_text, extracted_data, formatted_content.
Fields to extract
Fields to Extract (Array, Required): Field names to extract using dot notation for nested fields.
Each field path specifies the exact location of data within the object structure. For example, Title extracts a top-level field, while Content.ContentFiles.url navigates through nested objects. The node supports complex nested structures and handles arrays at any level of the path.
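One way to picture dot-notation resolution across arrays is the sketch below. It is a simplified model under stated assumptions, not the node's actual algorithm: resolve_path and _step are hypothetical helpers, a missing key yields None, and values found inside arrays are collected into one flat list.

```python
from typing import Any

def _step(node: Any, key: str) -> Any:
    """Apply one path segment to a node, iterating over lists when present."""
    if isinstance(node, list):
        results = []
        for item in node:
            value = _step(item, key)
            if isinstance(value, list):
                results.extend(value)      # flatten nested lists (assumption)
            elif value is not None:
                results.append(value)
        return results
    if isinstance(node, dict):
        return node.get(key)               # missing key -> None
    return None                            # primitives have no fields

def resolve_path(obj: Any, path: str) -> Any:
    """Resolve a dot-notation path such as 'Content.ContentFiles.url'."""
    current = obj
    for key in path.split("."):
        current = _step(current, key)
    return current

# Hypothetical sample object
record = {
    "Title": "Spec",
    "Content": {"ContentFiles": [{"url": "a.pdf"}, {"url": "b.pdf"}]},
}

print(resolve_path(record, "Title"))
print(resolve_path(record, "Content.ContentFiles.url"))
print(resolve_path(record, "Content.Missing.url"))
```

Testing paths against a small sample like this before running a full dataset helps confirm that the path reaches the intended level of a nested array.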
Preserve structure
Preserve Structure (Toggle, Default: false): Maintain original object structure or flatten to text.
| Mode | Output format | Use when |
|---|---|---|
| Disabled | Flat text with configurable separators and metadata | Preparing data for LLM prompts, creating readable summaries |
| Enabled | Structured objects/arrays preserving hierarchy | Passing data to nodes for programmatic processing, maintaining relationships |
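When Preserve Structure is disabled, the text formatting options become available: Include Metadata, Field Separator, Object Separator, Max Field Length, and Include Record IDs.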
Keep parent structure
Keep Parent Structure (Toggle, Optional): Keep full nested path or use only last key.
Only applicable when Preserve Structure is enabled. When enabled, preserves complete hierarchy like {"content": {"contentDetails": {...}}}. When disabled, uses only the last key like {"contentFiles": [...]}.
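The difference between the two settings can be sketched in Python as follows. This is a minimal illustration that assumes the path traverses plain objects only (no arrays along the way); extract_structured and the sample record are hypothetical.

```python
from typing import Any

def extract_structured(obj: dict[str, Any], path: str, keep_parent: bool) -> dict[str, Any]:
    """Return the value at `path`, wrapped in the full path or in the last key only."""
    keys = path.split(".")
    value: Any = obj
    for key in keys:
        value = value.get(key) if isinstance(value, dict) else None
    if not keep_parent:
        return {keys[-1]: value}           # only the last key
    nested: Any = value
    for key in reversed(keys):             # rebuild the hierarchy from the inside out
        nested = {key: nested}
    return nested

# Hypothetical sample object
record = {"content": {"contentFiles": [{"url": "a.pdf"}], "owner": "ops"}}

with_parents = extract_structured(record, "content.contentFiles", keep_parent=True)
last_key_only = extract_structured(record, "content.contentFiles", keep_parent=False)
print(with_parents)
print(last_key_only)
```

Keeping the parent structure is useful when downstream nodes expect the original shape; using only the last key gives a shorter object when the hierarchy carries no extra meaning.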
Include metadata
Include Metadata (Toggle, Optional): Include field names as labels in extracted text.
Only applicable when Preserve Structure is disabled. When enabled, output includes labels like Name: John, Age: 30. When disabled, output contains only values like John, 30. Labels help LLMs understand field context but increase token usage.
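The effect of the toggle can be reproduced with a short Python sketch. The format_fields helper is hypothetical and assumes the default Comma field separator; it only mirrors the labeled and unlabeled outputs described above.

```python
from typing import Any

def format_fields(record: dict[str, Any], fields: list[str], include_metadata: bool) -> str:
    """Join field values, optionally prefixing each with its field name."""
    parts = []
    for name in fields:
        value = record.get(name)
        parts.append(f"{name}: {value}" if include_metadata else str(value))
    return ", ".join(parts)

person = {"Name": "John", "Age": 30}

labeled = format_fields(person, ["Name", "Age"], include_metadata=True)
unlabeled = format_fields(person, ["Name", "Age"], include_metadata=False)
print(labeled)    # Name: John, Age: 30
print(unlabeled)  # John, 30
```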
Field separator
Field Separator (Dropdown, Default: Comma): Separator between fields within a single object.
Only applicable when Preserve Structure is disabled.
| Separator | Output example | Use when |
|---|---|---|
| Space | John 30 Engineer | Creating compact output |
| Comma | John, 30, Engineer | Standard CSV-style formatting |
| New Line | Each field on separate line | Vertical layout for readability |
| Pipe | John \| 30 \| Engineer | Clear visual separation |
Object separator
Object Separator (Dropdown, Default: New Line): Separator between multiple objects.
Only applicable when Preserve Structure is disabled.
| Separator | Output layout | Use when |
|---|---|---|
| New Line | Each object on separate line | Standard line-by-line output |
| Double New Line | Blank line between objects | Better readability with spacing |
| Triple New Line | Two blank lines between objects | Document-style separation |
| Space | Inline format | Compact output |
| Comma | CSV-style | List format |
| Pipe | Clear visual boundaries | Strong object separation |
| Section (---) | Strong visual separation | Document sections |
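The field and object separators compose: fields are joined within each object first, then the resulting strings are joined across objects. The Python sketch below illustrates this. The literal strings mapped to each dropdown option are assumptions inferred from the examples in this section (in particular the surrounding spaces and the newlines around ---), and join_records is a hypothetical helper.

```python
# Assumed mapping from dropdown options to literal separator strings
FIELD_SEPARATORS = {
    "Space": " ",
    "Comma": ", ",
    "New Line": "\n",
    "Pipe": " | ",
}
OBJECT_SEPARATORS = {
    "New Line": "\n",
    "Double New Line": "\n\n",
    "Triple New Line": "\n\n\n",
    "Space": " ",
    "Comma": ", ",
    "Pipe": " | ",
    "Section (---)": "\n---\n",
}

def join_records(rows: list[list[str]], field_sep: str = "Comma",
                 object_sep: str = "New Line") -> str:
    """Join values within each object, then join the objects."""
    lines = [FIELD_SEPARATORS[field_sep].join(row) for row in rows]
    return OBJECT_SEPARATORS[object_sep].join(lines)

# Hypothetical extracted values for two objects
rows = [["John", "30", "Engineer"], ["Mary", "41", "Designer"]]

default_text = join_records(rows)
spaced_text = join_records(rows, field_sep="Pipe", object_sep="Double New Line")
print(default_text)
print(spaced_text)
```

Choosing different separators for fields and objects (for example Pipe within an object and Double New Line between objects) keeps record boundaries unambiguous in the resulting text.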
Max field length
Max Field Length (Number, Optional): Maximum characters per field.
Only applicable when Preserve Structure is disabled. Fields exceeding this length are truncated with ... appended. Leave empty for no limit. Prevents individual fields from consuming excessive tokens in LLM prompts.
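Truncation can be modeled with the short Python sketch below. The truncate_field helper is hypothetical, and it assumes the limit counts characters of the original value before the ... suffix is appended; the node may count the suffix differently.

```python
from typing import Optional

def truncate_field(value: str, max_length: Optional[int] = None) -> str:
    """Cut a value at max_length characters and append '...' if it was longer."""
    if max_length is None or len(value) <= max_length:
        return value                       # no limit set, or already short enough
    return value[:max_length] + "..."      # plain character cut, no word-boundary check

short = truncate_field("Engineer", 10)
cut = truncate_field("Senior software engineer", 10)
print(short)  # Engineer
print(cut)    # Senior sof...
```

The second call shows a cut in the middle of a word, so limits are best chosen with some margin for fields where trailing context matters.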
Include record IDs
Include Record IDs (Toggle, Optional): Include identifiers like [Document #1] with each object.
Only applicable when Preserve Structure is disabled. Record IDs help track and reference specific documents in LLM responses.
Record ID prefix
Record ID Prefix (Text, Optional): Prefix text for record identifiers.
Only applicable when Include Record IDs is enabled. Results in identifiers like [Document #1], [Profile #1], or [Resume #1]. Leave empty to use just the number like [#1].
Record ID field
Record ID Field (Text, Optional): Field name to use as identifier instead of sequential numbers.
Only applicable when Include Record IDs is enabled. If specified and the field exists, the node uses that value (e.g., CaseId, DocumentId). Falls back to sequential numbering if the field doesn't exist.
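The interaction between the prefix, the ID field, and the sequential fallback can be sketched in Python as follows. The record_id helper is hypothetical, and the way a field value is rendered inside the brackets (for example [Case #C-17]) is an assumption; only the sequential forms such as [Document #1] and [#1] are taken from the descriptions above.

```python
from typing import Any, Optional

def record_id(record: dict[str, Any], index: int, prefix: str = "",
              id_field: Optional[str] = None) -> str:
    """Build a record identifier, preferring id_field and falling back to index."""
    if id_field and id_field in record:
        number = record[id_field]          # use the value from the record
    else:
        number = index                     # fall back to sequential numbering
    label = f"{prefix} #{number}" if prefix else f"#{number}"
    return f"[{label}]"

from_field = record_id({"CaseId": "C-17"}, 1, prefix="Case", id_field="CaseId")
fallback = record_id({"Title": "Untitled"}, 2, prefix="Document", id_field="CaseId")
no_prefix = record_id({}, 3)
print(from_field, fallback, no_prefix)
```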
Filter empty values
Filter Empty Values (Toggle, Default: false): Exclude null or empty fields from output.
Skips fields with null values, empty strings, or empty collections. Creates cleaner output and prevents LLMs from processing irrelevant null values.
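A minimal Python sketch of this filter is shown below. The is_empty and filter_empty helpers are hypothetical. The sketch treats only None, empty strings, and empty collections as empty, following the description above; keeping 0, False, and whitespace-only strings is an assumption about behavior the description does not cover.

```python
from typing import Any

def is_empty(value: Any) -> bool:
    """True for None, empty strings, and empty collections."""
    if value is None:
        return True
    if isinstance(value, (str, list, tuple, dict, set)):
        return len(value) == 0
    return False                           # numbers and booleans are kept, even 0 / False

def filter_empty(fields: dict[str, Any]) -> dict[str, Any]:
    """Drop fields whose values are considered empty."""
    return {name: value for name, value in fields.items() if not is_empty(value)}

# Hypothetical extracted fields
extracted = {"Name": "John", "Phone": None, "Bio": "", "Tags": [], "Age": 0, "Active": False}

cleaned = filter_empty(extracted)
print(cleaned)
```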
Common parameters
This node supports common parameters shared across workflow nodes, including Stream Output Response, Streaming Messages, Logging Mode, and Wait For All Edges. For detailed information, see Common Parameters.
Best practices
- Extract only needed fields rather than entire objects to reduce token usage and improve LLM performance
- Test field paths with a small dataset to ensure correct extraction level in nested arrays
- Enable Include Metadata when LLMs need field names for context; disable it when the values are self-explanatory to save tokens
- Enable Filter Empty Values to create cleaner output and prevent LLMs from processing null values
- Apply Max Field Length limits for fields like descriptions that may vary significantly in length
- Enable Include Record IDs with meaningful prefixes so LLMs can cite specific sources in responses
Limitations
- Input type restriction: Input must be objects or dictionaries with extractable fields. Primitive types like strings, numbers, or booleans are not supported.
- Field path validation: Invalid field paths that don't exist in objects return null values. Verify paths match your data structure.
- Array flattening behavior: When extracting a single field from nested arrays with Preserve Structure disabled, all values are completely flattened into a single array.
- Max field length truncation: Length limits cut fields at the character count with no word-boundary awareness, so a word may be cut in the middle.
- Record ID fallback: When Record ID Field is specified but doesn't exist in objects, the node falls back to sequential numbering.