Text Splitter Node
The Text Splitter Node divides large text documents into smaller, manageable chunks, measuring chunk length in tokens (model-accurate) or characters. It supports recursive (semantic) and fixed (simple) splitting strategies and can automatically calculate chunk sizes from a model's context window. A cross-chunking mode combines small objects, such as subtitle segments, into larger chunks while preserving metadata ranges.
How It Works
When the node executes, it reads text from the specified input variable and applies the configured splitting strategy. Recursive splitting tries multiple separators hierarchically (paragraphs, sentences, words) to keep semantic units intact, while fixed splitting uses a single separator for straightforward division.
The node can automatically calculate optimal chunk sizes based on the model's context window and reserve ratio, or accept explicit sizes manually. Chunks are created with optional overlap to preserve context at boundaries. Token-based measurement uses model-specific tokenizers for accurate counting, while character-based measurement uses simple string length for faster processing.
Two modes are supported: standard text chunking for plain text, and cross-chunking for combining multiple small objects (like subtitle segments or PDF pages) into larger chunks while preserving metadata ranges.
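Conceptually, the recursive strategy behaves like the simplified character-based sketch below. This is an illustration only: the actual node also applies overlap, separator retention, and minimum-size merging.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split on the first separator; recurse into oversized pieces
    using the remaining separators."""
    if len(text) <= chunk_size:
        return [text]
    if not separators:
        # No separators left: fall back to fixed-size slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks: list[str] = []
    for piece in text.split(sep):
        if len(piece) <= chunk_size:
            if piece:
                chunks.append(piece)
        else:
            chunks.extend(recursive_split(piece, rest, chunk_size))
    return chunks

print(recursive_split("Para one.\n\nPara two is much longer.", ["\n\n", "\n", "."], 40))
```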
Configuration Parameters
Input field
Input Field (Text, Required): Workflow variable containing text content.
The node expects text as a string, or an array of objects when cross-chunking is enabled. For cross-chunking, provide objects containing text and metadata fields.
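For illustration, the two accepted input shapes might look like this (field names such as start_time are examples; the actual paths are configured under Cross-chunking configuration below):

```python
# Standard chunking: the input variable holds a plain string.
input_text = "A long document to be split into chunks..."

# Cross-chunking: an array of small objects carrying text plus metadata.
# Field names here are illustrative; map them via Object Mapping.
subtitle_segments = [
    {"text": "Welcome to the show.", "start_time": 0.0, "end_time": 2.5},
    {"text": "Today: text splitting.", "start_time": 2.5, "end_time": 5.1},
]
```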
Output field
Output Field (Text, Required): Workflow variable where chunks are stored.
The output is an array of text chunks (strings) for standard chunking, or an array of combined objects for cross-chunking. Each chunk respects configured size limits and overlap settings.
Common naming patterns: chunks, text_chunks, combined_segments.
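Hypothetical outputs for the two modes (the combined-object shape is an assumption based on the metadata-range behavior described under Cross-chunking configuration):

```python
# Standard chunking: an array of strings.
chunks = [
    "First chunk of text ...",
    "... overlapping tail of first chunk, then second chunk ...",
]

# Cross-chunking: combined objects whose metadata spans the merged range.
combined_segments = [
    {"text": "Welcome to the show. Today: text splitting.",
     "start_time": 0.0, "end_time": 5.1},
]
```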
Splitting mode
Splitting Mode (Dropdown, Default: Character-based): How to measure chunk length.
| Mode | How it works | Best for | Performance |
|---|---|---|---|
| Character-based | Uses simple string length | Embeddings, simple text processing | Fastest |
| Token-based | Uses tiktoken or HuggingFace tokenizers | LLM processing, RAG, accurate token limits | Slower |
Token-based ensures chunks fit within model token limits; character-based provides faster processing when exact counts aren't critical.
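To see the difference, here is how the same string measures both ways, assuming the tiktoken package is installed:

```python
import tiktoken

text = "Text splitters measure length in characters or tokens."
char_count = len(text)                      # character-based measurement
enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4-class models
token_count = len(enc.encode(text))         # token-based measurement
print(char_count, token_count)              # token count is far smaller
```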
Splitting strategy
Splitting Strategy (Dropdown, Default: Recursive): How to divide text into chunks.
| Strategy | How it works | Best for |
|---|---|---|
| Recursive (Semantic) | Tries multiple separators hierarchically | Most use cases - preserves paragraphs, sentences, meaning |
| Fixed (Simple) | Straightforward fixed-size splitting | Simple splitting where semantic boundaries don't matter |
Recursive strategy is recommended for most applications as it maintains document structure and semantic coherence.
Chunk size mode
Chunk Size Mode (Dropdown, Default: Manual): How to determine chunk size.
| Mode | How it works | Use when |
|---|---|---|
| Manual (Explicit) | Use explicit Chunk Size value | Precise control over chunk size needed |
| Auto (Calculated) | Calculates from Model Context Window × (1 - Token Reserve Ratio) | Chunks optimized for model capacity |
Chunk size
Chunk Size (Number, Conditional): Maximum size per chunk.
Required for Manual size mode. Units depend on Splitting Mode: tokens if Token-based, characters if Character-based. Range: 100-200,000. Variable interpolation with ${chunk_size} is supported.
Model context window
Model Context Window (Number, Conditional): Model's maximum context window in tokens.
Required for Auto size mode. Common values: GPT-4: 8,192; GPT-4-turbo: 128,000; Claude-3: 200,000; Llama-3.2: 128,000. Range: 1-200,000 tokens.
Token reserve ratio
Token Reserve Ratio (Number, Default: 0.3): Ratio of tokens to reserve for prompts/responses (0.0-0.9).
Formula: Chunk Size = Model Context Window × (1 - Token Reserve Ratio). Examples: 0.3 (good for RAG/Q&A), 0.2 (good for summarization), 0.4 (good for complex prompts).
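A worked example of the Auto calculation:

```python
context_window = 128_000                 # e.g. GPT-4-turbo
reserve_ratio = 0.3                      # keep 30% for prompt and response
chunk_size = int(context_window * (1 - reserve_ratio))
print(chunk_size)                        # 89600 tokens available per chunk
```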
Overlap size
Overlap Size (Number, Default: 0): Overlapping units between consecutive chunks.
Helps preserve context at chunk boundaries. Typical values: 10-20% of chunk size. Range: 0-10,000. Variable interpolation with ${overlap} is supported.
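A character-based illustration of how overlap carries context across boundaries (the node applies the same idea in tokens when Token-based mode is selected):

```python
text = "A long passage of prose, repeated many times. " * 20  # stand-in document
chunk_size, overlap = 100, 20
step = chunk_size - overlap           # each new chunk starts 80 characters on
chunks = [text[i:i + chunk_size] for i in range(0, len(text), step)]
# The last 20 characters of each chunk reappear at the start of the next,
# so a sentence cut at a boundary stays readable in the following chunk.
```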
Split separators
Split Separators (Array, Default: ["\n\n", "\n", "."]): Separators to split on, in priority order.
Used by Recursive strategy. Maximum 10 separators. Use \\n for newlines, \\t for tabs. The recursive strategy tries each separator in order, using the first that produces chunks within target size.
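For example, separator hierarchies might be tailored per content type (these lists are suggestions, not node defaults):

```python
prose_separators    = ["\n\n", "\n", ". ", " "]       # paragraphs, lines, sentences, words
markdown_separators = ["\n## ", "\n\n", "\n", ". "]   # headings first, then prose units
python_separators   = ["\nclass ", "\ndef ", "\n\n"]  # keep definitions together
```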
Keep separator
Keep Separator (Toggle, Default: true): Retain separator at end of each chunk.
Keeping separators preserves document structure (paragraphs, sentences). When disabled, separators are removed, producing cleaner text but losing structural markers.
Use regex separators
Use Regex Separators (Toggle, Default: false): Treat separators as regular expressions.
Allows advanced pattern-based splitting. Keep disabled for normal use with plain text separators.
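Illustrative regex separators, only meaningful with this toggle enabled (whether a given pattern is supported depends on the node's regex engine):

```python
regex_separators = [
    r"\n{2,}",         # two or more consecutive newlines
    r"Chapter \d+",    # chapter headings such as "Chapter 12"
]
```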
Minimum chunk size
Minimum Chunk Size (Number, Optional): Minimum acceptable chunk size.
Prevents creation of tiny fragments. Chunks smaller than this may be merged with adjacent chunks or skipped. Range: 10-10,000.
Maximum chunks
Maximum Chunks (Number, Optional): Upper limit on number of chunks returned.
If content would produce more chunks, the result is truncated. Useful for limiting processing on very long documents. Range: 1-10,000.
Tokenization strategy
Tokenization Strategy (Dropdown, Default: Model-based): How tokens are interpreted for Token-based mode.
- Model-based - Use model's default tokenizer. Requires Model Name parameter.
- Encoding-based - Specify custom tiktoken encoding. Requires Tiktoken Encoding parameter.
Model name
Model Name (Text, Conditional): Model name for tokenization.
Required when Tokenization Strategy is Model-based. Examples: gpt-4, gpt-3.5-turbo, claude-3-opus, llama3.2, mistral-7b. Variable interpolation with ${model_name} is supported.
Tiktoken encoding
Tiktoken Encoding (Text, Conditional): Tiktoken encoding to use.
Required when Tokenization Strategy is Encoding-based. Common encodings: cl100k_base (GPT-4), p50k_base (GPT-3), r50k_base (GPT-2). Variable interpolation with ${encoding} is supported.
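The two strategies correspond to tiktoken's two lookup functions, assuming the node resolves encodings the way tiktoken does:

```python
import tiktoken

enc_by_model = tiktoken.encoding_for_model("gpt-4")  # Model-based resolution
enc_by_name = tiktoken.get_encoding("cl100k_base")   # Encoding-based resolution
print(enc_by_model.name == enc_by_name.name)         # True: gpt-4 uses cl100k_base
```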
Enable cross-chunking
Enable Cross-Chunking (Toggle, Default: false): Combine multiple small objects into larger chunks.
When enabled, requires Object Mapping configuration to specify how to extract text and metadata from objects.
Cross-chunking configuration
When cross-chunking is enabled, configure attribute mappings:
- Text Attribute Path: Path to text content using dot notation (e.g., text, pageContent, data.text)
- Start Metadata Path: Path to start time/number/value (e.g., start_time, metadata.start, page)
- End Metadata Path: Path to end time/number/value (e.g., end_time, metadata.end, page)
- Metadata Type: Type for future smart merging (Time for timestamps, Number for page numbers)
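A simplified sketch of the combining behavior, assuming character-based sizing and the illustrative attribute paths above (text, start_time, end_time):

```python
def cross_chunk(objects: list[dict], chunk_size: int) -> list[dict]:
    """Greedily merge consecutive objects until adding the next one
    would exceed chunk_size, extending the metadata range as we go."""
    combined: list[dict] = []
    buffer: dict | None = None
    for obj in objects:  # assumes objects are sorted chronologically
        if buffer is None:
            buffer = dict(obj)
        elif len(buffer["text"]) + 1 + len(obj["text"]) <= chunk_size:
            buffer["text"] += " " + obj["text"]
            buffer["end_time"] = obj["end_time"]  # extend the merged range
        else:
            combined.append(buffer)
            buffer = dict(obj)
    if buffer is not None:
        combined.append(buffer)
    return combined

segments = [
    {"text": "Welcome to the show.", "start_time": 0.0, "end_time": 2.5},
    {"text": "Today: text splitting.", "start_time": 2.5, "end_time": 5.1},
]
print(cross_chunk(segments, chunk_size=500))
```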
Common parameters
This node supports common parameters shared across workflow nodes, including Stream Output Response, Streaming Messages, Logging Mode, and Wait For All Edges. For detailed information, see Common Parameters.
Best practices
- Use token-based mode when processing text for LLMs to ensure chunks fit within model limits and avoid truncation
- Set overlap to 10-20% of chunk size to preserve context at boundaries, especially for RAG applications
- Adjust Token Reserve Ratio based on prompt complexity: 0.2-0.3 for simple prompts, 0.3-0.4 for complex prompts with examples
- Customize separators based on content type: paragraph and sentence separators for prose, code-specific separators for source code
- Test different chunk sizes: smaller chunks (200-500 tokens) for precise retrieval, larger chunks (1000-2000 tokens) for summarization
- For cross-chunking subtitle segments or timed data, ensure objects are sorted chronologically before processing
Limitations
- Token counting accuracy: Token-based mode requires model-specific tokenizers. If unavailable, the node falls back to character-based mode.
- Memory usage: Processing very large documents with small chunk sizes can create thousands of chunks. Use Maximum Chunks to limit output.
- Cross-chunking requirements: Requires properly structured input objects with consistent attribute paths. Missing or malformed attributes cause objects to be skipped.
- Separator handling: Regex separators must be valid regular expressions; an invalid pattern causes the node to fail.
- Overlap constraints: Overlap size cannot exceed half the chunk size. The node automatically clamps overlap to chunk_size / 2 if exceeded.