Version: V11

VLLM API Embedding Node

The VLLM API Embedding node connects to a self-hosted VLLM server through an OpenAI-compatible API for high-performance vector generation. It supports dimension reduction on compatible models, automatic retry logic for reliability, and batch processing for throughput. Because the API is OpenAI-compatible, you can migrate between VLLM and OpenAI services by changing only the endpoint URL.

How It Works

When the node executes, it receives text input from a workflow variable, sends the text to the VLLM server via HTTP requests using OpenAI's API format, and returns embedding vectors as arrays of floating-point numbers. Each text input produces one embedding vector, with dimensionality determined by the model loaded on the VLLM server. The node constructs API requests with the specified model identifier, sends batched requests to the VLLM endpoint, and stores the resulting vectors in the output variable.
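For illustration, the sketch below approximates the kind of request the node issues per batch; the server URL, model name, and authentication header are assumptions, and the node's actual payload construction may differ.

import requests

# Minimal sketch of an OpenAI-compatible embeddings request, roughly what the node
# sends per batch. The URL, model, and auth header here are assumptions.
VLLM_BASE_URL = "http://localhost:8000/v1"  # hypothetical server endpoint
MODEL = "BAAI/bge-large-en-v1.5"            # must already be loaded on the server

payload = {
    "model": MODEL,
    "input": ["First document content", "Second document content"],  # one batch of texts
}
headers = {"Authorization": "Bearer EMPTY"}  # omit or adjust if the server needs no key

resp = requests.post(f"{VLLM_BASE_URL}/embeddings", json=payload, headers=headers, timeout=60)
resp.raise_for_status()

# OpenAI-format response: one embedding per input, in input order.
vectors = [item["embedding"] for item in resp.json()["data"]]
print(len(vectors), len(vectors[0]))  # e.g. 2 vectors at the model's default dimensionality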

VLLM provides high-performance inference through continuous batching and efficient GPU utilization, making it ideal for production deployments that require high throughput and low latency. The node implements automatic retry logic for failed requests and batch processing that groups multiple texts into single API calls, reducing network overhead.

Output embeddings maintain correlation with input items through unique identifiers, with each embedding traced back to its source text via UUID. The node supports optional dimension reduction for models that implement this feature. Failed embedding generation for individual items does not stop processing of other items.

Configuration Parameters

Input Field

Input Field (Text, Required): Workflow variable containing text to embed.

The node expects a list of embedding request objects where each object contains a type field (set to "text"), an optional id field (string for tracking), and a text field (string content to embed). Single objects are automatically converted to single-item lists.

Example input structure:

[
  {"type": "text", "id": "doc1", "text": "First document content"},
  {"type": "text", "id": "doc2", "text": "Second document content"}
]

Output Field

Output Field (Text, Required): Workflow variable where embedding results are stored.

The output is a list of EmbeddingResponse objects where each object contains a uuid field (string identifier matching input ID or generated UUID) and an embeddings field (array of floating-point numbers). The list maintains the same order as the input. Empty embeddings are returned for failed generation attempts.

Example output structure:

[
  {"uuid": "doc1", "embeddings": [0.123, -0.456, 0.789, ...]},
  {"uuid": "doc2", "embeddings": [0.234, -0.567, 0.890, ...]}
]

Common naming patterns: text_embeddings, document_vectors, vllm_embeddings, server_embeddings.
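As a downstream usage sketch (not part of the node itself), the snippet below indexes the output list by uuid and compares two vectors with cosine similarity; the variable name vllm_embeddings and the short vectors are assumptions.

import math

# Hypothetical downstream step: index the node's output by uuid and compare vectors.
vllm_embeddings = [
    {"uuid": "doc1", "embeddings": [0.123, -0.456, 0.789]},
    {"uuid": "doc2", "embeddings": [0.234, -0.567, 0.890]},
]

by_uuid = {item["uuid"]: item["embeddings"] for item in vllm_embeddings if item["embeddings"]}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine(by_uuid["doc1"], by_uuid["doc2"]))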

Use Self-Hosted Model

Use Self-Hosted Model (Toggle, Default: false): Enable custom VLLM server connection.

When enabled, Model and VLLM API Base URL parameters must be provided. When disabled, the node uses the default embedding configuration from the AI service. Enable for complete control over models and deployment when connecting to your own VLLM infrastructure.

Model

Model (Text, Required when self-hosted): Model identifier hosted on the VLLM server.

Examples: BAAI/bge-large-en-v1.5, intfloat/e5-large-v2. The model must be loaded on the VLLM server before use. Variable interpolation using ${variable_name} syntax is supported.

VLLM API Base URL

VLLM API Base URL (Text, Required when self-hosted): VLLM server endpoint with OpenAI-compatible API.

The URL typically ends with /v1 (e.g., http://your-server:8000/v1, http://localhost:8000/v1). The server must be running and accessible before workflow execution. Variable interpolation is supported.
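A pre-flight check like the one below can confirm the server is reachable and the expected model is loaded before the workflow runs; the URL and model name are placeholders.

import requests

# Query the OpenAI-compatible /models endpoint to verify the server and model.
base_url = "http://localhost:8000/v1"        # placeholder endpoint
expected_model = "BAAI/bge-large-en-v1.5"    # placeholder model identifier

resp = requests.get(f"{base_url}/models", timeout=10)
resp.raise_for_status()
served = [m["id"] for m in resp.json()["data"]]
if expected_model not in served:
    raise RuntimeError(f"{expected_model} is not loaded; server reports: {served}")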

API Key

API Key (Text, Optional): API key for VLLM server authentication.

Leave empty if no authentication is needed. Variable interpolation with ${variable_name} syntax enables secure credential management.

Dimensions

Dimensions (Number, Optional): Number of dimensions for output embeddings.

Leave empty to use the model's default dimensions. Specifying a smaller value reduces storage requirements while maintaining acceptable quality, but only works with models that implement dimension reduction. Minimum value is 1.
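The sketch below shows how a reduced dimensionality could be requested through the OpenAI-style dimensions field and verified in the response; the URL and model name are placeholders, and the model must actually support dimension reduction or the value may be ignored or rejected.

import requests

# Request 256-dimensional vectors from a model that supports dimension reduction.
payload = {
    "model": "your-embedding-model",  # placeholder; must implement dimension reduction
    "input": ["Short example text"],
    "dimensions": 256,
}
resp = requests.post("http://localhost:8000/v1/embeddings", json=payload, timeout=30)
resp.raise_for_status()
vector = resp.json()["data"][0]["embedding"]
print(len(vector))  # expected 256 when the model honors the request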

Embedding Context Length

Embedding Context Length (Number, Optional): Maximum tokens the model can process.

Default is 8191. Texts exceeding this length are truncated or rejected depending on the Check Embedding Context Length setting.

Chunk Size

Chunk Size (Number, Optional): Number of texts per API request.

Higher values improve throughput but increase memory usage. Lower values reduce memory usage and allow finer-grained error handling. Minimum value is 1.
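The chunking idea can be illustrated as follows; this is not the node's internal code, only a sketch of how texts are grouped into per-request batches.

# Split the input list into groups of chunk_size texts, one API request per group.
def chunked(items, chunk_size):
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]

texts = [f"document {i}" for i in range(1234)]
requests_needed = sum(1 for _ in chunked(texts, chunk_size=100))
print(requests_needed)  # 13 requests for 1,234 texts at a chunk size of 100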

Max Retries

Max Retries (Number, Optional): Maximum retry attempts for failed API requests.

The node automatically retries with exponential backoff before giving up. Higher values improve reliability for transient issues. Minimum value is 0 (no retries).
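The retry behavior can be approximated by a loop like the following; the node's actual delay schedule and retryable error set may differ.

import time
import requests

# Retry a POST with exponential backoff: wait 1s, 2s, 4s, ... between attempts.
def post_with_retries(url, payload, max_retries=3, timeout=60):
    for attempt in range(max_retries + 1):
        try:
            resp = requests.post(url, json=payload, timeout=timeout)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            if attempt == max_retries:
                raise  # retries exhausted; surface the error to the workflow
            time.sleep(2 ** attempt)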

Request Timeout

Request Timeout (Number, Optional): Maximum seconds to wait for API response.

Increase for large batches or slow networks. Prevents workflows from hanging on unresponsive API calls. Minimum value is 1 second.

Show Progress Bar

Show Progress Bar (Toggle, Optional): Display progress indicator during batch operations.

Shows progress information including texts processed and estimated time remaining in execution logs.

Skip Empty Strings

Skip Empty Strings (Toggle, Optional): Automatically skip empty text strings.

When enabled, empty strings are filtered out before being sent to the API, avoiding API errors; they still receive empty embeddings in the output so index alignment with the input is preserved.

Check Embedding Context Length

Check Embedding Context Length (Toggle, Optional): Validate text length before sending to API.

When enabled, checks if text exceeds the Embedding Context Length parameter and prevents sending oversized texts. Helps catch errors early with clear error messages.
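As a rough stand-in for this check (the node presumably counts real tokens), the sketch below uses an approximate four-characters-per-token heuristic to reject oversized texts before they are sent.

# Approximate pre-flight length check; a real tokenizer would give exact counts.
EMBEDDING_CONTEXT_LENGTH = 8191

def check_length(text, limit=EMBEDDING_CONTEXT_LENGTH):
    approx_tokens = len(text) // 4  # rough heuristic, not the node's actual tokenizer
    if approx_tokens > limit:
        raise ValueError(
            f"Text of ~{approx_tokens} tokens exceeds the context length of {limit}; "
            "shorten or chunk it before embedding."
        )

check_length("A short text passes the check without error.")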

Common Parameters

This node supports common parameters shared across workflow nodes, including Stream Output Response, Streaming Messages, and Logging Mode. For detailed information, see Common Parameters.

Best Practices

  • Deploy VLLM server on GPU-equipped infrastructure for optimal performance
  • Enable Use Self-Hosted Model for your own VLLM infrastructure; disable for centralized default configuration
  • Configure Chunk Size based on text length: larger chunks (100-500) for short texts, smaller chunks (10-50) for long texts
  • Enable Max Retries (3-5) for production workflows to handle transient network issues
  • Store VLLM API Base URL in workflow variables for easy switching between development, staging, and production servers
  • Monitor VLLM server resource usage (GPU memory, CPU, network) to identify bottlenecks
  • Use the same model for both document and query embeddings in search systems to ensure vector compatibility

Limitations

  • External server dependency: The node requires a running VLLM server. The workflow fails if the server is unreachable or not responding.
  • Model pre-loading required: Models must be loaded on the VLLM server before use. The node does not load or manage models.
  • Text-only support: The node only supports text embeddings. Image embedding requests fail even though the node accepts multimodal input format.
  • OpenAI API compatibility required: The VLLM server must be configured with OpenAI-compatible API endpoint. Other API formats are not supported.
  • Network latency: Embedding performance depends on network latency between the workflow engine and VLLM server. Co-locate them when possible.
  • Dimension reduction support: The Dimensions parameter only works with models that implement dimension reduction.