Version: V11

VLLM Embedding Node

The VLLM Embedding node provides high-performance local embedding generation using VLLM's optimized inference engine with advanced GPU acceleration. It supports tensor parallelism for distributing large models across multiple GPUs, quantization for reduced memory usage, and LoRA adapters for fine-tuned model variants. These optimizations enable production-scale deployments with high throughput and low latency.

How It Works

When the node executes, it receives text input from a workflow variable, loads the specified model from HuggingFace Hub or local cache, processes the text through VLLM's optimized inference pipeline, and returns embedding vectors as arrays of floating-point numbers. Each text input produces one embedding vector, with dimensionality determined by the model. The node initializes the VLLM engine with the specified configuration, processes texts through the model, and stores the resulting vectors in the output variable.

VLLM provides production-grade performance through continuous batching, PagedAttention memory management, and optimized CUDA kernels. The node supports advanced deployment scenarios including tensor parallelism to distribute large models across multiple GPUs, quantization to reduce memory usage, and CPU offloading for models exceeding GPU memory.
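As a rough illustration, the sketch below shows the kind of VLLM offline embedding calls the node builds on. It assumes a recent VLLM release that exposes the embed task; the node's internal wiring is not shown, and the model and parameter values are examples rather than defaults.

```python
# Minimal sketch of the VLLM offline embedding API that the node builds on
# (assuming a recent VLLM release). The node's configuration fields map
# roughly onto these engine arguments.
from vllm import LLM

llm = LLM(
    model="BAAI/bge-large-en-v1.5",  # Model
    task="embed",                    # Task Type
    dtype="auto",                    # Data Type
    gpu_memory_utilization=0.9,      # GPU Memory Utilization
    trust_remote_code=True,          # Trust Remote Code
)

outputs = llm.embed(["first document", "second document"])
vectors = [o.outputs.embedding for o in outputs]  # one float vector per input text
```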

Each output embedding is correlated with its input item through a unique identifier, so every embedding can be traced back to its source text via UUID. If embedding generation fails for an individual item, processing of the remaining items continues.

Configuration Parameters

Input Field

Input Field (Text, Required): Workflow variable containing text to embed.

The node expects a list of embedding request objects where each object contains a type field (set to "text"), an optional id field (string for tracking), and a text field (string content to embed). Single objects are automatically converted to single-item lists.
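For illustration, a variable holding two such requests could look like the following (shown as Python data; the id values are examples):

```python
# Illustrative contents of the Input Field variable: one request object per text.
embedding_requests = [
    {"type": "text", "id": "doc-1", "text": "What is tensor parallelism?"},
    {"type": "text", "id": "doc-2", "text": "PagedAttention manages GPU memory in pages."},
]
```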

Output Field

Output Field (Text, Required): Workflow variable where embedding results are stored.

The output is a list of EmbeddingResponse objects where each object contains a uuid field (string identifier matching input ID or generated UUID) and an embeddings field (array of floating-point numbers).
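For illustration, the corresponding output for the two requests above might look like this (vector values are made up and truncated):

```python
# Illustrative contents of the Output Field variable.
embedding_results = [
    {"uuid": "doc-1", "embeddings": [0.0132, -0.0457, 0.0981, ...]},  # truncated for brevity
    {"uuid": "doc-2", "embeddings": [0.0214, 0.0033, -0.0768, ...]},
]

# Each result can be traced back to its source request via the uuid field.
vectors_by_id = {r["uuid"]: r["embeddings"] for r in embedding_results}
```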

Common naming patterns: text_embeddings, document_vectors, vllm_embeddings, local_embeddings.

Model

Model (Text, Required): HuggingFace model path for embedding generation.

Examples: BAAI/bge-large-en-v1.5, intfloat/e5-large-v2. Models are automatically downloaded from HuggingFace Hub on first use and cached locally. Variable interpolation using ${variable_name} syntax is supported.

Task Type

Task Type (Dropdown, Default: embed): Embedding task type optimization.

  • embed - General-purpose embeddings
  • retrieval - Optimized for search and RAG applications
  • classification - Optimized for categorization tasks

Tensor Parallel Size

Tensor Parallel Size (Number, Optional): Number of GPUs for tensor parallelism.

Distributes a model across multiple GPUs when it does not fit in a single GPU's memory. Set to the number of available GPUs (e.g., 2, 4, 8).
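For example, a hypothetical configuration distributing one embedding model across four GPUs corresponds roughly to the following engine argument (the model name is illustrative):

```python
# Hypothetical four-GPU tensor-parallel configuration (assumes four GPUs are available).
from vllm import LLM

llm = LLM(
    model="BAAI/bge-multilingual-gemma2",  # illustrative larger embedding model
    task="embed",
    tensor_parallel_size=4,  # Tensor Parallel Size = number of available GPUs
)
```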

Pipeline Parallel Size

Pipeline Parallel Size (Number, Optional): Number of pipeline parallel stages.

Enables pipeline parallelism for additional model distribution across GPUs.

Data Type

Data Type (Dropdown, Optional): Model weight precision.

  • auto - Automatic selection based on model
  • half / float16 - 16-bit floating point (faster, less memory)
  • bfloat16 - Brain float 16 (better numerical stability)
  • float32 - 32-bit floating point (slower, more accurate)

Quantization

Quantization (Dropdown, Optional): Quantization method to reduce memory usage.

  • awq - 4-bit quantization (requires AWQ-quantized model)
  • gptq - 4-bit quantization (requires GPTQ-quantized model)
  • fp8 - 8-bit floating point quantization
  • squeezellm / marlin / gptq_marlin - Other quantization methods

Requires a compatible pre-quantized model. Quantization significantly reduces memory usage with minimal quality loss.
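As a sketch, loading a pre-quantized AWQ checkpoint corresponds roughly to the following; the model path is a placeholder for an actual AWQ-quantized checkpoint.

```python
# Hypothetical configuration for a pre-quantized AWQ checkpoint. Standard
# (unquantized) models cannot be quantized on the fly.
from vllm import LLM

llm = LLM(
    model="your-org/your-embedding-model-awq",  # placeholder: must be an AWQ checkpoint
    task="embed",
    quantization="awq",  # Quantization
    dtype="half",        # Data Type
)
```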

GPU Memory Utilization

GPU Memory Utilization (Number, Default: 0.9): Fraction of GPU memory to use (0.0-1.0).

Default 0.9 reserves 10% for other processes. Lower values (0.7-0.8) leave more memory for concurrent workloads.

Max Model Length

Max Model Length (Number, Optional): Maximum model context length in tokens.

Limits the maximum input sequence length. Minimum value is 1.

Max Number of Sequences

Max Number of Sequences (Number, Optional): Maximum sequences per iteration.

Controls batch processing capacity. Higher values increase throughput but require more memory. Minimum value is 1.
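For example, a hypothetical memory-constrained configuration that shares a GPU with other workloads might combine these settings as follows:

```python
# Hypothetical memory-tuning configuration: cap GPU usage at 75%, limit the
# context length, and bound the number of sequences processed per iteration.
from vllm import LLM

llm = LLM(
    model="intfloat/e5-large-v2",
    task="embed",
    gpu_memory_utilization=0.75,  # GPU Memory Utilization
    max_model_len=512,            # Max Model Length (tokens)
    max_num_seqs=64,              # Max Number of Sequences
)
```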

Tokenizer

Tokenizer (Text, Optional): Tokenizer name or path.

Defaults to the model's tokenizer if not specified. Use when a different tokenizer is required.

Tokenizer Mode

Tokenizer Mode (Dropdown, Optional): Tokenizer loading mode.

Controls how the tokenizer is loaded and initialized.

Trust Remote Code

Trust Remote Code (Toggle, Default: true): Allow execution of custom code from HuggingFace.

Required for some models but has security implications. Enable it only when you trust the model source.

Model Revision

Model Revision (Text, Optional): Model revision (branch/tag/commit).

Pins the model to a specific version (branch, tag, or commit) on HuggingFace Hub.

Tokenizer Revision

Tokenizer Revision (Text, Optional): Tokenizer revision (branch/tag/commit).

Pins the tokenizer to a specific version (branch, tag, or commit) on HuggingFace Hub.

Random Seed

Random Seed (Number, Optional): Random seed for reproducibility.

Ensures consistent results across runs. Minimum value is 0.

Swap Space

Swap Space (Number, Optional): CPU swap space in gigabytes.

Enables CPU memory swapping for models exceeding GPU memory.

CPU Offload

CPU Offload (Number, Optional): CPU memory for offloading in gigabytes.

Offloads model layers to CPU memory when GPU memory is insufficient.
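As a sketch, a model that does not fit entirely in GPU memory might combine swap space and CPU offloading roughly as follows (sizes and model name are illustrative):

```python
# Hypothetical configuration combining CPU swap space and weight offloading.
from vllm import LLM

llm = LLM(
    model="BAAI/bge-multilingual-gemma2",  # illustrative larger embedding model
    task="embed",
    swap_space=8,       # Swap Space (GB)
    cpu_offload_gb=16,  # CPU Offload (GB)
)
```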

Number of GPUs

Number of GPUs (Number, Optional): Number of GPUs to use.

Limits GPU usage when multiple GPUs are available.

Download Directory

Download Directory (Text, Optional): Directory to download and cache models.

Controls where model files are stored. Use fast storage (SSD) for better loading performance.

Load Format

Load Format (Dropdown, Optional): Format for loading model weights.

Specifies how model weights are loaded from disk.

Enforce Eager Execution

Enforce Eager Execution (Toggle, Optional): Enforce eager execution mode.

Disables CUDA graph optimization for debugging or compatibility.

Disable Custom All-Reduce

Disable Custom All-Reduce (Toggle, Optional): Disable custom all-reduce kernel.

Uses standard all-reduce operations instead of optimized kernels.

Enable Prefix Caching

Enable Prefix Caching (Toggle, Optional): Enable automatic prefix caching.

Caches common prompt prefixes for better performance with repeated patterns. Improves throughput when processing similar texts.

Disable Sliding Window

Disable Sliding Window (Toggle, Optional): Disable sliding window attention.

Disables the sliding window attention mechanism for models that use it.

Enable LoRA

Enable LoRA (Toggle, Optional): Enable LoRA adapter support.

Allows using fine-tuned model variants through LoRA adapters without loading separate full models.

Max LoRA Adapters

Max LoRA Adapters (Number, Optional): Maximum number of LoRA adapters to load.

Limits concurrent LoRA adapters when Enable LoRA is active.

Max LoRA Rank

Max LoRA Rank (Number, Optional): Maximum LoRA adapter rank.

Limits the rank of LoRA adapters that can be loaded.
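As a sketch, and assuming the installed VLLM build supports LoRA for the embedding task, the LoRA settings correspond roughly to the following engine arguments; the adapter name, ID, and path are placeholders.

```python
# Hypothetical LoRA-enabled embedding configuration. Adapter name, ID, and
# path are placeholders; LoRA support for embedding models depends on the
# installed VLLM version.
from vllm import LLM
from vllm.lora.request import LoRARequest

llm = LLM(
    model="BAAI/bge-large-en-v1.5",
    task="embed",
    enable_lora=True,   # Enable LoRA
    max_loras=2,        # Max LoRA Adapters
    max_lora_rank=16,   # Max LoRA Rank
)

# A fine-tuned variant would then be applied per request via a LoRA adapter.
outputs = llm.embed(
    ["domain-specific query"],
    lora_request=LoRARequest("domain-adapter", 1, "/path/to/adapter"),
)
```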

Common Parameters

This node supports common parameters shared across workflow nodes, including Stream Output Response, Streaming Messages, and Logging Mode. For detailed information, see Common Parameters.

Best Practices

  • Start with default settings (Model only) and add optimizations based on performance requirements
  • Use Tensor Parallel Size for models exceeding single GPU memory—set to the number of available GPUs
  • Enable Quantization (AWQ or GPTQ) for large models using pre-quantized models from HuggingFace
  • Set GPU Memory Utilization to 0.7-0.8 when running multiple models or workflows concurrently
  • Enable Prefix Caching for workflows processing similar texts repeatedly
  • Use Data Type float16 or bfloat16 for production deployments to balance speed and accuracy
  • Configure Download Directory to use fast storage (SSD) for better model loading performance
  • Monitor GPU memory usage and adjust Max Number of Sequences to optimize throughput

Limitations

  • Local execution only: Models run on the server hosting the workflow engine. Ensure sufficient GPU resources and memory.
  • First-run download delay: The first execution with a new model downloads it from HuggingFace Hub, which may take several minutes to hours depending on model size.
  • Text-only support: The node only supports text embeddings. Image embedding requests fail.
  • GPU memory requirements: Large models and high Max Number of Sequences can exceed GPU memory limits, causing out-of-memory errors.
  • Quantization compatibility: Quantization methods require pre-quantized models. Standard models cannot be quantized on-the-fly.
  • Tensor parallelism overhead: Multi-GPU distribution adds communication overhead and is only beneficial for models that don't fit on a single GPU.