Version: V11

PDF Loader Node

The PDF Loader node retrieves PDF documents from URLs and extracts their text content into structured document objects. It supports password-protected PDFs, configurable table extraction formats, and page-level or document-level output modes. Page range filtering allows processing specific sections of large documents.

How It Works

When the node executes, it downloads the PDF file, extracts text from each page, and creates document objects containing the content along with metadata like page numbers and source information. The extracted content preserves document structure and can optionally include tables in various formats for downstream processing.

The node offers two output modes. Page mode creates one document object per page, which suits page-specific processing and parallel workflows and keeps each object small. Single mode merges all pages into one document object, which suits full-document analysis and preserves cross-page context, at the cost of holding the entire text in one larger object.

Table extraction can preserve tabular data in a format suited to the downstream step. When disabled, tables are extracted as plain text, which is fastest but loses the table structure. Markdown format keeps the text readable while preserving table structure and is recommended for language model processing. HTML format keeps detailed structure, including styling information. CSV format converts tables to comma-separated values for data processing applications.
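To make the difference between the four formats concrete, the sketch below renders the same small table in each of them. It is an illustration only, not the node's actual extraction code: the function name render_table, the format keys, and the sample rows are assumptions made for this example.

```python
import csv
import html
import io


def render_table(rows: list[list[str]], fmt: str) -> str:
    """Render rows (first row is the header) in one of the four formats.

    Hypothetical helper for illustration; the node's real output may differ
    in whitespace and markup details.
    """
    header, *body = rows
    if fmt == "none":
        # Plain text: cell values survive, column boundaries do not.
        return "\n".join(" ".join(row) for row in rows)
    if fmt == "markdown":
        lines = [
            "| " + " | ".join(header) + " |",
            "| " + " | ".join("---" for _ in header) + " |",
        ]
        lines += ["| " + " | ".join(row) + " |" for row in body]
        return "\n".join(lines)
    if fmt == "html":
        head = "".join(f"<th>{html.escape(cell)}</th>" for cell in header)
        body_rows = "".join(
            "<tr>" + "".join(f"<td>{html.escape(cell)}</td>" for cell in row) + "</tr>"
            for row in body
        )
        return f"<table><tr>{head}</tr>{body_rows}</table>"
    if fmt == "csv":
        buffer = io.StringIO()
        csv.writer(buffer, lineterminator="\n").writerows(rows)
        return buffer.getvalue()
    raise ValueError(f"unknown table format: {fmt!r}")


sample = [["Item", "Qty"], ["Bolt", "4"]]
for name in ("none", "markdown", "html", "csv"):
    print(f"--- {name} ---")
    print(render_table(sample, name))
```

Comparing the four outputs side by side shows why Markdown is a reasonable default for language model input: it keeps the column boundaries that plain text drops, with less markup than HTML.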

Configuration Parameters

Input Field

Input Field (Text, Required): Workflow variable containing the PDF URL.

The URL must start with http:// or https:// and point to a valid PDF file with a .pdf extension. Variable interpolation using the ${variable_name} syntax supports dynamic URL construction.

Common patterns: https://storage.example.com/document.pdf, ${document_url}, https://api.example.com/files/${file_id}.pdf.
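The interpolation and URL rules above can be approximated outside the workflow, for example to check a URL before running the node. The sketch below is a minimal stand-in, not the node's implementation: resolve_pdf_url is a hypothetical helper, Python's string.Template is used only because it happens to share the ${name} syntax, and checking the extension on the URL path (case-insensitively) is an assumption about how the node validates it.

```python
from string import Template
from urllib.parse import urlparse


def resolve_pdf_url(template: str, variables: dict[str, str]) -> str:
    """Substitute ${name} placeholders, then apply the documented URL checks."""
    # Template.substitute raises KeyError if a referenced variable is missing.
    url = Template(template).substitute(variables)
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"URL must start with http:// or https://: {url}")
    # Assumption: the extension is checked on the path, ignoring any query string.
    if not parsed.path.lower().endswith(".pdf"):
        raise ValueError(f"URL must point to a .pdf file: {url}")
    return url


print(resolve_pdf_url("https://api.example.com/files/${file_id}.pdf", {"file_id": "42"}))
```

A check like this before the node runs surfaces a missing variable or a wrong extension early, rather than as a failed download.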

Output Field

Output Field (Text, Required): Workflow variable where extracted document objects are stored.

The output is an array of document objects. In Page mode, the array contains one object per page. In Single mode, the array contains one object with all pages merged. Each document object includes page_content (extracted text), metadata.page (page number, 0-indexed), and metadata.source (original URL). Note that metadata.page is 0-indexed while the Page Range parameter is 1-based, so page 1 in a page range corresponds to metadata.page 0.

Common naming patterns: pdf_documents, extracted_pages, document_content, pdf_data.
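The shape of the output array in each mode can be sketched as follows. This is an illustration built from the field names documented above, not the node's code: build_documents is a hypothetical helper, and both the blank-line separator between merged pages and the omission of metadata.page in Single mode are assumptions, since neither is specified here.

```python
def build_documents(page_texts: list[str], source: str, mode: str = "page") -> list[dict]:
    """Assemble document objects from already-extracted page texts."""
    if mode == "page":
        # One object per page; metadata.page is 0-indexed.
        return [
            {"page_content": text, "metadata": {"page": index, "source": source}}
            for index, text in enumerate(page_texts)
        ]
    if mode == "single":
        # Assumption: pages are joined with a blank line and no page number is kept.
        merged = "\n\n".join(page_texts)
        return [{"page_content": merged, "metadata": {"source": source}}]
    raise ValueError(f"unknown output mode: {mode!r}")


pages = ["Introduction text", "Results text"]
url = "https://storage.example.com/document.pdf"

per_page = build_documents(pages, url, mode="page")
merged = build_documents(pages, url, mode="single")
print(len(per_page), per_page[1]["metadata"]["page"])
print(len(merged), repr(merged[0]["page_content"]))
```

Downstream nodes that iterate over the array work unchanged in both modes; only the number of elements and the size of each page_content differ.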

Page Range

Page Range (Text, Optional): Pages to extract using formats like 1-5, 1,3,5, or 1-5,10,15-20.

Uses 1-based indexing where page 1 is the first page. Leave empty to process all pages. Variable interpolation with ${page_range} supports dynamic page selection. Page ranges reduce processing time and memory usage for large documents.
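To show how a range string maps to pages, the sketch below parses the documented formats into 0-based page indices. It is one plausible interpretation, not the node's parser: parse_page_range is a hypothetical name, and silently dropping pages beyond the end of the document mirrors the lenient behavior described under Limitations rather than a confirmed implementation detail.

```python
def parse_page_range(spec: str, total_pages: int) -> list[int]:
    """Convert a 1-based range such as "1-5,10,15-20" into sorted 0-based indices."""
    if not spec.strip():
        return list(range(total_pages))  # empty range selects every page
    selected: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range segment: {part!r}")
        # Assumption: pages past the end of the document are skipped, not rejected.
        for page in range(start, min(end, total_pages) + 1):
            selected.add(page - 1)
    return sorted(selected)


print(parse_page_range("1-5,10,15-20", total_pages=12))
print(parse_page_range("1,3,5", total_pages=10))
```

With a 12-page document, the segment 15-20 in the first call selects nothing, which illustrates how an over-long range can quietly return fewer pages than requested.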

Password

Password (Text, Optional): Password for encrypted PDFs.

Leave empty if the PDF is not encrypted. Variable interpolation with ${variable_name} supports dynamic password resolution. Only password-based encryption is supported; certificate-based encryption requires preprocessing outside the workflow.

Table Extraction Format

Table Extraction Format (Dropdown, Default: None): Format for extracting tables from the PDF.

| Format | Output | Best for | Performance |
|---|---|---|---|
| None (text only) | Plain text without table structure | When tables aren't needed | Fastest |
| Markdown | Tables as Markdown syntax | LLM processing and analysis | Fast |
| HTML | Tables with HTML tags | Web display, complex table structures | Moderate |
| CSV | Tables as comma-separated values | Data processing, spreadsheets | Moderate |

Output Mode

Output Mode (Dropdown, Default: Page): How to structure the extracted content.

| Mode | Output structure | Use when |
|---|---|---|
| Page | One document per page (10 pages = 10 objects) | Processing pages individually, parallel workflows, page-specific analysis |
| Single | All pages merged into one document (10 pages = 1 object) | Full-document LLM analysis, maintaining cross-page context |

Common Parameters

This node supports common parameters shared across workflow nodes, including Stream Output Response and Logging Mode. For detailed information, see Common Parameters.

Best Practices

  • Use Page mode when analyzing individual pages or running parallel processing; use Single mode when a language model needs full-document context
  • Specify page ranges to extract only needed sections, reducing processing time, memory usage, and token consumption in downstream LLM nodes
  • Select table extraction format based on downstream needs: None for fastest performance, Markdown for LLM processing, HTML for web display, CSV for data analysis
  • Variable interpolation for passwords improves security and reusability over hardcoded values
  • Test extraction quality with small page ranges before processing entire documents to verify text and table extraction meets requirements

Limitations

  • URL-Only Support: The node only supports loading PDFs from URLs (HTTP/HTTPS). Local file paths are not supported.
  • No Authentication Headers: Custom HTTP headers for authentication are not supported. Credentials must be included in the URL as query parameters, or the endpoint must be publicly accessible.
  • Download Timeout Range: Download timeout is configurable between 10 seconds and 5 minutes (10,000-300,000ms). Very large PDF files or slow connections may exceed the maximum timeout.
  • Password-Based Encryption Only: Only password-protected PDFs are supported. Certificate-based encryption or other advanced PDF security features require preprocessing outside the workflow.
  • PDF Extension Required: Files must have a .pdf extension. The node validates the file extension and rejects files without it.
  • Page Range Validation: Page ranges that extend beyond the document length do not cause an error; the node loads the pages that exist, so the output may contain fewer pages than requested.
  • Memory Usage: Large PDFs loaded in Single mode can consume significant memory. Page mode is recommended for documents over 100 pages.
  • Table Extraction Accuracy: Table extraction quality depends on PDF structure. Scanned PDFs or complex layouts may not extract tables accurately.
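Because custom authentication headers are not supported, a protected endpoint has to receive its credentials in the query string. The sketch below shows one way to append such parameters safely while keeping the .pdf extension on the path. The helper name with_query_credentials and the parameter name token are hypothetical; the actual parameter depends on the storage service being used.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_query_credentials(url: str, params: dict[str, str]) -> str:
    """Return url with params merged into its query string, URL-encoded."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query.update(params)  # new values override existing keys of the same name
    return urlunsplit(parts._replace(query=urlencode(query)))


# "token" is a placeholder parameter name used only for this example.
print(with_query_credentials("https://storage.example.com/document.pdf", {"token": "abc 123"}))
```

Credentials placed in a URL can end up in logs and workflow history, so passing the value through a workflow variable rather than hardcoding it is the safer choice, in line with the password guidance under Best Practices.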