Use config.documents to decide whether the pipeline should re-run files it has already processed and whether to extract granular elements in addition to the default Markdown output.
By default, reprocess_documents is true and extract_elements is false. Send only the fields you want to override.
import type { JobInput } from "@trelent/data-ingestion";

const job: JobInput = {
  connector: {
    type: "url",
    urls: ["https://signed.example.com/contract.pdf"],
  },
  output: { type: "s3-signed-url" },
  config: {
    documents: {
      reprocess_documents: false,
      extract_elements: true,
    },
  },
};

Field reference

config.documents.reprocess_documents
boolean
default: true
Re-process pages of documents that were determined to be of lower quality. Set to false for faster incremental ingestion when your connector identifiers are stable.
config.documents.extract_elements
boolean
default: false
Emit structured PDF element metadata alongside the Markdown output. Turn this on when downstream consumers need a more granular breakdown of each document.
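
To make the override behavior concrete, here is a minimal sketch of how the two defaults combine with a partial config. It is a local model for illustration only: DocumentsConfig, DOCUMENT_DEFAULTS, and resolveDocumentsConfig are hypothetical names, not exports of @trelent/data-ingestion, and the service applies its own defaults regardless of this helper.

```typescript
// Hypothetical local model of config.documents (not imported from the SDK).
interface DocumentsConfig {
  reprocess_documents: boolean;
  extract_elements: boolean;
}

// Defaults as stated in the field reference above.
const DOCUMENT_DEFAULTS: DocumentsConfig = {
  reprocess_documents: true,
  extract_elements: false,
};

// Fill in any field the caller left out with its documented default.
function resolveDocumentsConfig(
  overrides: Partial<DocumentsConfig> = {},
): DocumentsConfig {
  return {
    reprocess_documents:
      overrides.reprocess_documents ?? DOCUMENT_DEFAULTS.reprocess_documents,
    extract_elements:
      overrides.extract_elements ?? DOCUMENT_DEFAULTS.extract_elements,
  };
}

// Override a single field; the other keeps its default.
const resolved = resolveDocumentsConfig({ extract_elements: true });
console.log(resolved);
```

Here only extract_elements is overridden, so reprocess_documents keeps its default of true. Calling the helper with no arguments returns both defaults unchanged.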