PDF Data Extraction


Upload PDF, Word, Excel or email files and receive structured data as Excel, CSV, JSON or UBL. Process up to 10 files at once. Six extraction types: invoices, floor plans, forms, tables, emails and legal documents.

invoice-2025-042.pdf
Vendor Acme Corp B.V.
Invoice # INV-2025-0042
Date 2025-01-15
Total € 1.512,50
VAT € 262,50
Line items 3 items
6 fields extracted successfully

What do you need to extract?

Choose the type of data you want to pull from your PDFs.

Floor Plan

Extract room names, dimensions, and areas from architectural drawings and blueprints.

Room names
Dimensions
Floor area (m²)
AI Excel CSV JSON

Invoice

Extract line items, totals, tax amounts, and vendor information from invoices.

Line items
VAT breakdown
IBAN / vendor
UBL API

Form

Extract field names, values, checkboxes, and sections from filled-in forms.

Text fields
Checkboxes
Sections & groups
AI Excel CSV JSON

Email

Extract sender, recipients, subject, and body from archived emails.

Sender & recipients
Subject & date
Body & attachments
Direct Parse Excel CSV JSON

Table

Extract tabular data with automatic header detection and data typing.

Headers
Rows & columns
Data types
AI CSV

Legal Documents

Analyse contracts and agreements: parties, clauses, key terms, and risk flags.

Contract type
Parties & clauses
Risk flags
AI Excel CSV JSON

Custom

Define your own extraction rules with AI-powered prompts.

Your own fields
AI-powered prompts
Any document
AI JSON
Coming soon

How it works

Three steps from document to structured data.

1

Upload your files

Upload PDF, Word, Excel or email files. Up to 10 files at once, maximum 50 MB per file.

2

Automatic analysis

Each document is analysed and the relevant fields are detected. Email files are parsed directly; all other document types are processed visually.

3

Download structured data

Export as Excel, CSV, JSON or UBL. Each extraction type returns the fields specific to that kind of document.

api-example.sh
# Extract invoice data from PDF
curl -X POST https://api.pdfen.com/v2/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@invoice.pdf" \
  -F "type=invoice" \
  -F "format=json"

# Response
{
  "vendor": "Acme Corp B.V.",
  "invoice_number": "INV-2025-0042",
  "total": 1512.50,
  "vat": 262.50,
  "line_items": [...]
}
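
One way to sanity-check a response like the one above is to derive the net amount and the implied VAT rate from `total` and `vat`. The sketch below assumes that `total` includes VAT, which the sample does not state. The function name `implied_vat_rate` is illustrative and not part of the API.

```python
from decimal import Decimal

def implied_vat_rate(total: str, vat: str) -> tuple[Decimal, Decimal]:
    """Return (net amount, VAT rate), assuming `total` is VAT-inclusive."""
    total_d = Decimal(total)
    vat_d = Decimal(vat)
    net = total_d - vat_d
    if net <= 0:
        raise ValueError("net amount must be positive")
    return net, vat_d / net

# Values taken from the sample response above.
net, rate = implied_vat_rate("1512.50", "262.50")
print(net, rate)  # 1250.00 0.21
```

Passing the amounts as strings keeps the arithmetic exact. Converting JSON floats directly to `Decimal` can introduce binary rounding noise.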

Automate with our API

Integrate data extraction into your own application. Submit files via the REST API and receive structured data back as JSON.

  • Batch processing

    Process multiple files per API call. Supports PDF, Word, Excel and email.

  • Webhooks

    Receive a callback when extraction is complete, so no polling is needed.

  • Multiple formats

    Get results as JSON, CSV, Excel or UBL (for invoices).

Output formats

Export your extracted data in the format that fits your workflow.

.xlsx

Excel

Spreadsheet-ready with formatting and multiple sheets.

.csv

CSV

Universal format for databases and data tools.

.json

JSON

Structured data for APIs and applications.

.xml

XML

Enterprise format for system integrations.

UBL

UBL

European e-invoicing standard (EN 16931).

Frequently asked questions

What file types can I upload?

You can upload PDF, Word (.doc, .docx), Excel (.xls, .xlsx) and email files (.eml, .msg). Word and Excel files are automatically converted for processing; email files are parsed directly without conversion. You can upload up to 10 files at once, with a maximum of 50 MB per file.
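
The limits above can be checked on the client before uploading. This is a minimal sketch, not the service's own validation. The name `check_batch` is hypothetical, and it assumes that 50 MB means 50 × 1024 × 1024 bytes, which the page does not specify.

```python
from pathlib import Path

# Limits and extensions as stated in the answer above.
MAX_FILES = 10
MAX_BYTES = 50 * 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes
ALLOWED = {".pdf", ".doc", ".docx", ".xls", ".xlsx", ".eml", ".msg"}

def check_batch(files: list[tuple[str, int]]) -> list[str]:
    """Return a list of problems for (filename, size_in_bytes) pairs."""
    problems = []
    if len(files) > MAX_FILES:
        problems.append(f"too many files: {len(files)} > {MAX_FILES}")
    for name, size in files:
        ext = Path(name).suffix.lower()
        if ext not in ALLOWED:
            problems.append(f"{name}: unsupported file type {ext or '(none)'}")
        if size > MAX_BYTES:
            problems.append(f"{name}: larger than 50 MB")
    return problems

print(check_batch([("invoice.pdf", 120_000), ("scan.tiff", 5_000)]))
```

An empty list means the batch passes these local checks; the service may still apply checks of its own.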

How accurate are the results?

It depends on the extraction type. Email extraction is 100% accurate because files are parsed directly. PDF forms with built-in fields (AcroForm) are also 100% accurate. For visual extraction (invoices, tables, legal), each extracted field includes a confidence score (high/medium/low) so you can see where manual review may be useful.
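
Confidence scores can be used to route fields to manual review. The page does not show how the score appears in the output, so the field layout in this sketch is hypothetical, and so are the confidence values and the name `needs_review`.

```python
# Hypothetical layout: each field carries a value and a confidence label.
# The confidence values here are made up for illustration.
extracted = {
    "vendor": {"value": "Acme Corp B.V.", "confidence": "high"},
    "invoice_number": {"value": "INV-2025-0042", "confidence": "medium"},
    "total": {"value": 1512.50, "confidence": "low"},
}

def needs_review(fields: dict, accept: frozenset = frozenset({"high"})) -> list[str]:
    """Return the names of fields whose confidence is not in `accept`."""
    return [name for name, f in fields.items() if f.get("confidence") not in accept]

print(needs_review(extracted))  # ['invoice_number', 'total']
```

Passing `accept=frozenset({"high", "medium"})` would flag only the low-confidence fields.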

What languages are supported?

Visual extraction works with documents in all common languages, including Dutch, English, German, French and Spanish. Extracted data fields are standardised regardless of the source language. Email extraction is language-independent.

What does data extraction cost?

Costs vary by extraction type. PDF forms with AcroForm fields cost 1 credit. Invoices and tables cost 2 credits per PDF (3 per Word/Excel file). Legal documents cost 3 credits per PDF (4 per Word file). Emails cost 2 credits per file. New users receive 15 free credits on registration.
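
As a worked example, the prices above can be put in a lookup table to estimate the cost of a batch. This is a sketch only. The keys and the name `batch_cost` are illustrative, and combinations the page does not price (such as floor plans) are left out.

```python
# Credits per file, keyed by (extraction type, source format), as listed above.
CREDITS = {
    ("form", "pdf"): 1,      # PDF forms with AcroForm fields
    ("invoice", "pdf"): 2,
    ("invoice", "word"): 3,
    ("invoice", "excel"): 3,
    ("table", "pdf"): 2,
    ("table", "word"): 3,
    ("table", "excel"): 3,
    ("legal", "pdf"): 3,
    ("legal", "word"): 4,
    ("email", "email"): 2,
}

def batch_cost(jobs: list[tuple[str, str]]) -> int:
    """Sum the credits for a list of (extraction type, source format) jobs."""
    total = 0
    for job in jobs:
        if job not in CREDITS:
            raise KeyError(f"no published price for {job}")
        total += CREDITS[job]
    return total

# Three PDF invoices, one Word contract and one email.
jobs = [("invoice", "pdf")] * 3 + [("legal", "word"), ("email", "email")]
cost = batch_cost(jobs)
print(cost, "credits;", 15 - cost, "of the 15 free credits left")
```

This batch costs 3 × 2 + 4 + 2 = 12 credits, leaving 3 of the 15 free credits.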

Can I automate extraction through an API?

Yes. Via the REST API you can submit files and receive structured results as JSON. Webhooks notify you when processing is complete. The API supports all six extraction types and all file formats.

Ready to extract data from your documents?

Create a free account and receive 15 credits to get started.