User guide · Part 7 bAIbel AV
English
Draft

Preparing PDFs for translation

A PDF is not a translatable document. It is a fixed picture of a page, often with no clean text underneath. Before you can translate one, you need to turn it into structured text. The PDF Preparation feature does exactly that: it converts a PDF into clean Markdown, HTML, or structured data, ready to bring into a project. This guide explains the cycle and the conversion options, so you can match the method to the document and to its confidentiality.

The same privacy thinking as guide 5

Converting a PDF can mean sending it to an outside service, just like translating with a model can. The choices here follow the same logic as choosing an LLM: you can keep the work entirely on your own machine, send it to a cloud service, or run it on hardware you control. Each card in the tool is marked 🔒 Local or ☁ Cloud so the choice is always visible.

The preparation cycle

Open PDF Preparation from the main navigation. Every conversion follows the same short cycle, whether you are doing one file or many.

  1. Choose Single file or Batch.
  2. Pick the Source PDF (or add several), and an Output to folder.
  3. Choose a conversion provider and its model from the cards.
  4. Choose an Output format.
  5. Run the conversion with Convert (or Start batch).
  6. Review the result, optionally clean it up, then open it or send it straight into a new project.

Figure 1. The PDF Preparation screen. Choose single or batch, pick the source and output folder, then select a provider.

The conversion options

The heart of the tool is the choice of provider — the engine that reads your PDF. Each is presented as a card with its strengths, a privacy badge, and a status showing whether it is ready to use. There are five, and they suit very different documents.

Figure 2. The provider cards. Each shows what it is good at, a 🔒 Local or ☁ Cloud badge, and a readiness status.

Provider | Where it runs | Best for | Privacy
Local pdf4llm | Your own machine (🔒 Local) | Clean, digital PDFs. Weaker on scans and complex tables. | Nothing leaves your device.
Mistral OCR | Cloud (Mistral) | Layout-rich PDFs and tables, producing clean Markdown or structured data. | Your PDF is uploaded to Mistral.
AWS Textract | Cloud (Amazon) | Tables, forms, and key-value pairs from scanned PDFs. | Your PDF is staged in your own AWS storage and deleted after processing.
AWS SageMaker (custom) | Your own AWS endpoint | Your own OCR or document model, if you have one deployed. | Sent only to your own endpoint, within your AWS account.
RunPod | A GPU pod you rent (Cloud) | Scientific or layout-heavy PDFs, using models such as Marker, MinerU, or Nougat. | Sent to a pod you own, which is torn down a few minutes after the job.

A card may show it is not ready

Cloud providers need a key, and local or self-hosted ones need a server or endpoint. A card that is not yet set up shows a status such as ⚠ key not set or ⚠ server not set, with a link to Integrations to configure it. Add what it asks for and the card becomes available.

Adding your own models (catalogs)

The two flexible providers, SageMaker and RunPod, let you register your own models in a catalog. Use Manage models on the card to open the catalog, where you can + Add an entry and, for RunPod, use Reset to seed to restore the built-in Marker, MinerU, and Nougat models. Each entry records the model’s name, its output format, and the technical details needed to call it.
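
For readers who like to see the idea as data, here is a minimal sketch of what a catalog entry might hold. It only mirrors the three things named above (name, output format, call details). The field names, the example values, and the checks are assumptions made for illustration, not the tool's actual schema.

```python
from dataclasses import dataclass, field

# Output formats named in this guide.
KNOWN_FORMATS = {"markdown", "html", "json", "xml"}


@dataclass
class CatalogEntry:
    """Hypothetical shape of one catalog entry (not the tool's real schema)."""
    name: str                                          # model name shown on the card
    output_format: str                                 # one of KNOWN_FORMATS
    call_details: dict = field(default_factory=dict)   # e.g. endpoint, settings

    def problems(self) -> list[str]:
        """Return human-readable problems; an empty list means usable."""
        found = []
        if not self.name.strip():
            found.append("name is empty")
        if self.output_format not in KNOWN_FORMATS:
            found.append(f"unknown output format: {self.output_format}")
        return found


entry = CatalogEntry(
    name="my-ocr-model",                        # hypothetical model name
    output_format="markdown",
    call_details={"endpoint": "my-endpoint"},   # hypothetical value
)
print(entry.problems())
```

Checking an entry before saving it is one way to avoid a card that looks ready but fails at conversion time.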

Figure 3. A model catalog. Register your own models, or restore the built-in set, then pick one on the provider card.

Output formats

Each provider can write its result in one or more formats. Pick the one that suits what you will do next.

Format | What it is | Good when
Markdown | Clean, lightly formatted text. Supported by every provider. | The usual choice for translation. Simple and readable.
HTML | Text with richer structure. | You need headings, tables, and layout carried through.
Structured JSON | The document broken into data, page by page. | You want to process the content programmatically.
Structured XML | A structured, tagged representation. | A downstream tool expects XML.
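
If you choose Structured JSON in order to process the content programmatically, a short script can walk the result page by page. The sketch below assumes a simple layout with a `pages` list whose items carry `page` and `text` keys. That layout is hypothetical: the real structure depends on the provider and model, so inspect a real output file first.

```python
import json

# Hypothetical sample: the real JSON layout depends on the provider and model.
sample = """
{"pages": [
  {"page": 1, "text": "Introduction"},
  {"page": 2, "text": "Terms and conditions"}
]}
"""


def page_texts(raw: str) -> list[tuple[int, str]]:
    """Return (page number, text) pairs from the assumed layout."""
    doc = json.loads(raw)
    return [(p["page"], p["text"]) for p in doc["pages"]]


for number, text in page_texts(sample):
    print(f"Page {number}: {len(text.split())} word(s)")
```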

There is also an option to Retain inline image data in output. Left off, images are saved to a side folder and referenced, which keeps the text file small and clean. Turn it on only if you need the images embedded.

Understanding the cost

Providers bill in one of two ways, per call or by running time, and the tool reflects both.

Provider type | How it bills | What you see
Per-call (Mistral, Textract) | A charge per document or per page. | Billing is handled through your provider account.
Time-billed (RunPod) | For the seconds the rented GPU pod is running. | An elapsed-time meter, which doubles as the cost meter.

Because a RunPod GPU bills for its running time, the tool warns you before it starts a fresh pod. The Start a new RunPod pod? notice explains that starting one takes a few minutes and bills for the pod’s lifetime, and that the pod shuts down automatically a few minutes after the job. You can choose Don’t ask again this session once you are comfortable.
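
To see why the elapsed-time meter doubles as a cost meter, here is a rough sketch of the arithmetic. The durations and the hourly rate are invented for illustration. Your actual rate depends on the GPU you rent, so check your RunPod account for real figures.

```python
def pod_cost(elapsed_seconds: float, hourly_rate: float) -> float:
    """Cost of a time-billed pod: running time multiplied by the rate."""
    if elapsed_seconds < 0 or hourly_rate < 0:
        raise ValueError("elapsed time and rate must not be negative")
    return elapsed_seconds / 3600 * hourly_rate


# Hypothetical figures: 4 min start-up + 6 min conversion + 5 min idle
# before the automatic shutdown, at an assumed rate of 0.60 per hour.
total_seconds = (4 + 6 + 5) * 60
print(round(pod_cost(total_seconds, 0.60), 2))
```

The point of the example is that the whole lifetime of the pod counts, not only the minutes spent converting.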

Figure 4. The RunPod start notice. It appears before a time-billed pod starts, so a cost is never incurred by surprise.

Cleaning up the result

OCR is rarely perfect. After a conversion, the Post-process this output section offers cleanup steps that apply to your result — for example fixing punctuation and spacing, or converting structured output into tidy Markdown. Each step has a short description and a Run button, and you can run more than one.
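
To give a feel for what a punctuation and spacing step might do, here is a minimal sketch. It is not the tool's own cleanup code, and the `tidy` function and its rules are assumptions. The rules follow English spacing, so they would be wrong for languages such as French, which keep a space before some punctuation marks.

```python
import re


def tidy(text: str) -> str:
    """Illustrative cleanup: normalise spacing around punctuation."""
    text = re.sub(r"[ \t]+", " ", text)                 # collapse runs of spaces
    text = re.sub(r" +([,.;:!?])", r"\1", text)         # no space before punctuation
    text = re.sub(r"([,;])(?=[^\W\d_])", r"\1 ", text)  # space after , or ; before a letter
    return "\n".join(line.strip() for line in text.splitlines())


print(tidy("The  contract ,signed in 2021 ,is valid ."))
```

The third rule only fires before a letter, so numbers such as 1,000 are left alone.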

Best-effort, not magic

Cleanup improves a result; it does not guarantee a perfect one. Review the output before you translate it, especially for scanned or layout-heavy documents.

Figure 5. A finished conversion. Open or reveal the file, run optional cleanup steps, or send it straight into a new project.

Converting many PDFs at once

Switch to the Batch tab to convert a set of PDFs in one run. Add files with Add files… or by dropping them onto the list, then start the run. A progress table tracks every file with a status — queued, converting, done, failed, or cancelled — and an elapsed meter for the whole batch. You can open or process each finished file from its row.
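
The five statuses can be read as a small state model: a file is finished once it is done, failed, or cancelled, and the batch is finished when every file is. The sketch below illustrates that reading. The `Status` and `summarise` names and the file names are hypothetical, and the tool's internals may differ.

```python
from collections import Counter
from enum import Enum


class Status(Enum):
    QUEUED = "queued"
    CONVERTING = "converting"
    DONE = "done"
    FAILED = "failed"
    CANCELLED = "cancelled"


# A file is finished once it reaches one of these statuses.
FINISHED = {Status.DONE, Status.FAILED, Status.CANCELLED}


def summarise(batch: dict[str, Status]) -> tuple[Counter, bool]:
    """Count files per status and report whether the whole batch is finished."""
    counts = Counter(status.value for status in batch.values())
    finished = all(status in FINISHED for status in batch.values())
    return counts, finished


# Hypothetical file names.
batch = {
    "contract.pdf": Status.DONE,
    "annex-a.pdf": Status.CONVERTING,
    "annex-b.pdf": Status.QUEUED,
}
counts, finished = summarise(batch)
print(dict(counts), finished)
```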

Figure 6. Batch progress. Every file has its own status and elapsed time, with the cumulative meter at the top.

From result to translation

When the result looks right, choose Use in new project… to carry it straight into a project. From there you pick up the core flow from guide 1: set the language pair, apply your privacy settings, translate with the memory and termbase, and export. The PDF has become a normal translatable document.

Matching the method to the document

As with everything else in bAIbel AV, the right choice depends on the document.

Your PDF | A good choice
A clean, digital PDF, and confidential | Local pdf4llm. Capable enough for clean text, and nothing leaves your machine.
A scanned or layout-heavy PDF, not confidential | Mistral OCR or AWS Textract. Strong on tables and scans.
A scientific or complex PDF | RunPod, with a model such as Marker or MinerU.
A highly confidential PDF | Keep it in-house: Local pdf4llm, or your own SageMaker or RunPod setup, so the document stays under your control.

Figure 7. Match the provider to the PDF. Clean and confidential favours local; scanned and public favours a strong cloud OCR; complex favours a GPU model.
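
The guidance above can be written as a small decision function. This is only a sketch of the reasoning, not a feature of the tool. The `kind` labels are hypothetical, and the suggestion for a clean PDF that is not confidential is an assumption, since the guide does not cover that case.

```python
def suggest_providers(kind: str, confidential: bool) -> list[str]:
    """Suggest providers for a PDF, following the guidance in this guide.

    kind is "clean", "scanned" or "complex" (hypothetical labels).
    """
    if kind not in {"clean", "scanned", "complex"}:
        raise ValueError(f"unknown kind: {kind}")
    if kind == "clean":
        # Assumption: local is also a sensible default for non-confidential files.
        return ["Local pdf4llm"]
    if confidential:
        # Keep it in-house: local, or infrastructure you control.
        return ["Local pdf4llm", "AWS SageMaker (custom)", "RunPod"]
    if kind == "scanned":
        return ["Mistral OCR", "AWS Textract"]
    return ["RunPod"]


print(suggest_providers("scanned", confidential=False))
print(suggest_providers("complex", confidential=True))
```

Confidentiality is checked before document type for scanned and complex files, because a stronger engine is no help if the document should not leave your control.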

Terminology used in this guide

PDF Preparation
The feature that converts a PDF into structured text ready for translation.
Provider
The engine that reads a PDF, such as Local pdf4llm, Mistral OCR, AWS Textract, AWS SageMaker, or RunPod.
OCR
Optical character recognition — reading text out of an image or scan.
Catalog
The list of your own registered models for the SageMaker and RunPod providers.
Output format
The shape of the converted result: Markdown, HTML, structured JSON, or structured XML.
Post-processing
Optional cleanup steps applied to a result, such as fixing spacing and punctuation.