A PDF is not a translatable document. It is a fixed picture of a page, often with no clean text underneath. Before you can translate one, you need to turn it into structured text. The PDF Preparation feature does exactly that: it converts a PDF into clean Markdown, HTML, or structured data, ready to bring into a project. This guide explains the cycle and the conversion options, so you can match the method to the document and to its confidentiality.
Converting a PDF can mean sending it to an outside service, just as translating with a model can. The choices here follow the same logic as choosing an LLM: you can keep the work entirely on your own machine, send it to a cloud service, or run it on hardware you control. Each card in the tool is marked 🔒 Local or ☁ Cloud so the choice is always visible.
Open PDF Preparation from the main navigation. Every conversion follows the same short cycle, whether you are converting one file or many.
The heart of the tool is the choice of provider — the engine that reads your PDF. Each is presented as a card with its strengths, a privacy badge, and a status showing whether it is ready to use. There are five, and they suit very different documents.
| Provider | Where it runs | Best for | Privacy |
|---|---|---|---|
| Local pdf4llm | Your own machine (🔒 Local) | Clean, digital PDFs. Weaker on scans and complex tables. | Nothing leaves your device. |
| Mistral OCR | Cloud (Mistral) | Layout-rich PDFs and tables, producing clean Markdown or structured data. | Your PDF is uploaded to Mistral. |
| AWS Textract | Cloud (Amazon) | Tables, forms, and key-value pairs from scanned PDFs. | Your PDF is staged in your own AWS storage and deleted after processing. |
| AWS SageMaker (custom) | Your own AWS endpoint | Your own OCR or document model, if you have one deployed. | Sent only to your own endpoint, within your AWS account. |
| RunPod | A GPU pod you rent (Cloud) | Scientific or layout-heavy PDFs, using models such as Marker, MinerU, or Nougat. | Sent to a pod you own, which is torn down a few minutes after the job. |
Cloud providers need a key, and local or self-hosted ones need a server or endpoint. A card that is not yet set up shows a status such as ⚠ key not set or ⚠ server not set, with a link to Integrations to configure it. Add what it asks for and the card becomes available.
The two flexible providers, SageMaker and RunPod, let you register your own models in a catalog. Use Manage models on the card to open the catalog. There you can + Add an entry and, for RunPod, use Reset to seed to restore the built-in Marker, MinerU, and Nougat models. Each entry records the model’s name, its output format, and the technical details needed to call it.
Each provider can write its result in one or more formats. Pick the one that suits what you will do next.
| Format | What it is | Good when |
|---|---|---|
| Markdown | Clean, lightly formatted text. Supported by every provider. | The usual choice for translation. Simple and readable. |
| HTML | Text with richer structure. | You need headings, tables, and layout carried through. |
| Structured JSON | The document broken into data, page by page. | You want to process the content programmatically. |
| Structured XML | A structured, tagged representation. | A downstream tool expects XML. |
There is also an option to Retain inline image data in output. Left off, images are saved to a side folder and referenced, which keeps the text file small and clean. Turn it on only if you need the images embedded.
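To make the difference concrete, here is a minimal sketch of the two ways an image can appear in Markdown output. It is an illustration only, not the tool's own code: the function names and the `images/` folder name are assumptions made for the example.

```python
import base64


def image_as_reference(filename: str, alt: str = "") -> str:
    """Option off: the image lives in a side folder and the text only links to it.

    The folder name "images/" is a placeholder, not the tool's actual path.
    """
    return f"![{alt}](images/{filename})"


def image_as_inline(data: bytes, mime: str = "image/png", alt: str = "") -> str:
    """Option on: the image bytes are embedded in the text as a base64 data URI."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"![{alt}](data:{mime};base64,{encoded})"


# A 30 kB image adds roughly 40 kB of base64 text when embedded,
# while a reference stays a few dozen characters long.
fake_image = bytes(30_000)
reference = image_as_reference("page1-fig1.png", alt="Figure 1")
inline = image_as_inline(fake_image, alt="Figure 1")
print(len(reference), len(inline))
```

Base64 encoding grows the data by about a third, which is why embedded images make the text file noticeably larger and harder to read or translate.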
Providers bill in one of two ways, per call or by running time, and the tool reflects both.
| Provider type | How it bills | What you see |
|---|---|---|
| Per-call (Mistral, Textract) | A charge per document or per page. | Billing is handled through your provider account. |
| Time-billed (RunPod) | For the seconds the rented GPU pod is running. | An elapsed-time meter, which doubles as the cost meter. |
Because a RunPod GPU bills for its running time, the tool warns you before it starts a fresh pod. The Start a new RunPod pod? notice explains that starting one takes a few minutes and bills for the pod’s lifetime, and that the pod shuts down automatically a few minutes after the job. You can choose Don’t ask again this session once you are comfortable.
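The arithmetic behind a time-billed job can be sketched as follows. This is a rough estimate under stated assumptions, not the tool's own calculation: the hourly rate and all durations are made-up inputs, and whether startup time is billed depends on your RunPod account terms.

```python
def estimate_pod_cost(
    job_seconds: float,
    startup_seconds: float,
    idle_seconds: float,
    hourly_rate_usd: float,
) -> float:
    """Estimate the cost of one pod lifetime.

    Assumes the pod is billed from start to teardown: startup time,
    the conversion itself, and the idle minutes before automatic shutdown.
    """
    lifetime_seconds = startup_seconds + job_seconds + idle_seconds
    return round(lifetime_seconds / 3600 * hourly_rate_usd, 4)


# Hypothetical numbers: 3 min startup, 10 min conversion, 5 min idle, 2.00 USD/hour.
cost = estimate_pod_cost(
    job_seconds=600, startup_seconds=180, idle_seconds=300, hourly_rate_usd=2.0
)
print(cost)
```

The point of the sketch is that a short job on a fresh pod pays for the whole pod lifetime, not just the conversion, which is why converting several files in one run is usually cheaper than starting a pod for each.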
OCR is rarely perfect. After a conversion, the Post-process this output section offers cleanup steps that apply to your result — for example fixing punctuation and spacing, or converting structured output into tidy Markdown. Each step has a short description and a Run button, and you can run more than one.
Cleanup improves a result; it does not guarantee a perfect one. Review the output before you translate it, especially for scanned or layout-heavy documents.
Switch to the Batch tab to convert a set of PDFs in one run. Add files with Add files… or by dropping them onto the list, then start the run. A progress table tracks every file with a status — queued, converting, done, failed, or cancelled — and an elapsed meter for the whole batch. You can open or process each finished file from its row.
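If it helps to think of the progress table as data, the sketch below models the five statuses and a simple summary. It is a hypothetical model for illustration: the class and function names are invented, and treating done, failed, and cancelled as final states is an assumption.

```python
from collections import Counter
from enum import Enum


class FileStatus(Enum):
    QUEUED = "queued"
    CONVERTING = "converting"
    DONE = "done"
    FAILED = "failed"
    CANCELLED = "cancelled"


# Assumption: these three statuses mean the file will not change again.
FINAL_STATUSES = {FileStatus.DONE, FileStatus.FAILED, FileStatus.CANCELLED}


def summarize(batch: dict[str, FileStatus]) -> dict[str, int]:
    """Count how many files are in each status, including statuses with zero files."""
    counts = Counter(status.value for status in batch.values())
    return {status.value: counts.get(status.value, 0) for status in FileStatus}


def batch_finished(batch: dict[str, FileStatus]) -> bool:
    """The batch is finished once every file has reached a final status."""
    return all(status in FINAL_STATUSES for status in batch.values())


batch = {
    "contract.pdf": FileStatus.DONE,
    "scan-01.pdf": FileStatus.FAILED,
    "report.pdf": FileStatus.QUEUED,
}
print(summarize(batch))
print(batch_finished(batch))
```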
When the result looks right, choose Use in new project… to carry it straight into a project. From there you pick up the core flow from guide 1: set the language pair, apply your privacy settings, translate with the memory and termbase, and export. The PDF has become a normal translatable document.
As with everything else in bAIbel AV, the right choice depends on the document.
| Your PDF | A good choice |
|---|---|
| A clean, digital PDF — and confidential | Local pdf4llm. Capable enough for clean text, and nothing leaves your machine. |
| A scanned or layout-heavy PDF, not confidential | Mistral OCR or AWS Textract. Strong on tables and scans. |
| A scientific or complex PDF | RunPod, with a model such as Marker or MinerU. |
| A highly confidential PDF | Keep it in-house: Local pdf4llm, or your own SageMaker or RunPod setup, so the document stays under your control. |
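The table above can be read as a small decision rule. The sketch below is one way to write it down, not a feature of the tool. It makes three assumptions: "confidential" and "highly confidential" are treated as a single yes/no flag, confidentiality is checked before document type, and a clean PDF that is not confidential defaults to Local pdf4llm, as the provider table suggests.

```python
def suggest_providers(kind: str, confidential: bool) -> list[str]:
    """Suggest providers for a PDF.

    kind: "clean", "scanned" (scanned or layout-heavy), or "scientific".
    confidential: True if the document must stay under your control.
    """
    if kind not in {"clean", "scanned", "scientific"}:
        raise ValueError(f"unknown kind: {kind!r}")

    if confidential:
        if kind == "clean":
            return ["Local pdf4llm"]
        # Harder documents: stay in-house, on your own machine or your own setup.
        return ["Local pdf4llm", "AWS SageMaker (custom)", "RunPod"]

    if kind == "scientific":
        return ["RunPod"]
    if kind == "scanned":
        return ["Mistral OCR", "AWS Textract"]
    return ["Local pdf4llm"]


print(suggest_providers("scanned", confidential=False))
print(suggest_providers("scientific", confidential=True))
```

Checking confidentiality first reflects the guide's emphasis: the privacy requirement narrows the options, and document type then picks among what remains.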