PDF to Text

Extract all text from a PDF, then copy it or download it as a .txt or .md file. Runs in your browser, no upload.

About this tool

Extract every word from any PDF directly in your browser — no upload, no account, no watermark. Pick a page range, choose whether to preserve line breaks or join paragraphs, optionally add page-number headers, then copy to clipboard or download as a .txt or .md file.

🔒 100% client-side: your PDF never leaves your browser
📋 One-click copy or download as .txt / .md
🎚️ Page range support: extract just what you need
📐 Preserve line breaks or join into paragraphs
🔢 Optional per-page section headers
📊 Live word and character count
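
For readers curious how the per-page headers and the live counts could work, here is a minimal sketch. The header format (`--- Page N ---`), the `ExtractedPage` type, and both function names are hypothetical and chosen for illustration; the tool's actual output format may differ.

```typescript
// Hypothetical shape for one page of extracted text.
interface ExtractedPage {
  pageNumber: number;
  text: string;
}

// Join pages with a blank line between them, optionally prefixing each
// page with a header line. The "--- Page N ---" format is an assumption.
function assembleOutput(pages: ExtractedPage[], withHeaders: boolean): string {
  return pages
    .map((p) => (withHeaders ? `--- Page ${p.pageNumber} ---\n${p.text}` : p.text))
    .join("\n\n");
}

// Count whitespace-separated words and raw characters (UTF-16 code units).
function countWordsAndChars(text: string): { words: number; characters: number } {
  const trimmed = text.trim();
  const words = trimmed === "" ? 0 : trimmed.split(/\s+/).length;
  return { words, characters: text.length };
}
```

Counting words by splitting on whitespace is a simple choice that suits space-delimited languages; text in languages without word spacing would need a different rule.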

How to use it

Quick steps to get the most out of this utility.

  1. Drop your PDF

    Drag and drop a PDF onto the upload area, or click to browse. Any size up to 100 MB.

  2. Pick a range (optional)

    Choose All pages (default) or enter a custom page range like 1-5, 8, 10-12.

  3. Choose layout

    Preserve line breaks keeps every newline as-is. Joined paragraphs collapses single newlines into spaces while keeping paragraph breaks.

  4. Copy or download

    Click Copy to send the text to your clipboard, or download as a .txt or .md file.
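
The page range and layout options in the steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the tool's actual code: `parsePageRange` and `joinParagraphs` are hypothetical names, and clamping an out-of-range end page to the document length is one possible choice among several.

```typescript
// Hypothetical helper: expand a range string such as "1-5, 8, 10-12"
// into a sorted list of unique 1-based page numbers.
function parsePageRange(spec: string, pageCount: number): number[] {
  const pages = new Set<number>();
  for (const part of spec.split(",")) {
    const token = part.trim();
    if (token === "") continue; // tolerate stray commas
    const match = /^(\d+)(?:\s*-\s*(\d+))?$/.exec(token);
    if (!match) throw new Error(`Invalid page range: "${token}"`);
    const start = Number(match[1]);
    const end = match[2] !== undefined ? Number(match[2]) : start;
    if (start < 1 || end < start) {
      throw new Error(`Invalid page range: "${token}"`);
    }
    // Assumption: silently clamp to the last page instead of rejecting.
    for (let p = start; p <= Math.min(end, pageCount); p++) {
      pages.add(p);
    }
  }
  return Array.from(pages).sort((a, b) => a - b);
}

// Hypothetical helper: collapse single newlines into spaces while keeping
// blank-line paragraph breaks.
function joinParagraphs(text: string): string {
  return text
    .split(/\n{2,}/)
    .map((para) => para.replace(/\s*\n\s*/g, " ").trim())
    .filter((para) => para.length > 0)
    .join("\n\n");
}
```

For example, `parsePageRange("1-5, 8, 10-12", 20)` yields pages 1 through 5, 8, and 10 through 12, and `joinParagraphs("a\nb\n\nc\nd")` yields two paragraphs, `a b` and `c d`.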

When you need the text out of a PDF

PDFs are designed for consistent visual presentation, not for extracting and reusing content. Yet the need to get text out of a PDF comes up constantly — copying a contract clause into an email, feeding a research paper into a summarizer, importing a report into a spreadsheet, or archiving a document as plain text. Every server-side converter asks you to upload your file to do this, which is a privacy trade-off that is rarely worth making.

Toolisk's PDF to Text tool uses pdf.js — Mozilla's open-source PDF rendering engine — entirely inside your browser tab. The PDF bytes are read into memory, the text layer is parsed locally, and the result is displayed immediately. Nothing is transmitted anywhere. For a typical 10-page business document, extraction completes in under a second.

How pdf.js reads text

Text-based PDFs (as opposed to scans) embed a text layer alongside the visual rendering. This layer holds the string content, font, size, and XY coordinates of each run of text on every page. pdf.js walks through this layer in PDF object order and returns those runs as text items. The tool groups the items by their Y coordinate: items on the same line stay together, and a line break is inserted when the Y position changes significantly.
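
Grouping by Y coordinate might look roughly like the sketch below. It assumes each text item carries a `str` and a six-number `transform` whose last entry is the Y position, which matches the items pdf.js returns from `getTextContent()`. The function name `itemsToLines` and the `tolerance` of 2 units are hypothetical; the tool's real threshold is not documented here.

```typescript
// Minimal shape of a positioned text item. Assumption: transform[5] is the
// baseline Y coordinate, as in pdf.js text items.
interface PositionedItem {
  str: string;
  transform: number[];
}

// Hypothetical helper: walk items in document order and start a new line
// whenever Y moves by more than `tolerance` units from the previous item.
// Assumes items already carry their own spacing, so they are joined as-is.
function itemsToLines(items: PositionedItem[], tolerance = 2): string[] {
  const lines: string[] = [];
  let current: string[] = [];
  let lastY: number | null = null;

  for (const item of items) {
    const y = item.transform[5] ?? 0;
    if (lastY !== null && Math.abs(y - lastY) > tolerance) {
      lines.push(current.join(""));
      current = [];
    }
    current.push(item.str);
    lastY = y;
  }
  if (current.length > 0) {
    lines.push(current.join(""));
  }
  return lines;
}
```

Because the items are processed in the order they arrive rather than sorted by position, this sketch inherits the same column-interleaving behavior on multi-column pages.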

This approach works well for single-column documents: contracts, books, reports, slide decks exported as PDF. It works less well for multi-column academic papers, because the PDF object order often interleaves the two columns rather than reading each column top to bottom as a human would. If you are processing academic papers, extracting a few pages at a time with the page range option and then reordering the text by hand is the practical workaround until true column-aware extraction is available.

Scanned PDFs and OCR

A scanned PDF is just a sequence of images, with no text layer. Scanners and photocopiers typically produce image-only PDFs; some older document workflows do too. If you drop a scanned PDF into this tool, you will see an amber warning. It appears when fewer than 10 non-whitespace characters are found across all pages, which almost always means the file is image-only.
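
The image-only check described above comes down to a character count. A minimal sketch follows, assuming the extracted text is available as one string per page. The names `countNonWhitespace`, `looksImageOnly`, and `MIN_TEXT_CHARS` are hypothetical.

```typescript
// Threshold quoted in the text: fewer than 10 non-whitespace characters
// across all pages is treated as image-only.
const MIN_TEXT_CHARS = 10;

// Count characters that are not whitespace (measured in UTF-16 code units).
function countNonWhitespace(text: string): number {
  return text.replace(/\s/g, "").length;
}

// Hypothetical helper: true when the whole document yields almost no text.
function looksImageOnly(pageTexts: string[]): boolean {
  const total = pageTexts.reduce((sum, t) => sum + countNonWhitespace(t), 0);
  return total < MIN_TEXT_CHARS;
}
```

Summing across all pages, rather than checking each page on its own, means a mostly scanned file with one real text page would not trigger the warning.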

Extracting text from scanned pages requires Optical Character Recognition (OCR) — a much heavier operation that involves running a neural network over image data. Browser-based OCR (using Tesseract.js or similar) is technically possible but is slow and memory-intensive. We plan to add a dedicated scanned-PDF OCR tool in the future. For now, if you need OCR, Adobe Acrobat, Google Drive (upload and open as Docs), or a dedicated OCR service are the practical options.

Why no upload matters for text documents

Text-heavy PDFs are often the most sensitive documents you handle: signed contracts, non-disclosure agreements, legal filings, medical records, financial statements. Uploading these to a third-party server — even one with good privacy policies — means the content has left your control. Toolisk's client-side approach means the text never touches a server. This is not a privacy promise — it is a technical property of how the tool works.

Frequently asked questions

Is this safe? Does it upload my PDF?

No upload whatsoever. The entire extraction runs in your browser using JavaScript and the open-source pdf.js library. Your PDF never leaves your device, is never sent to a server, and is never logged. This makes it safe for sensitive files like contracts, medical reports, or financial statements.

What is the maximum file size?

PDFs up to 100 MB are accepted. Files over 30 MB may be slower on mobile. For very large PDFs, consider splitting the file first using the Split PDF tool, then extracting text from each part.

Does it work offline?

After the page has loaded once, yes — the PDF engine is cached in your browser and extraction runs locally even without an internet connection.

Will this work on iPhone / iPad?

Yes, on modern iOS Safari. iOS limits per-tab memory, so very large PDFs may be slow. For best results on mobile, use the page range option to extract just the pages you need.

Why is the extracted text empty?

Your PDF is probably a scan (a photo of a page, not a real text PDF). Real PDFs have a hidden text layer that we can read; scans don't. Extracting text from scans needs OCR, which we don't yet offer client-side.

Why does the text look jumbled on two-column papers?

pdf.js reads text in PDF object order, which on multi-column layouts often interleaves the columns. Single-column layouts (most office docs, books, contracts) extract cleanly.

What's the difference between .txt and .md output?

The content is identical. The .md extension just means apps that recognize Markdown (Obsidian, VS Code, GitHub) will treat the file like a Markdown document. Pick whatever your downstream tool prefers.

Can I extract tables as proper rows and columns?

Not reliably — PDF doesn't store tables as structured data, just as text positioned at coordinates. The output will contain all the table cells in roughly reading order, but you'll need to reformat manually.
