Extracting Text from PDF Pages

Description

Text extraction is one of the most common operations when working with PDFs. PDFium provides powerful text extraction capabilities through its WebAssembly API. This guide explains how to extract text from PDF pages, which is useful for searching, indexing, and making PDF content accessible.

Interactive Example

This interactive example demonstrates how to extract text from PDF pages. You can navigate between pages to see the extracted text from each page:

PREVIEW:Extract Text from PDF
Loading...

Extracted Text:

No text extracted yet.

The Basic Workflow

Extracting text from a PDF page involves several steps:

  1. Initialize PDFium
  2. Load the PDF document
  3. Load the specific page
  4. Create a text page object for text extraction
  5. Get the character count
  6. Allocate a buffer for the text
  7. Extract the text into the buffer
  8. Convert the text from UTF-16LE to a JavaScript string
  9. Clean up resources

Prerequisites

The examples in this guide use the initializePdfium helper function from our Getting Started guide. This function properly initializes the PDFium library, including calling the required PDFiumExt_Init() function. Make sure to include this initialization code in your application before trying the examples below.

Example Implementation

Here’s a complete implementation for loading a PDF document and extracting text from its pages:

import { initializePdfium } from './initialize-pdfium'; // Note: The initializePdfium function is a helper that initializes the PDFium library. // For the full implementation, see: /docs/pdfium/getting-started /** * Loads a PDF document and returns an object with methods to access and extract text from it. * This example focuses solely on extracting text from PDF pages. */ export async function loadPdfForTextExtraction(pdfData: Uint8Array) { // Initialize PDFium const pdfium = await initializePdfium(); // Allocate memory for the PDF data const filePtr = pdfium.pdfium.wasmExports.malloc(pdfData.length); pdfium.pdfium.HEAPU8.set(pdfData, filePtr); // Load the document const docPtr = pdfium.FPDF_LoadMemDocument(filePtr, pdfData.length, 0); if (!docPtr) { const error = pdfium.FPDF_GetLastError(); pdfium.pdfium.wasmExports.free(filePtr); // Handle password-protected documents if (error === 4) { return { hasPassword: true, pageCount: 0, close: () => {}, // No-op since no document was loaded extractText: async () => { throw new Error('Document is password protected'); } }; } throw new Error(`Failed to load PDF: ${error}`); } // Get page count const pageCount = pdfium.FPDF_GetPageCount(docPtr); // Return an object with document info and text extraction capabilities return { hasPassword: false, pageCount, // Close the document and free resources close: () => { pdfium.FPDF_CloseDocument(docPtr); pdfium.pdfium.wasmExports.free(filePtr); }, // Extract all text from a specific page extractText: async (pageIndex: number): Promise<string> => { // Check if the page index is valid if (pageIndex < 0 || pageIndex >= pageCount) { throw new Error(`Invalid page index: ${pageIndex}. Document has ${pageCount} pages.`); } // Load the page for rendering const pagePtr = pdfium.FPDF_LoadPage(docPtr, pageIndex); if (!pagePtr) { throw new Error(`Failed to load page ${pageIndex}`); } try { // Load the page text (a separate structure for text extraction) const textPagePtr = pdfium.FPDFText_LoadPage(pagePtr); if (!textPagePtr) { throw new Error(`Failed to load text for page ${pageIndex}`); } try { // Get the character count on the page const charCount = pdfium.FPDFText_CountChars(textPagePtr); if (charCount <= 0) { return ''; // No text on this page or error } // Allocate a buffer for the text (+1 for null terminator) const bufferSize = (charCount + 1) * 2; // UTF-16, 2 bytes per character const textBufferPtr = pdfium.pdfium.wasmExports.malloc(bufferSize); try { // Extract the text (PDFium uses UTF-16LE encoding) const extractedLength = pdfium.FPDFText_GetText( textPagePtr, 0, // Start index charCount, // Character count to extract textBufferPtr // Output buffer ); // If text was extracted, convert from UTF-16LE to JavaScript string if (extractedLength > 0) { // Convert UTF-16 array to JavaScript string return pdfium.pdfium.UTF16ToString(textBufferPtr); } return ''; // No text extracted } finally { // Clean up text buffer pdfium.pdfium.wasmExports.free(textBufferPtr); } } finally { // Clean up text page pdfium.FPDFText_ClosePage(textPagePtr); } } finally { // Clean up page pdfium.FPDF_ClosePage(pagePtr); } } }; }

Usage Example

Here’s a simple example of how to use the text extraction functionality:

// Load a PDF file and extract text from its pages async function extractTextFromPdf() { try { // Fetch the PDF file const response = await fetch('sample.pdf'); const arrayBuffer = await response.arrayBuffer(); const pdfData = new Uint8Array(arrayBuffer); // Load the document const pdfDocument = await loadPdfForTextExtraction(pdfData); // Get the page count const pageCount = pdfDocument.pageCount; console.log(`PDF has ${pageCount} pages`); // Extract text from each page for (let i = 0; i < pageCount; i++) { const text = await pdfDocument.extractText(i); console.log(`--- Page ${i + 1} ---`); console.log(text || '[No text on this page]'); } // Clean up when done pdfDocument.close(); } catch (error) { console.error('Error extracting text:', error); } } extractTextFromPdf();

Memory Management Considerations

When extracting text from PDF pages, proper memory management is critical to prevent memory leaks:

  1. Text page cleanup: Always close text pages using FPDFText_ClosePage when you’re done with them.

  2. Page cleanup: Always close pages using FPDF_ClosePage when you’re done with them.

  3. Buffer cleanup: Free any allocated memory buffers using pdfium.pdfium.wasmExports.free() when you’re done with them.

  4. Nested try/finally blocks: Use nested try/finally blocks to ensure proper cleanup of resources in the correct order, even if errors occur.

Text Extraction Limitations

When working with text extraction in PDFs, be aware of these limitations:

  1. Text recognition: PDFium can only extract text that is actually stored as text in the PDF. It cannot perform OCR (Optical Character Recognition) on scanned documents or images.

  2. Text order: The order of extracted text might not always match the visual order on the page, especially for complex layouts or documents with multiple columns.

  3. Special characters: Some special characters or symbols might not be correctly extracted or might be represented differently in the extracted text.

  4. Password protection: Text cannot be extracted from password-protected PDFs without providing the correct password.

Last updated on March 31, 2025