Extracting Text from PDF Pages

Description

Text extraction is one of the most common operations when working with PDFs. PDFium provides powerful text extraction capabilities through its WebAssembly API. This guide explains how to extract text from PDF pages, which is useful for searching, indexing, and making PDF content accessible.

Interactive Example

This interactive example demonstrates how to extract text from PDF pages. You can navigate between pages to see the extracted text from each page:

PREVIEW:Extract Text from PDF

Extracted Text:

No text extracted yet.

The Basic Workflow

Extracting text from a PDF page involves several steps:

Initialize PDFium
Load the PDF document
Load the specific page
Create a text page object for text extraction
Get the character count
Allocate a buffer for the text
Extract the text into the buffer
Convert the text from UTF-16LE to a JavaScript string
Clean up resources

Prerequisites

The examples in this guide use the initializePdfium helper function from our Getting Started guide. This function properly initializes the PDFium library, including calling the required PDFiumExt_Init() function. Make sure to include this initialization code in your application before trying the examples below.

Example Implementation

Here’s a complete implementation for loading a PDF document and extracting text from its pages:


import { initializePdfium } from './initialize-pdfium';
// Note: The initializePdfium function is a helper that initializes the PDFium library.
// For the full implementation, see: /docs/pdfium/getting-started
 
/**
 * Loads a PDF document and returns an object with methods to access and extract text from it.
 * This example focuses solely on extracting text from PDF pages.
 */
export async function loadPdfForTextExtraction(pdfData: Uint8Array) {
  // Initialize PDFium
  const pdfium = await initializePdfium();
  
  // Allocate memory for the PDF data
  const filePtr = pdfium.pdfium.wasmExports.malloc(pdfData.length);
  pdfium.pdfium.HEAPU8.set(pdfData, filePtr);
  
  // Load the document
  const docPtr = pdfium.FPDF_LoadMemDocument(filePtr, pdfData.length, 0);
  
  if (!docPtr) {
    const error = pdfium.FPDF_GetLastError();
    pdfium.pdfium.wasmExports.free(filePtr);
    
    // Handle password-protected documents
    if (error === 4) {
      return {
        hasPassword: true,
        pageCount: 0,
        close: () => {}, // No-op since no document was loaded
        extractText: async () => { throw new Error('Document is password protected'); }
      };
    }
    
    throw new Error(`Failed to load PDF: ${error}`);
  }
  
  // Get page count
  const pageCount = pdfium.FPDF_GetPageCount(docPtr);
  
  // Return an object with document info and text extraction capabilities
  return {
    hasPassword: false,
    pageCount,
    
    // Close the document and free resources
    close: () => {
      pdfium.FPDF_CloseDocument(docPtr);
      pdfium.pdfium.wasmExports.free(filePtr);
    },
    
    // Extract all text from a specific page
    extractText: async (pageIndex: number): Promise<string> => {
      // Check if the page index is valid
      if (pageIndex < 0 || pageIndex >= pageCount) {
        throw new Error(`Invalid page index: ${pageIndex}. Document has ${pageCount} pages.`);
      }
      
      // Load the page for rendering
      const pagePtr = pdfium.FPDF_LoadPage(docPtr, pageIndex);
      if (!pagePtr) {
        throw new Error(`Failed to load page ${pageIndex}`);
      }
      
      try {
        // Load the page text (a separate structure for text extraction)
        const textPagePtr = pdfium.FPDFText_LoadPage(pagePtr);
        if (!textPagePtr) {
          throw new Error(`Failed to load text for page ${pageIndex}`);
        }
        
        try {
          // Get the character count on the page
          const charCount = pdfium.FPDFText_CountChars(textPagePtr);
          if (charCount <= 0) {
            return ''; // No text on this page or error
          }
          
          // Allocate a buffer for the text (+1 for null terminator)
          const bufferSize = (charCount + 1) * 2; // UTF-16, 2 bytes per character
          const textBufferPtr = pdfium.pdfium.wasmExports.malloc(bufferSize);
          
          try {
            // Extract the text (PDFium uses UTF-16LE encoding)
            const extractedLength = pdfium.FPDFText_GetText(
              textPagePtr, 
              0, // Start index
              charCount, // Character count to extract
              textBufferPtr // Output buffer
            );
            
            // If text was extracted, convert from UTF-16LE to JavaScript string
            if (extractedLength > 0) {
              // Convert UTF-16 array to JavaScript string
              return pdfium.pdfium.UTF16ToString(textBufferPtr);
            }
            
            return ''; // No text extracted
          } finally {
            // Clean up text buffer
            pdfium.pdfium.wasmExports.free(textBufferPtr);
          }
        } finally {
          // Clean up text page
          pdfium.FPDFText_ClosePage(textPagePtr);
        }
      } finally {
        // Clean up page
        pdfium.FPDF_ClosePage(pagePtr);
      }
    }
  };
}

Usage Example

Here’s a simple example of how to use the text extraction functionality:


// Load a PDF file and extract text from its pages
async function extractTextFromPdf() {
  try {
    // Fetch the PDF file
    const response = await fetch('sample.pdf');
    const arrayBuffer = await response.arrayBuffer();
    const pdfData = new Uint8Array(arrayBuffer);
    
    // Load the document
    const pdfDocument = await loadPdfForTextExtraction(pdfData);
    
    // Get the page count
    const pageCount = pdfDocument.pageCount;
    console.log(`PDF has ${pageCount} pages`);
    
    // Extract text from each page
    for (let i = 0; i < pageCount; i++) {
      const text = await pdfDocument.extractText(i);
      console.log(`--- Page ${i + 1} ---`);
      console.log(text || '[No text on this page]');
    }
    
    // Clean up when done
    pdfDocument.close();
  } catch (error) {
    console.error('Error extracting text:', error);
  }
}
 
extractTextFromPdf();

Memory Management Considerations

When extracting text from PDF pages, proper memory management is critical to prevent memory leaks:

Text page cleanup: Always close text pages using FPDFText_ClosePage when you’re done with them.
Page cleanup: Always close pages using FPDF_ClosePage when you’re done with them.
Buffer cleanup: Free any allocated memory buffers using pdfium.pdfium.wasmExports.free() when you’re done with them.
Nested try/finally blocks: Use nested try/finally blocks to ensure proper cleanup of resources in the correct order, even if errors occur.

Text Extraction Limitations

When working with text extraction in PDFs, be aware of these limitations:

Text recognition: PDFium can only extract text that is actually stored as text in the PDF. It cannot perform OCR (Optical Character Recognition) on scanned documents or images.
Text order: The order of extracted text might not always match the visual order on the page, especially for complex layouts or documents with multiple columns.
Special characters: Some special characters or symbols might not be correctly extracted or might be represented differently in the extracted text.
Password protection: Text cannot be extracted from password-protected PDFs without providing the correct password.

FPDF_LoadMemDocument - Load a PDF document from memory
FPDF_GetPageCount - Get the number of pages in a document
FPDF_LoadPage - Load a specific page from a document
FPDFText_LoadPage - Load a page for text extraction
FPDFText_CountChars - Get the number of characters on a page
FPDFText_GetText - Extract text from a page
FPDFText_ClosePage - Close a text page and release resources
FPDF_ClosePage - Close a page and release resources
FPDF_CloseDocument - Close a document and release resources
FPDF_GetLastError - Get the error code for the last failed operation