FPDFText_LoadPage

FPDFText_LoadPage(page)

Description

Creates a text page object from a loaded PDF page. This is the first step for any text operation on that page, such as extracting text content, searching for text, or retrieving character positions. The text page object provides access to the text content and structure of the PDF page.

Prerequisites

The example below uses the initializePdfium helper function from our Getting Started guide. Make sure to include this function in your code before running it.

Parameters

Name	Type	Description
page	number	A page handle obtained from `FPDF_LoadPage`.

Return Value

Returns a handle to the text page object on success, or 0 on failure. PDFium's headers tie FPDF_GetLastError() to the document-loading functions, so the code it returns after a failed FPDFText_LoadPage call may not describe what went wrong. Treat a 0 return value itself as the failure signal.
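A minimal sketch of checking the return value before doing anything else with it. It assumes a wrapper object that exposes FPDFText_LoadPage as in the example on this page; the `TextPageLoader` interface and the `loadTextPageOrThrow` helper are hypothetical names introduced only for illustration.

```typescript
// Assumed minimal shape of the wrapper object; the real object exposes far more.
interface TextPageLoader {
  FPDFText_LoadPage(page: number): number;
}

// Hypothetical helper: returns a non-zero text page handle or throws.
function loadTextPageOrThrow(pdfium: TextPageLoader, pagePtr: number): number {
  const textPagePtr = pdfium.FPDFText_LoadPage(pagePtr);
  if (textPagePtr === 0) {
    // 0 is the documented failure value; no text page was created.
    throw new Error('FPDFText_LoadPage returned 0');
  }
  return textPagePtr;
}
```

The caller still owns the returned handle and is responsible for passing it to FPDFText_ClosePage when finished.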

Example


// Note: The initializePdfium function is a helper that initializes the PDFium library.
// For the full implementation, see: /docs/pdfium/getting-started
import { initializePdfium } from './initialize-pdfium';
 
async function extractTextFromPage(pdfData: Uint8Array, pageIndex: number): Promise<string> {
  // Initialize PDFium
  const pdfium = await initializePdfium();
  
  // Load the PDF document
  const filePtr = pdfium.pdfium.wasmExports.malloc(pdfData.length);
  pdfium.pdfium.HEAPU8.set(pdfData, filePtr);
  const docPtr = pdfium.FPDF_LoadMemDocument(filePtr, pdfData.length, 0);
  
  if (!docPtr) {
    const error = pdfium.FPDF_GetLastError();
    pdfium.pdfium.wasmExports.free(filePtr);
    throw new Error(`Failed to load PDF: ${error}`);
  }
  
  try {
    // Load the specified page
    const pagePtr = pdfium.FPDF_LoadPage(docPtr, pageIndex);
    if (!pagePtr) {
      throw new Error(`Failed to load page ${pageIndex}`);
    }
    
    try {
      // Create a text page object for text extraction
      const textPagePtr = pdfium.FPDFText_LoadPage(pagePtr);
      if (!textPagePtr) {
        throw new Error(`Failed to load text for page ${pageIndex}`);
      }
      
      try {
        // Get the character count
        const charCount = pdfium.FPDFText_CountChars(textPagePtr);
        if (charCount <= 0) {
          return ''; // No text on this page (FPDFText_CountChars returns -1 on error)
        }
        
        // Allocate a buffer for the text (+1 for null terminator)
        const bufferSize = (charCount + 1) * 2; // UTF-16, 2 bytes per character
        const textBufferPtr = pdfium.pdfium.wasmExports.malloc(bufferSize);
        
        try {
          // Extract the text
          const extractedLength = pdfium.FPDFText_GetText(
            textPagePtr,
            0,          // Start index
            charCount,  // Character count
            textBufferPtr
          );
          
          // Convert the UTF-16LE text to a JavaScript string
          if (extractedLength > 0) {
            return pdfium.pdfium.UTF16ToString(textBufferPtr);
          }
          
          return '';
        } finally {
          // Clean up text buffer
          pdfium.pdfium.wasmExports.free(textBufferPtr);
        }
      } finally {
        // Clean up text page
        pdfium.FPDFText_ClosePage(textPagePtr);
      }
    } finally {
      // Clean up page
      pdfium.FPDF_ClosePage(pagePtr);
    }
  } finally {
    // Clean up document
    pdfium.FPDF_CloseDocument(docPtr);
    pdfium.pdfium.wasmExports.free(filePtr);
  }
}
 
// Usage
fetch('sample.pdf')
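  .then(response => {
    // Fail early on HTTP errors instead of passing an error page to PDFium
    if (!response.ok) throw new Error(`HTTP ${response.status} while fetching sample.pdf`);
    return response.arrayBuffer();
  })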
  .then(buffer => extractTextFromPage(new Uint8Array(buffer), 0))
  .then(text => console.log('Extracted text:', text))
  .catch(error => console.error('Error:', error));

Best Practices

Always close text pages: When you’re done with a text page object, make sure to call FPDFText_ClosePage to release the resources associated with it. The best pattern is to use a try/finally block.
One text page per PDF page: Create a new text page object for each PDF page you process. A text page handle is tied to the page it was created from and can't be reused for another page.
Check the return value: Always check if FPDFText_LoadPage returns a valid handle before proceeding with text operations.
Load the page before text extraction: Remember that you need to load the PDF page with FPDF_LoadPage before you can create a text page from it.
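Error handling: If FPDFText_LoadPage returns 0, stop and report the failure for that page. FPDF_GetLastError() is tied to document loading in PDFium's headers, so don't rely on it to explain a text page failure.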
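The practices above can be folded into one small wrapper. This is a sketch, not part of the library: `TextPageLifecycle` and `withTextPage` are hypothetical names, and the wrapper object is assumed to expose FPDFText_LoadPage and FPDFText_ClosePage as in the example on this page.

```typescript
// Assumed minimal shape of the wrapper object used in this sketch.
interface TextPageLifecycle {
  FPDFText_LoadPage(page: number): number;
  FPDFText_ClosePage(textPage: number): void;
}

// Hypothetical helper: loads a text page, runs a synchronous callback,
// and closes the text page even if the callback throws.
function withTextPage<T>(
  pdfium: TextPageLifecycle,
  pagePtr: number,
  use: (textPagePtr: number) => T
): T {
  const textPagePtr = pdfium.FPDFText_LoadPage(pagePtr);
  if (!textPagePtr) {
    throw new Error('FPDFText_LoadPage returned 0');
  }
  try {
    return use(textPagePtr);
  } finally {
    pdfium.FPDFText_ClosePage(textPagePtr);
  }
}
```

The callback must be synchronous. With an async callback the finally block would run before the returned promise settles, closing the text page while it is still in use.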

Common Issues

No text found: Some PDF pages might not contain any text content (they could be images or scanned documents), resulting in a text page with no characters.
Memory leaks: Forgetting to call FPDFText_ClosePage when done with a text page will lead to memory leaks.
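Invalid page handle: A null (0) page handle, for example one left over from a failed FPDF_LoadPage call, makes FPDFText_LoadPage return 0. A page handle that has already been closed is not safe to pass at all.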
Trying to extract text from non-text elements: PDFium can only extract text that is actually present as text in the PDF, not text embedded in images or drawings.
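One way to detect the "no text" case up front is to check the character count before attempting extraction. The sketch below assumes a wrapper object exposing FPDFText_LoadPage, FPDFText_CountChars, and FPDFText_ClosePage as in the example on this page; `TextProbeApi` and `pageHasExtractableText` are hypothetical names.

```typescript
// Assumed minimal shape of the wrapper object used in this sketch.
interface TextProbeApi {
  FPDFText_LoadPage(page: number): number;
  FPDFText_CountChars(textPage: number): number;
  FPDFText_ClosePage(textPage: number): void;
}

// Hypothetical helper: true if the page exposes at least one character.
function pageHasExtractableText(pdfium: TextProbeApi, pagePtr: number): boolean {
  const textPagePtr = pdfium.FPDFText_LoadPage(pagePtr);
  if (!textPagePtr) {
    throw new Error('FPDFText_LoadPage returned 0');
  }
  try {
    // 0 means no text; a negative count signals an error. Both mean "nothing to extract".
    return pdfium.FPDFText_CountChars(textPagePtr) > 0;
  } finally {
    pdfium.FPDFText_ClosePage(textPagePtr);
  }
}
```

A false result often points to a scanned or image-only page, though it cannot confirm that on its own. Recovering text from such pages requires OCR outside PDFium.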

Related Functions

FPDF_LoadPage - Load a PDF page
FPDFText_ClosePage - Close a text page
FPDFText_CountChars - Get the number of characters on a page
FPDFText_GetText - Extract text from a page
FPDF_ClosePage - Close a PDF page