FPDFText_GetText

FPDFText_GetText(text_page, start_index, count, result_buffer)

Description

Extracts text from a PDF page into a caller-allocated buffer. Call this function after creating a text page object with FPDFText_LoadPage. The text is written as UTF-16LE (2 bytes per code unit) and is followed by a null terminator.

Prerequisites

The examples below use the initializePdfium helper function from our Getting Started guide. Include that function in your code before running them.

Parameters

Name	Type	Description
text_page	number	A text page handle obtained from `FPDFText_LoadPage`.
start_index	number	The 0-based index of the first character to extract.
count	number	The number of characters to extract.
result_buffer	number	Pointer to a buffer to receive the text. Must be allocated with enough space to hold the requested text plus a null terminator.
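
The size requirement for result_buffer can be written down as a small helper. This is a minimal sketch: textBufferSize is a hypothetical name introduced here for illustration and is not part of PDFium or the wrapper.

```typescript
// Hypothetical helper (not a PDFium API): number of bytes needed for a
// result buffer that holds `count` UTF-16 code units plus the 2-byte
// null terminator.
function textBufferSize(count: number): number {
  if (!Number.isInteger(count) || count < 0) {
    throw new RangeError(`count must be a non-negative integer, got ${count}`);
  }
  return (count + 1) * 2;
}
```

The result would be passed to pdfium.pdfium.wasmExports.malloc, as the examples on this page do inline.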

Return Value

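Returns the number of characters written to the buffer, including the terminating null character, or 0 on error. Cases that return 0 include:

The text_page handle is invalid
The start_index is negative or not less than the page's character count
The count is negative
The result_buffer pointer is null

A count that extends past the end of the page is not an error in upstream PDFium: it is clamped to the characters available, so the return value can be less than count + 1. The function takes no buffer-size argument, so it cannot detect a buffer that is too small.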
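
Because the return value counts the null terminator, the text itself is one code unit shorter than the number returned. The sketch below decodes a copy of the buffer using that count. decodeUtf16Result is a hypothetical helper written for this page, not a wrapper API.

```typescript
// Hypothetical helper: decode the text that FPDFText_GetText wrote.
// `bytes` is a view over the result buffer and `written` is the function's
// return value, which includes the terminating null.
function decodeUtf16Result(bytes: Uint8Array, written: number): string {
  if (written <= 0) {
    return ''; // 0 signals an error; nothing to decode
  }
  const units = written - 1; // drop the null terminator
  if (units * 2 > bytes.byteLength) {
    throw new RangeError('written count exceeds the buffer size');
  }
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  let text = '';
  for (let i = 0; i < units; i++) {
    // true = little-endian, matching UTF-16LE
    text += String.fromCharCode(view.getUint16(i * 2, true));
  }
  return text;
}
```

With the wrapper used on this page, the bytes could be obtained with pdfium.pdfium.HEAPU8.subarray(textBufferPtr, textBufferPtr + bufferSize) before the buffer is freed.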

Example


// Note: The initializePdfium function is a helper that initializes the PDFium library.
// For the full implementation, see: /docs/pdfium/getting-started
import { initializePdfium } from './initialize-pdfium';
 
async function extractTextFromPage(pdfData: Uint8Array, pageIndex: number): Promise<string> {
  // Initialize PDFium
  const pdfium = await initializePdfium();
  
  // Load the PDF document
  const filePtr = pdfium.pdfium.wasmExports.malloc(pdfData.length);
  pdfium.pdfium.HEAPU8.set(pdfData, filePtr);
  const docPtr = pdfium.FPDF_LoadMemDocument(filePtr, pdfData.length, 0);
  
  if (!docPtr) {
    const error = pdfium.FPDF_GetLastError();
    pdfium.pdfium.wasmExports.free(filePtr);
    throw new Error(`Failed to load PDF: ${error}`);
  }
  
  try {
    // Load the PDF page
    const pagePtr = pdfium.FPDF_LoadPage(docPtr, pageIndex);
    if (!pagePtr) {
      throw new Error(`Failed to load page ${pageIndex}`);
    }
    
    try {
      // Create a text page object
      const textPagePtr = pdfium.FPDFText_LoadPage(pagePtr);
      if (!textPagePtr) {
        throw new Error(`Failed to load text for page ${pageIndex}`);
      }
      
      try {
        // Get the character count
        const charCount = pdfium.FPDFText_CountChars(textPagePtr);
        if (charCount <= 0) {
          return ''; // No text on this page or error
        }
        
        // Allocate a buffer for the text (+1 for null terminator)
        const bufferSize = (charCount + 1) * 2; // UTF-16, 2 bytes per character
        const textBufferPtr = pdfium.pdfium.wasmExports.malloc(bufferSize);
        
        try {
          // Extract all text from the page
          const extractedLength = pdfium.FPDFText_GetText(
            textPagePtr,
            0,          // Start from first character
            charCount,  // Get all characters
            textBufferPtr
          );
          
          if (extractedLength === 0) {
            throw new Error('Failed to extract text from page');
          }
          
          // Convert the UTF-16LE text to a JavaScript string
          return pdfium.pdfium.UTF16ToString(textBufferPtr);
        } finally {
          // Clean up text buffer
          pdfium.pdfium.wasmExports.free(textBufferPtr);
        }
      } finally {
        // Clean up text page
        pdfium.FPDFText_ClosePage(textPagePtr);
      }
    } finally {
      // Clean up PDF page
      pdfium.FPDF_ClosePage(pagePtr);
    }
  } finally {
    // Clean up document
    pdfium.FPDF_CloseDocument(docPtr);
    pdfium.pdfium.wasmExports.free(filePtr);
  }
}
 
// Usage
fetch('sample.pdf')
  .then(response => response.arrayBuffer())
  .then(buffer => extractTextFromPage(new Uint8Array(buffer), 0))
  .then(text => console.log('Extracted text:', text))
  .catch(error => console.error('Error:', error));

Usage Examples

Extracting a specific range of text


// Extract text from characters 10 to 20 (11 characters)
const startIndex = 10;
const count = 11;
const bufferSize = (count + 1) * 2; // +1 for null terminator, *2 for UTF-16
const textBufferPtr = pdfium.pdfium.wasmExports.malloc(bufferSize);
 
try {
  const extractedLength = pdfium.FPDFText_GetText(
    textPagePtr,
    startIndex,
    count,
    textBufferPtr
  );
  
  if (extractedLength > 0) {
    const text = pdfium.pdfium.UTF16ToString(textBufferPtr);
    console.log(`Extracted snippet: "${text}"`);
  }
} finally {
  pdfium.pdfium.wasmExports.free(textBufferPtr);
}
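
The same range-based call can be used to read a long page in fixed-size pieces instead of one large allocation. The sketch below only computes the (start_index, count) pairs. chunkRanges and TextRange are hypothetical names introduced for illustration.

```typescript
// Hypothetical helper: split a page of `charCount` characters into
// consecutive ranges of at most `chunkSize` characters each.
interface TextRange {
  startIndex: number;
  count: number;
}

function chunkRanges(charCount: number, chunkSize: number): TextRange[] {
  if (!Number.isInteger(chunkSize) || chunkSize <= 0) {
    throw new RangeError('chunkSize must be a positive integer');
  }
  const ranges: TextRange[] = [];
  for (let startIndex = 0; startIndex < charCount; startIndex += chunkSize) {
    // The last range is shortened so it never runs past the end of the page.
    ranges.push({ startIndex, count: Math.min(chunkSize, charCount - startIndex) });
  }
  return ranges;
}
```

Each range would then be passed to FPDFText_GetText with a buffer of (count + 1) * 2 bytes. A chunk boundary can fall in the middle of a surrogate pair, so concatenate the pieces before processing the text further.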

Best Practices

Proper buffer sizing: Allocate (count + 1) * 2 bytes. PDFium writes UTF-16LE, which takes 2 bytes per code unit, and appends a 2-byte null terminator.
Check return value: Always check if FPDFText_GetText returns 0, which indicates an error.
Get character count first: Before calling FPDFText_GetText, use FPDFText_CountChars to find out how many characters are available on the page.
Free buffer memory: Always free the memory allocated for the text buffer when you’re done with it.
Handle empty text gracefully: Some pages contain no text, so treat a character count of 0 as an empty string rather than an error.
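
The practices above can be combined into one helper. The sketch below is written against a small interface so it can run without the library. TextApi and its member names are assumptions made for this example, not the wrapper's real type definitions.

```typescript
// Hypothetical minimal interface; member names are assumptions for this sketch.
interface TextApi {
  heap: Uint8Array;                      // view over WASM memory
  malloc(size: number): number;
  free(ptr: number): void;
  countChars(textPage: number): number;  // stands in for FPDFText_CountChars
  getText(textPage: number, start: number, count: number, buffer: number): number;
}

function readText(api: TextApi, textPage: number, start = 0, count = Infinity): string {
  // Get the character count first and handle empty pages.
  const total = api.countChars(textPage);
  if (total <= 0 || start < 0 || start >= total) {
    return '';
  }
  const n = Math.min(count, total - start);
  if (n <= 0) {
    return '';
  }
  // Size the buffer for n code units plus the null terminator.
  const size = (n + 1) * 2;
  const ptr = api.malloc(size);
  if (!ptr) {
    throw new Error('Failed to allocate text buffer');
  }
  try {
    // Check the return value; 0 indicates an error.
    const written = api.getText(textPage, start, n, ptr);
    if (written <= 0) {
      throw new Error('FPDFText_GetText returned 0');
    }
    const view = new DataView(api.heap.buffer, api.heap.byteOffset + ptr, size);
    let text = '';
    for (let i = 0; i < written - 1; i++) {
      text += String.fromCharCode(view.getUint16(i * 2, true));
    }
    return text;
  } finally {
    // Free the buffer on every path.
    api.free(ptr);
  }
}
```

In real code the interface members would map to pdfium.pdfium.HEAPU8, pdfium.pdfium.wasmExports.malloc and free, pdfium.FPDFText_CountChars, and pdfium.FPDFText_GetText.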

Common Issues

Buffer overflow: FPDFText_GetText takes no buffer-size argument, so it cannot detect a buffer that is too small. It writes past the end of the allocation and corrupts memory.
Invalid range: If start_index is negative or not less than the number of characters on the page, the function returns 0. A count that runs past the end of the page is clamped in upstream PDFium rather than rejected.
Character encoding: The text is returned in UTF-16LE encoding, not UTF-8. Convert it if your application expects UTF-8.
Memory leaks: A buffer that is not freed on every path, including error paths, leaks WASM heap memory. Wrap the call in try/finally as the examples on this page do.
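
For the character encoding issue, one possible conversion uses the standard TextDecoder and TextEncoder classes. This is a sketch that assumes both are available in the runtime, as they are in browsers and recent Node.js versions. utf16leToUtf8 is a hypothetical helper name.

```typescript
// Hypothetical helper: convert UTF-16LE bytes (without the null terminator)
// to UTF-8 bytes by decoding to a string and re-encoding it.
function utf16leToUtf8(bytes: Uint8Array): Uint8Array {
  const text = new TextDecoder('utf-16le').decode(bytes);
  return new TextEncoder().encode(text); // TextEncoder always emits UTF-8
}
```

If the text is only used as a JavaScript string, no conversion is needed: the string returned by UTF16ToString can be used directly.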

Related Functions

FPDFText_LoadPage - Load a page for text extraction
FPDFText_CountChars - Get the number of characters on a page
FPDFText_ClosePage - Close a text page and release resources