FPDFText_GetText

FPDFText_GetText(text_page, start_index, count, result_buffer)

Description

Extracts a range of text from a PDF page into a caller-allocated buffer. Call it after creating a text page object with FPDFText_LoadPage. The text is written as UTF-16LE (2 bytes per UTF-16 code unit) and is followed by a null terminator.
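To make the encoding concrete, here is a minimal sketch of decoding UTF-16LE bytes by hand. The name `decodeUtf16le` is illustrative and not part of PDFium; in practice the `UTF16ToString` helper used in the example on this page does this job.

```typescript
// Hypothetical helper (not part of PDFium): decodes UTF-16LE bytes into a
// JavaScript string, stopping at the first null terminator if one is present.
function decodeUtf16le(bytes: Uint8Array): string {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  let result = '';
  // Each UTF-16 code unit occupies 2 bytes, least significant byte first.
  for (let offset = 0; offset + 1 < bytes.byteLength; offset += 2) {
    const codeUnit = view.getUint16(offset, true); // true = little-endian
    if (codeUnit === 0) {
      break; // null terminator
    }
    result += String.fromCharCode(codeUnit);
  }
  return result;
}

// "Hi" followed by a null terminator: 3 code units, 6 bytes.
const sampleUtf16 = new Uint8Array([0x48, 0x00, 0x69, 0x00, 0x00, 0x00]);
console.log(decodeUtf16le(sampleUtf16));
```

Characters outside the Basic Multilingual Plane take two code units (a surrogate pair), so "2 bytes per character" is best read as "2 bytes per code unit".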

Prerequisites

The examples on this page use the initializePdfium helper function from our Getting Started guide. Include that helper in your code before running them.

Parameters

| Name | Type | Description |
| --- | --- | --- |
| text_page | number | A text page handle obtained from FPDFText_LoadPage. |
| start_index | number | The 0-based index of the first character to extract. |
| count | number | The number of characters to extract. |
| result_buffer | number | Pointer to a buffer to receive the text. Must be allocated with enough space to hold the requested text plus a null terminator. |
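The sizing rule for result_buffer can be written down as a small helper. This is a sketch only; `textBufferByteLength` is a hypothetical name, not a PDFium function.

```typescript
const BYTES_PER_UTF16_CODE_UNIT = 2;

// Hypothetical helper: number of bytes to allocate for `count` characters
// plus the null terminator that FPDFText_GetText appends.
function textBufferByteLength(count: number): number {
  if (!Number.isInteger(count) || count < 0) {
    throw new RangeError(`count must be a non-negative integer, got ${count}`);
  }
  return (count + 1) * BYTES_PER_UTF16_CODE_UNIT;
}

console.log(textBufferByteLength(11)); // 24
```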

Return Value

Returns the number of characters written to the buffer (including the terminating null character), or 0 on error. Common error cases include:

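  • The text_page handle is invalid
  • The start_index is negative or past the last character on the page
  • The count is negative
  • The result_buffer pointer is null

Note that the function takes no buffer-size argument, so it cannot detect a buffer that is too small; an undersized buffer leads to an out-of-bounds write rather than a 0 return. In current PDFium sources, a count that extends beyond the available text appears to be clamped to what remains instead of being rejected.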
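Because the return value counts the null terminator, the usable text is one unit shorter than the number returned. The sketch below shows one way to interpret the result; `textFromResult` is a hypothetical helper, and it assumes the code units have already been copied out of the WASM heap into a `Uint16Array`.

```typescript
// Hypothetical helper, not part of PDFium. `written` is the value returned by
// FPDFText_GetText; `units` holds the UTF-16 code units read from the buffer.
function textFromResult(written: number, units: Uint16Array): string | null {
  if (written <= 0) {
    return null; // 0 signals an error
  }
  // The return value includes the null terminator, so drop one unit.
  const length = Math.min(written - 1, units.length);
  let text = '';
  for (let i = 0; i < length; i++) {
    text += String.fromCharCode(units[i]);
  }
  return text;
}

console.log(textFromResult(3, Uint16Array.of(0x48, 0x69, 0))); // "Hi"
console.log(textFromResult(0, new Uint16Array(4))); // null
```

Returning `null` for the error case keeps it distinct from a legitimately empty string, which corresponds to a return value of 1.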

Example

```typescript
// Note: The initializePdfium function is a helper that initializes the PDFium library.
// For the full implementation, see: /docs/pdfium/getting-started
import { initializePdfium } from './initialize-pdfium';

async function extractTextFromPage(pdfData: Uint8Array, pageIndex: number): Promise<string> {
  // Initialize PDFium
  const pdfium = await initializePdfium();

  // Load the PDF document
  const filePtr = pdfium.pdfium.wasmExports.malloc(pdfData.length);
  pdfium.pdfium.HEAPU8.set(pdfData, filePtr);
  const docPtr = pdfium.FPDF_LoadMemDocument(filePtr, pdfData.length, 0);

  if (!docPtr) {
    const error = pdfium.FPDF_GetLastError();
    pdfium.pdfium.wasmExports.free(filePtr);
    throw new Error(`Failed to load PDF: ${error}`);
  }

  try {
    // Load the PDF page
    const pagePtr = pdfium.FPDF_LoadPage(docPtr, pageIndex);
    if (!pagePtr) {
      throw new Error(`Failed to load page ${pageIndex}`);
    }

    try {
      // Create a text page object
      const textPagePtr = pdfium.FPDFText_LoadPage(pagePtr);
      if (!textPagePtr) {
        throw new Error(`Failed to load text for page ${pageIndex}`);
      }

      try {
        // Get the character count
        const charCount = pdfium.FPDFText_CountChars(textPagePtr);
        if (charCount <= 0) {
          return ''; // No text on this page or error
        }

        // Allocate a buffer for the text (+1 for null terminator)
        const bufferSize = (charCount + 1) * 2; // UTF-16, 2 bytes per character
        const textBufferPtr = pdfium.pdfium.wasmExports.malloc(bufferSize);

        try {
          // Extract all text from the page
          const extractedLength = pdfium.FPDFText_GetText(
            textPagePtr,
            0, // Start from first character
            charCount, // Get all characters
            textBufferPtr
          );

          if (extractedLength === 0) {
            throw new Error('Failed to extract text from page');
          }

          // Convert the UTF-16LE text to a JavaScript string
          return pdfium.pdfium.UTF16ToString(textBufferPtr);
        } finally {
          // Clean up text buffer
          pdfium.pdfium.wasmExports.free(textBufferPtr);
        }
      } finally {
        // Clean up text page
        pdfium.FPDFText_ClosePage(textPagePtr);
      }
    } finally {
      // Clean up PDF page
      pdfium.FPDF_ClosePage(pagePtr);
    }
  } finally {
    // Clean up document
    pdfium.FPDF_CloseDocument(docPtr);
    pdfium.pdfium.wasmExports.free(filePtr);
  }
}

// Usage
fetch('sample.pdf')
  .then(response => response.arrayBuffer())
  .then(buffer => extractTextFromPage(new Uint8Array(buffer), 0))
  .then(text => console.log('Extracted text:', text))
  .catch(error => console.error('Error:', error));
```

Usage Examples

Extracting a specific range of text

```typescript
// Extract text from characters 10 to 20 (11 characters)
const startIndex = 10;
const count = 11;
const bufferSize = (count + 1) * 2; // +1 for null terminator, *2 for UTF-16
const textBufferPtr = pdfium.pdfium.wasmExports.malloc(bufferSize);

try {
  const extractedLength = pdfium.FPDFText_GetText(
    textPagePtr,
    startIndex,
    count,
    textBufferPtr
  );

  if (extractedLength > 0) {
    const text = pdfium.pdfium.UTF16ToString(textBufferPtr);
    console.log(`Extracted snippet: "${text}"`);
  }
} finally {
  pdfium.pdfium.wasmExports.free(textBufferPtr);
}
```
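A fixed range such as characters 10 to 20 may not exist on a short page, so it can help to check the request against the value returned by FPDFText_CountChars before allocating. The sketch below is one possible approach; `clampCharRange` and `CharRange` are hypothetical names, and clamping (rather than rejecting) an overlong count is a design choice made for this example.

```typescript
interface CharRange {
  startIndex: number;
  count: number;
}

// Hypothetical helper: returns a range that fits within `totalChars`
// (the FPDFText_CountChars result), or null if nothing can be extracted.
function clampCharRange(
  startIndex: number,
  count: number,
  totalChars: number
): CharRange | null {
  if (startIndex < 0 || count < 0 || startIndex >= totalChars) {
    return null;
  }
  // Shrink the count so the range ends at the last character on the page.
  return { startIndex, count: Math.min(count, totalChars - startIndex) };
}

console.log(clampCharRange(10, 11, 15)); // { startIndex: 10, count: 5 }
console.log(clampCharRange(20, 5, 15)); // null
```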

Best Practices

  1. Proper buffer sizing: Always allocate a buffer that’s large enough to hold the requested text plus a null terminator. PDFium writes the text as UTF-16LE at 2 bytes per character, so allocate (count + 1) * 2 bytes.

  2. Check return value: Always check if FPDFText_GetText returns 0, which indicates an error.

  3. Get character count first: Before calling FPDFText_GetText, use FPDFText_CountChars to find out how many characters are available on the page.

  4. Free buffer memory: Always free the memory allocated for the text buffer when you’re done with it.

  5. Handle empty text gracefully: Some pages might not contain any text, so your code should handle empty strings gracefully.
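The allocate, use, free pattern from these practices can be wrapped in one helper so that the buffer is released even when extraction throws. This is a sketch under assumptions: `WasmAllocator` is a hypothetical interface modeled on the `malloc` and `free` calls used in the example on this page, and `withBuffer` is not part of any library. A tiny fake allocator stands in for the WASM module so the snippet runs on its own.

```typescript
// Hypothetical interface modeled on the malloc/free pair used on this page.
interface WasmAllocator {
  malloc(size: number): number;
  free(ptr: number): void;
}

// Allocates `byteLength` bytes, passes the pointer to `use`, and always
// frees the allocation afterwards, whether `use` returns or throws.
function withBuffer<T>(
  allocator: WasmAllocator,
  byteLength: number,
  use: (ptr: number) => T
): T {
  const ptr = allocator.malloc(byteLength);
  if (ptr === 0) {
    throw new Error(`Failed to allocate ${byteLength} bytes`);
  }
  try {
    return use(ptr);
  } finally {
    allocator.free(ptr);
  }
}

// Stand-in allocator for demonstration only; it hands out fake addresses.
const freedPointers: number[] = [];
const fakeAllocator: WasmAllocator = {
  malloc: (size) => (size > 0 ? 1024 : 0),
  free: (ptr) => {
    freedPointers.push(ptr);
  },
};

const demoResult = withBuffer(fakeAllocator, 24, (ptr) => `buffer at ${ptr}`);
console.log(demoResult, freedPointers);
```

With a real module, the callback would call FPDFText_GetText with the pointer and convert the buffer to a string before returning.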

Common Issues

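  1. Buffer overflow: FPDFText_GetText takes no buffer-size argument, so it cannot tell when the buffer is too small. If the buffer cannot hold the requested text plus a null terminator, the call writes past the end of the allocation and can corrupt memory.

  2. Invalid range: If start_index is negative or not less than the number of characters on the page, the function returns 0. A count that runs past the end of the page appears to be clamped in current PDFium sources, so the return value can be smaller than count + 1.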

  3. Character encoding: The text is returned in UTF-16LE encoding, not UTF-8. Make sure to convert it properly if your application expects UTF-8.

  4. Memory leaks: Always free the memory allocated for the text buffer when you’re done with it.
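For the encoding issue, once the text has been converted to a JavaScript string, the standard `TextEncoder` API produces UTF-8 bytes. The sketch below assumes a runtime where `TextEncoder` is available as a global (modern browsers and Node.js); `toUtf8Bytes` is a hypothetical name.

```typescript
// Hypothetical helper: encodes an already-decoded JavaScript string as UTF-8.
function toUtf8Bytes(text: string): Uint8Array {
  return new TextEncoder().encode(text);
}

// 'H\u00e9llo' is "Héllo": 5 UTF-16 code units, but 6 bytes in UTF-8
// because U+00E9 needs two bytes.
const utf8Bytes = toUtf8Bytes('H\u00e9llo');
console.log(utf8Bytes.length); // 6
```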

Last updated on March 31, 2025