FPDFText_LoadPage

FPDFText_LoadPage(page)

Description

Creates a text page object from a loaded PDF page. This function is the first step for any text extraction operations such as getting text content, searching for text, or retrieving character positions. The text page object provides access to the text content and structure of the PDF page.

Prerequisites

This example uses the initializePdfium helper function from our Getting Started guide. Make sure to include this function in your code before trying the examples below.

Parameters

NameTypeDescription
pagenumberA page handle obtained from FPDF_LoadPage.

Return Value

Returns a handle to the text page object on success, or 0 on failure. When a failure occurs, you can call FPDF_GetLastError() to retrieve an error code that provides more information about what went wrong.

Example

// Note: The initializePdfium function is a helper that initializes the PDFium library. // For the full implementation, see: /docs/pdfium/getting-started import { initializePdfium } from './initialize-pdfium'; async function extractTextFromPage(pdfData: Uint8Array, pageIndex: number): Promise<string> { // Initialize PDFium const pdfium = await initializePdfium(); // Load the PDF document const filePtr = pdfium.pdfium.wasmExports.malloc(pdfData.length); pdfium.pdfium.HEAPU8.set(pdfData, filePtr); const docPtr = pdfium.FPDF_LoadMemDocument(filePtr, pdfData.length, 0); if (!docPtr) { const error = pdfium.FPDF_GetLastError(); pdfium.pdfium.wasmExports.free(filePtr); throw new Error(`Failed to load PDF: ${error}`); } try { // Load the specified page const pagePtr = pdfium.FPDF_LoadPage(docPtr, pageIndex); if (!pagePtr) { throw new Error(`Failed to load page ${pageIndex}`); } try { // Create a text page object for text extraction const textPagePtr = pdfium.FPDFText_LoadPage(pagePtr); if (!textPagePtr) { throw new Error(`Failed to load text for page ${pageIndex}`); } try { // Get the character count const charCount = pdfium.FPDFText_CountChars(textPagePtr); if (charCount <= 0) { return ''; // No text on this page } // Allocate a buffer for the text (+1 for null terminator) const bufferSize = (charCount + 1) * 2; // UTF-16, 2 bytes per character const textBufferPtr = pdfium.pdfium.wasmExports.malloc(bufferSize); try { // Extract the text const extractedLength = pdfium.FPDFText_GetText( textPagePtr, 0, // Start index charCount, // Character count textBufferPtr ); // Convert the UTF-16LE text to a JavaScript string if (extractedLength > 0) { return pdfium.pdfium.UTF16ToString(textBufferPtr); } return ''; } finally { // Clean up text buffer pdfium.pdfium.wasmExports.free(textBufferPtr); } } finally { // Clean up text page pdfium.FPDFText_ClosePage(textPagePtr); } } finally { // Clean up page pdfium.FPDF_ClosePage(pagePtr); } } finally { // Clean up document pdfium.FPDF_CloseDocument(docPtr); pdfium.pdfium.wasmExports.free(filePtr); } } // Usage fetch('sample.pdf') .then(response => response.arrayBuffer()) .then(buffer => extractTextFromPage(new Uint8Array(buffer), 0)) .then(text => console.log('Extracted text:', text)) .catch(error => console.error('Error:', error));

Best Practices

  1. Always close text pages: When you’re done with a text page object, make sure to call FPDFText_ClosePage to release the resources associated with it. The best pattern is to use a try/finally block.

  2. One text page per PDF page: Create a new text page object for each PDF page you need to process. Text extraction objects aren’t designed to be reused across multiple pages.

  3. Check the return value: Always check if FPDFText_LoadPage returns a valid handle before proceeding with text operations.

  4. Load the page before text extraction: Remember that you need to load the PDF page with FPDF_LoadPage before you can create a text page from it.

  5. Error handling: If FPDFText_LoadPage returns 0, use FPDF_GetLastError() to determine what went wrong.

Common Issues

  1. No text found: Some PDF pages might not contain any text content (they could be images or scanned documents), resulting in a text page with no characters.

  2. Memory leaks: Forgetting to call FPDFText_ClosePage when done with a text page will lead to memory leaks.

  3. Invalid page handle: If you pass an invalid page handle to FPDFText_LoadPage, it will return 0.

  4. Trying to extract text from non-text elements: PDFium can only extract text that is actually present as text in the PDF, not text embedded in images or drawings.

Last updated on March 31, 2025