FPDFText_LoadPage
FPDFText_LoadPage(page)
Description
Creates a text page object from a loaded PDF page. This function is the first step for any text extraction operations such as getting text content, searching for text, or retrieving character positions. The text page object provides access to the text content and structure of the PDF page.
Prerequisites
This example uses the initializePdfium
helper function from our Getting Started guide. Make sure to include this function in your code before trying the examples below.
Parameters
Name | Type | Description |
---|---|---|
page | number | A page handle obtained from FPDF_LoadPage . |
Return Value
Returns a handle to the text page object on success, or 0 on failure. When a failure occurs, you can call FPDF_GetLastError()
to retrieve an error code that provides more information about what went wrong.
Example
// Note: The initializePdfium function is a helper that initializes the PDFium library.
// For the full implementation, see: /docs/pdfium/getting-started
import { initializePdfium } from './initialize-pdfium';
async function extractTextFromPage(pdfData: Uint8Array, pageIndex: number): Promise<string> {
// Initialize PDFium
const pdfium = await initializePdfium();
// Load the PDF document
const filePtr = pdfium.pdfium.wasmExports.malloc(pdfData.length);
pdfium.pdfium.HEAPU8.set(pdfData, filePtr);
const docPtr = pdfium.FPDF_LoadMemDocument(filePtr, pdfData.length, 0);
if (!docPtr) {
const error = pdfium.FPDF_GetLastError();
pdfium.pdfium.wasmExports.free(filePtr);
throw new Error(`Failed to load PDF: ${error}`);
}
try {
// Load the specified page
const pagePtr = pdfium.FPDF_LoadPage(docPtr, pageIndex);
if (!pagePtr) {
throw new Error(`Failed to load page ${pageIndex}`);
}
try {
// Create a text page object for text extraction
const textPagePtr = pdfium.FPDFText_LoadPage(pagePtr);
if (!textPagePtr) {
throw new Error(`Failed to load text for page ${pageIndex}`);
}
try {
// Get the character count
const charCount = pdfium.FPDFText_CountChars(textPagePtr);
if (charCount <= 0) {
return ''; // No text on this page
}
// Allocate a buffer for the text (+1 for null terminator)
const bufferSize = (charCount + 1) * 2; // UTF-16, 2 bytes per character
const textBufferPtr = pdfium.pdfium.wasmExports.malloc(bufferSize);
try {
// Extract the text
const extractedLength = pdfium.FPDFText_GetText(
textPagePtr,
0, // Start index
charCount, // Character count
textBufferPtr
);
// Convert the UTF-16LE text to a JavaScript string
if (extractedLength > 0) {
return pdfium.pdfium.UTF16ToString(textBufferPtr);
}
return '';
} finally {
// Clean up text buffer
pdfium.pdfium.wasmExports.free(textBufferPtr);
}
} finally {
// Clean up text page
pdfium.FPDFText_ClosePage(textPagePtr);
}
} finally {
// Clean up page
pdfium.FPDF_ClosePage(pagePtr);
}
} finally {
// Clean up document
pdfium.FPDF_CloseDocument(docPtr);
pdfium.pdfium.wasmExports.free(filePtr);
}
}
// Usage
fetch('sample.pdf')
.then(response => response.arrayBuffer())
.then(buffer => extractTextFromPage(new Uint8Array(buffer), 0))
.then(text => console.log('Extracted text:', text))
.catch(error => console.error('Error:', error));
Best Practices
-
Always close text pages: When you’re done with a text page object, make sure to call
FPDFText_ClosePage
to release the resources associated with it. The best pattern is to use a try/finally block. -
One text page per PDF page: Create a new text page object for each PDF page you need to process. Text extraction objects aren’t designed to be reused across multiple pages.
-
Check the return value: Always check if
FPDFText_LoadPage
returns a valid handle before proceeding with text operations. -
Load the page before text extraction: Remember that you need to load the PDF page with
FPDF_LoadPage
before you can create a text page from it. -
Error handling: If
FPDFText_LoadPage
returns 0, useFPDF_GetLastError()
to determine what went wrong.
Common Issues
-
No text found: Some PDF pages might not contain any text content (they could be images or scanned documents), resulting in a text page with no characters.
-
Memory leaks: Forgetting to call
FPDFText_ClosePage
when done with a text page will lead to memory leaks. -
Invalid page handle: If you pass an invalid page handle to
FPDFText_LoadPage
, it will return 0. -
Trying to extract text from non-text elements: PDFium can only extract text that is actually present as text in the PDF, not text embedded in images or drawings.
Related Functions
- FPDF_LoadPage - Load a PDF page
- FPDFText_ClosePage - Close a text page
- FPDFText_CountChars - Get the number of characters on a page
- FPDFText_GetText - Extract text from a page
- FPDF_ClosePage - Close a PDF page