FPDFText_GetText
FPDFText_GetText(text_page, start_index, count, result_buffer)
Description
Extracts text from a PDF page into a buffer. This function is used to get the actual text content from a page after creating a text page object with FPDFText_LoadPage
. The extracted text is encoded in UTF-16LE format (2 bytes per character).
Prerequisites
This example uses the initializePdfium
helper function from our Getting Started guide. Make sure to include this function in your code before trying the examples below.
Parameters
Name | Type | Description |
---|---|---|
text_page | number | A text page handle obtained from FPDFText_LoadPage . |
start_index | number | The 0-based index of the first character to extract. |
count | number | The number of characters to extract. |
result_buffer | number | Pointer to a buffer to receive the text. Must be allocated with enough space to hold the requested text plus a null terminator. |
Return Value
Returns the number of characters written to the buffer (including the terminating null character), or 0 on error. Common error cases include:
- The text_page handle is invalid
- The start_index is out of range
- The count is negative or extends beyond the available text
- The result_buffer is invalid or not large enough
Example
// Note: The initializePdfium function is a helper that initializes the PDFium library.
// For the full implementation, see: /docs/pdfium/getting-started
import { initializePdfium } from './initialize-pdfium';
async function extractTextFromPage(pdfData: Uint8Array, pageIndex: number): Promise<string> {
// Initialize PDFium
const pdfium = await initializePdfium();
// Load the PDF document
const filePtr = pdfium.pdfium.wasmExports.malloc(pdfData.length);
pdfium.pdfium.HEAPU8.set(pdfData, filePtr);
const docPtr = pdfium.FPDF_LoadMemDocument(filePtr, pdfData.length, 0);
if (!docPtr) {
const error = pdfium.FPDF_GetLastError();
pdfium.pdfium.wasmExports.free(filePtr);
throw new Error(`Failed to load PDF: ${error}`);
}
try {
// Load the PDF page
const pagePtr = pdfium.FPDF_LoadPage(docPtr, pageIndex);
if (!pagePtr) {
throw new Error(`Failed to load page ${pageIndex}`);
}
try {
// Create a text page object
const textPagePtr = pdfium.FPDFText_LoadPage(pagePtr);
if (!textPagePtr) {
throw new Error(`Failed to load text for page ${pageIndex}`);
}
try {
// Get the character count
const charCount = pdfium.FPDFText_CountChars(textPagePtr);
if (charCount <= 0) {
return ''; // No text on this page or error
}
// Allocate a buffer for the text (+1 for null terminator)
const bufferSize = (charCount + 1) * 2; // UTF-16, 2 bytes per character
const textBufferPtr = pdfium.pdfium.wasmExports.malloc(bufferSize);
try {
// Extract all text from the page
const extractedLength = pdfium.FPDFText_GetText(
textPagePtr,
0, // Start from first character
charCount, // Get all characters
textBufferPtr
);
if (extractedLength === 0) {
throw new Error('Failed to extract text from page');
}
// Convert the UTF-16LE text to a JavaScript string
return pdfium.pdfium.UTF16ToString(textBufferPtr);
} finally {
// Clean up text buffer
pdfium.pdfium.wasmExports.free(textBufferPtr);
}
} finally {
// Clean up text page
pdfium.FPDFText_ClosePage(textPagePtr);
}
} finally {
// Clean up PDF page
pdfium.FPDF_ClosePage(pagePtr);
}
} finally {
// Clean up document
pdfium.FPDF_CloseDocument(docPtr);
pdfium.pdfium.wasmExports.free(filePtr);
}
}
// Usage
fetch('sample.pdf')
.then(response => response.arrayBuffer())
.then(buffer => extractTextFromPage(new Uint8Array(buffer), 0))
.then(text => console.log('Extracted text:', text))
.catch(error => console.error('Error:', error));
Usage Examples
Extracting a specific range of text
// Extract text from characters 10 to 20 (11 characters)
const startIndex = 10;
const count = 11;
const bufferSize = (count + 1) * 2; // +1 for null terminator, *2 for UTF-16
const textBufferPtr = pdfium.pdfium.wasmExports.malloc(bufferSize);
try {
const extractedLength = pdfium.FPDFText_GetText(
textPagePtr,
startIndex,
count,
textBufferPtr
);
if (extractedLength > 0) {
const text = pdfium.pdfium.UTF16ToString(textBufferPtr);
console.log(`Extracted snippet: "${text}"`);
}
} finally {
pdfium.pdfium.wasmExports.free(textBufferPtr);
}
Best Practices
-
Proper buffer sizing: Always allocate a buffer that’s large enough to hold the requested text plus a null terminator. Each character in PDFium is UTF-16LE encoded, requiring 2 bytes per character.
-
Check return value: Always check if
FPDFText_GetText
returns 0, which indicates an error. -
Get character count first: Before calling
FPDFText_GetText
, useFPDFText_CountChars
to find out how many characters are available on the page. -
Free buffer memory: Always free the memory allocated for the text buffer when you’re done with it.
-
Handle empty text gracefully: Some pages might not contain any text, so your code should handle empty strings gracefully.
Common Issues
-
Buffer overflow: If the buffer is not large enough to hold the requested text plus a null terminator,
FPDFText_GetText
may fail or cause memory corruption. -
Invalid range: If
start_index
is negative orstart_index + count
exceeds the number of characters on the page, the function will fail. -
Character encoding: The text is returned in UTF-16LE encoding, not UTF-8. Make sure to convert it properly if your application expects UTF-8.
-
Memory leaks: Always free the memory allocated for the text buffer when you’re done with it.
Related Functions
- FPDFText_LoadPage - Load a page for text extraction
- FPDFText_CountChars - Get the number of characters on a page
- FPDFText_ClosePage - Close a text page and release resources