FPDFText_CountChars
FPDFText_CountChars(text_page)
Description
Gets the number of characters in a PDF page. This function is typically called before allocating a buffer for text extraction, as it tells you how much space you’ll need to accommodate all the text on the page.
Prerequisites
This example uses the initializePdfium
helper function from our Getting Started guide. Make sure to include this function in your code before trying the examples below.
Parameters
Name | Type | Description |
---|---|---|
text_page | number | A text page handle obtained from FPDFText_LoadPage . |
Return Value
Returns the number of characters on the page (including whitespace characters), or -1 on error. Some common error cases include:
- The text_page handle is invalid
- The PDF page contains no text content
A return value of 0 indicates that the page exists but contains no text characters.
Example
// Note: The initializePdfium function is a helper that initializes the PDFium library.
// For the full implementation, see: /docs/pdfium/getting-started
import { initializePdfium } from './initialize-pdfium';
async function getPageCharCount(pdfData: Uint8Array, pageIndex: number): Promise<number> {
// Initialize PDFium
const pdfium = await initializePdfium();
// Load the PDF document
const filePtr = pdfium.pdfium.wasmExports.malloc(pdfData.length);
pdfium.pdfium.HEAPU8.set(pdfData, filePtr);
const docPtr = pdfium.FPDF_LoadMemDocument(filePtr, pdfData.length, 0);
if (!docPtr) {
const error = pdfium.FPDF_GetLastError();
pdfium.pdfium.wasmExports.free(filePtr);
throw new Error(`Failed to load PDF: ${error}`);
}
try {
// Check if the page index is valid
const pageCount = pdfium.FPDF_GetPageCount(docPtr);
if (pageIndex < 0 || pageIndex >= pageCount) {
throw new Error(`Invalid page index: ${pageIndex}. Document has ${pageCount} pages.`);
}
// Load the PDF page
const pagePtr = pdfium.FPDF_LoadPage(docPtr, pageIndex);
if (!pagePtr) {
throw new Error(`Failed to load page ${pageIndex}`);
}
try {
// Create a text page object
const textPagePtr = pdfium.FPDFText_LoadPage(pagePtr);
if (!textPagePtr) {
throw new Error(`Failed to load text for page ${pageIndex}`);
}
try {
// Get the character count
const charCount = pdfium.FPDFText_CountChars(textPagePtr);
if (charCount === -1) {
throw new Error(`Error getting character count for page ${pageIndex}`);
}
return charCount;
} finally {
// Clean up text page
pdfium.FPDFText_ClosePage(textPagePtr);
}
} finally {
// Clean up PDF page
pdfium.FPDF_ClosePage(pagePtr);
}
} finally {
// Clean up document
pdfium.FPDF_CloseDocument(docPtr);
pdfium.pdfium.wasmExports.free(filePtr);
}
}
// Usage
fetch('sample.pdf')
.then(response => response.arrayBuffer())
.then(buffer => getPageCharCount(new Uint8Array(buffer), 0))
.then(count => {
if (count === 0) {
console.log('The page exists but contains no text.');
} else {
console.log(`The page contains ${count} characters.`);
}
})
.catch(error => console.error('Error:', error));
Usage Examples
Allocating a buffer for text extraction
// Get the character count
const charCount = pdfium.FPDFText_CountChars(textPagePtr);
if (charCount <= 0) {
return ''; // No text or error
}
// Allocate a buffer for the text (+1 for null terminator)
const bufferSize = (charCount + 1) * 2; // UTF-16, 2 bytes per character
const textBufferPtr = pdfium.pdfium.wasmExports.malloc(bufferSize);
try {
// Extract text into the buffer
pdfium.FPDFText_GetText(textPagePtr, 0, charCount, textBufferPtr);
// ...
} finally {
// Clean up buffer
pdfium.pdfium.wasmExports.free(textBufferPtr);
}
Determining if a page contains text
function pageHasText(pdfium, textPagePtr) {
const charCount = pdfium.FPDFText_CountChars(textPagePtr);
return charCount > 0;
}
Best Practices
-
Check for errors: Always check if
FPDFText_CountChars
returns -1, which indicates an error. -
Handle empty pages: A return value of 0 is valid and means the page exists but contains no text. Your code should handle this case gracefully.
-
Buffer allocation: When allocating memory for text extraction, always add 1 to the character count to accommodate the null terminator required by PDFium.
-
Use for capacity planning: This function is useful for determining how much memory to allocate for text extraction, allowing you to avoid buffer overflows.
Common Issues
-
Invalid text page handle: If you pass an invalid text page handle to
FPDFText_CountChars
, it will return -1. -
Confusing character count with byte length: Remember that PDFium uses UTF-16LE encoding for text, so each character typically requires 2 bytes of storage. When allocating memory for text extraction, multiply the character count by 2.
-
Scanned documents: PDFs created from scanned images won’t have text content unless OCR has been applied.
FPDFText_CountChars
will return 0 for these pages.
Related Functions
- FPDFText_LoadPage - Load a page for text extraction
- FPDFText_GetText - Extract text from a page
- FPDFText_ClosePage - Close a text page and release resources