Extracting Text from PDF Pages
Description
Text extraction is one of the most common operations when working with PDFs. PDFium provides powerful text extraction capabilities through its WebAssembly API. This guide explains how to extract text from PDF pages, which is useful for searching, indexing, and making PDF content accessible.
Interactive Example
This interactive example demonstrates how to extract text from PDF pages. You can navigate between pages to see the extracted text from each page:
Extracted Text:
No text extracted yet.
The Basic Workflow
Extracting text from a PDF page involves several steps:
- Initialize PDFium
- Load the PDF document
- Load the specific page
- Create a text page object for text extraction
- Get the character count
- Allocate a buffer for the text
- Extract the text into the buffer
- Convert the text from UTF-16LE to a JavaScript string
- Clean up resources
Prerequisites
The examples in this guide use the initializePdfium
helper function from our Getting Started guide. This function properly initializes the PDFium library, including calling the required PDFiumExt_Init()
function. Make sure to include this initialization code in your application before trying the examples below.
Example Implementation
Here’s a complete implementation for loading a PDF document and extracting text from its pages:
import { initializePdfium } from './initialize-pdfium';
// Note: The initializePdfium function is a helper that initializes the PDFium library.
// For the full implementation, see: /docs/pdfium/getting-started
/**
* Loads a PDF document and returns an object with methods to access and extract text from it.
* This example focuses solely on extracting text from PDF pages.
*/
export async function loadPdfForTextExtraction(pdfData: Uint8Array) {
// Initialize PDFium
const pdfium = await initializePdfium();
// Allocate memory for the PDF data
const filePtr = pdfium.pdfium.wasmExports.malloc(pdfData.length);
pdfium.pdfium.HEAPU8.set(pdfData, filePtr);
// Load the document
const docPtr = pdfium.FPDF_LoadMemDocument(filePtr, pdfData.length, 0);
if (!docPtr) {
const error = pdfium.FPDF_GetLastError();
pdfium.pdfium.wasmExports.free(filePtr);
// Handle password-protected documents
if (error === 4) {
return {
hasPassword: true,
pageCount: 0,
close: () => {}, // No-op since no document was loaded
extractText: async () => { throw new Error('Document is password protected'); }
};
}
throw new Error(`Failed to load PDF: ${error}`);
}
// Get page count
const pageCount = pdfium.FPDF_GetPageCount(docPtr);
// Return an object with document info and text extraction capabilities
return {
hasPassword: false,
pageCount,
// Close the document and free resources
close: () => {
pdfium.FPDF_CloseDocument(docPtr);
pdfium.pdfium.wasmExports.free(filePtr);
},
// Extract all text from a specific page
extractText: async (pageIndex: number): Promise<string> => {
// Check if the page index is valid
if (pageIndex < 0 || pageIndex >= pageCount) {
throw new Error(`Invalid page index: ${pageIndex}. Document has ${pageCount} pages.`);
}
// Load the page for rendering
const pagePtr = pdfium.FPDF_LoadPage(docPtr, pageIndex);
if (!pagePtr) {
throw new Error(`Failed to load page ${pageIndex}`);
}
try {
// Load the page text (a separate structure for text extraction)
const textPagePtr = pdfium.FPDFText_LoadPage(pagePtr);
if (!textPagePtr) {
throw new Error(`Failed to load text for page ${pageIndex}`);
}
try {
// Get the character count on the page
const charCount = pdfium.FPDFText_CountChars(textPagePtr);
if (charCount <= 0) {
return ''; // No text on this page or error
}
// Allocate a buffer for the text (+1 for null terminator)
const bufferSize = (charCount + 1) * 2; // UTF-16, 2 bytes per character
const textBufferPtr = pdfium.pdfium.wasmExports.malloc(bufferSize);
try {
// Extract the text (PDFium uses UTF-16LE encoding)
const extractedLength = pdfium.FPDFText_GetText(
textPagePtr,
0, // Start index
charCount, // Character count to extract
textBufferPtr // Output buffer
);
// If text was extracted, convert from UTF-16LE to JavaScript string
if (extractedLength > 0) {
// Convert UTF-16 array to JavaScript string
return pdfium.pdfium.UTF16ToString(textBufferPtr);
}
return ''; // No text extracted
} finally {
// Clean up text buffer
pdfium.pdfium.wasmExports.free(textBufferPtr);
}
} finally {
// Clean up text page
pdfium.FPDFText_ClosePage(textPagePtr);
}
} finally {
// Clean up page
pdfium.FPDF_ClosePage(pagePtr);
}
}
};
}
Usage Example
Here’s a simple example of how to use the text extraction functionality:
// Load a PDF file and extract text from its pages
async function extractTextFromPdf() {
try {
// Fetch the PDF file
const response = await fetch('sample.pdf');
const arrayBuffer = await response.arrayBuffer();
const pdfData = new Uint8Array(arrayBuffer);
// Load the document
const pdfDocument = await loadPdfForTextExtraction(pdfData);
// Get the page count
const pageCount = pdfDocument.pageCount;
console.log(`PDF has ${pageCount} pages`);
// Extract text from each page
for (let i = 0; i < pageCount; i++) {
const text = await pdfDocument.extractText(i);
console.log(`--- Page ${i + 1} ---`);
console.log(text || '[No text on this page]');
}
// Clean up when done
pdfDocument.close();
} catch (error) {
console.error('Error extracting text:', error);
}
}
extractTextFromPdf();
Memory Management Considerations
When extracting text from PDF pages, proper memory management is critical to prevent memory leaks:
-
Text page cleanup: Always close text pages using
FPDFText_ClosePage
when you’re done with them. -
Page cleanup: Always close pages using
FPDF_ClosePage
when you’re done with them. -
Buffer cleanup: Free any allocated memory buffers using
pdfium.pdfium.wasmExports.free()
when you’re done with them. -
Nested try/finally blocks: Use nested try/finally blocks to ensure proper cleanup of resources in the correct order, even if errors occur.
Text Extraction Limitations
When working with text extraction in PDFs, be aware of these limitations:
-
Text recognition: PDFium can only extract text that is actually stored as text in the PDF. It cannot perform OCR (Optical Character Recognition) on scanned documents or images.
-
Text order: The order of extracted text might not always match the visual order on the page, especially for complex layouts or documents with multiple columns.
-
Special characters: Some special characters or symbols might not be correctly extracted or might be represented differently in the extracted text.
-
Password protection: Text cannot be extracted from password-protected PDFs without providing the correct password.
Related Functions
- FPDF_LoadMemDocument - Load a PDF document from memory
- FPDF_GetPageCount - Get the number of pages in a document
- FPDF_LoadPage - Load a specific page from a document
- FPDFText_LoadPage - Load a page for text extraction
- FPDFText_CountChars - Get the number of characters on a page
- FPDFText_GetText - Extract text from a page
- FPDFText_ClosePage - Close a text page and release resources
- FPDF_ClosePage - Close a page and release resources
- FPDF_CloseDocument - Close a document and release resources
- FPDF_GetLastError - Get the error code for the last failed operation