C# Extract Text from PDF: A Complete Guide
Extracting text from PDF files is a common requirement in both office automation and software development. Manually copying and pasting text is slow and impractical when handling large volumes of documents. Traditional automation approaches often rely on external software like Adobe Reader, which adds deployment complexity and struggles with encrypted or scanned PDFs.
This article demonstrates how to use Free Spire.PDF for .NET, a lightweight library that needs no external software, to extract text from PDFs. We'll walk through environment setup, core code, and advanced techniques, all with ready-to-run examples.
Why Choose Spire.PDF?
| Traditional Solutions | Spire.PDF Solution |
|---|---|
| Require installed software (Adobe Reader, etc.) | Fully self‑contained .NET assembly |
| No or limited support for encrypted PDFs | Load password‑protected files with a single parameter |
| Complex COM interop, verbose code | Pure .NET API, clean and intuitive |
| Sparse or scattered documentation | Comprehensive API reference and examples |
Step‑by‑Step Tutorial
1. Environment Setup
Create a new .NET console application (the library supports .NET Framework 4.6.1+ and .NET Core 3.1+), then install the FreeSpire.PDF NuGet package from the Package Manager Console:
Install-Package FreeSpire.PDF
Note: The free version processes up to 10 pages per document, which suits personal or small-scale projects.
2. Extract Text from a Single Page
The following example loads a PDF, extracts all text from a specific page, and saves it to a .txt file.
using System.IO;
using Spire.Pdf;
using Spire.Pdf.Texts;

namespace ExtractTextFromPage
{
    class Program
    {
        static void Main(string[] args)
        {
            // 1. Load the PDF document
            PdfDocument doc = new PdfDocument();
            doc.LoadFromFile("Sample.pdf");

            // 2. Get the desired page (0-based index: Pages[1] = second page)
            PdfPageBase page = doc.Pages[1];

            // 3. Create a text extractor for the page
            PdfTextExtractor textExtractor = new PdfTextExtractor(page);

            // 4. Configure extraction options; here we extract all text
            PdfTextExtractOptions extractOptions = new PdfTextExtractOptions
            {
                IsExtractAllText = true
            };

            // 5. Perform extraction
            string text = textExtractor.ExtractText(extractOptions);

            // 6. Write the result to a file
            File.WriteAllText("Extracted_Single_Page_Text.txt", text);

            // 7. Clean up
            doc.Close();
        }
    }
}
Key points:
- PdfTextExtractor is bound to a specific page.
- PdfTextExtractOptions lets you control the extraction area (entire page or a rectangle).
- ExtractText() returns a string containing all text from that page.
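The key points above can be folded into a small reusable method. The following is a minimal sketch, not part of the library: `PdfTextHelper` and `ExtractPageText` are hypothetical names introduced here, and the file name is a placeholder. It uses only the calls shown in the example above, adds a bounds check on the page index, and closes the document even if extraction fails.

```csharp
using System;
using Spire.Pdf;
using Spire.Pdf.Texts;

static class PdfTextHelper
{
    // Hypothetical helper: returns the text of one page (0-based index).
    public static string ExtractPageText(string pdfPath, int pageIndex)
    {
        PdfDocument doc = new PdfDocument();
        try
        {
            doc.LoadFromFile(pdfPath);

            // Fail with a clear message instead of an indexing error.
            if (pageIndex < 0 || pageIndex >= doc.Pages.Count)
            {
                throw new ArgumentOutOfRangeException(
                    nameof(pageIndex),
                    $"Document has {doc.Pages.Count} page(s).");
            }

            // The extractor is bound to exactly one page.
            PdfTextExtractor extractor = new PdfTextExtractor(doc.Pages[pageIndex]);
            PdfTextExtractOptions options = new PdfTextExtractOptions
            {
                IsExtractAllText = true
            };
            return extractor.ExtractText(options);
        }
        finally
        {
            // Runs whether extraction succeeded or threw.
            doc.Close();
        }
    }
}

class Program
{
    static void Main()
    {
        // "Sample.pdf" is a placeholder path.
        string text = PdfTextHelper.ExtractPageText("Sample.pdf", 0);
        Console.WriteLine(text);
    }
}
```

Wrapping the work in try/finally keeps the cleanup step from the numbered example in one place, so callers cannot forget it.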
3. Advanced Techniques
Handling Encrypted PDFs
If the PDF is password‑protected, supply the password when loading:
doc.LoadFromFile("Encrypted.pdf", "password");
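In practice you may not want the password hard-coded in source. The sketch below reads it from an environment variable instead; `PDF_PASSWORD` is a hypothetical variable name chosen for this example, and the file names are placeholders. The exception type thrown for a wrong password is not assumed here, so the code catches the general `Exception`.

```csharp
using System;
using System.IO;
using Spire.Pdf;
using Spire.Pdf.Texts;

class Program
{
    static void Main()
    {
        // PDF_PASSWORD is a hypothetical name used only in this sketch.
        string password = Environment.GetEnvironmentVariable("PDF_PASSWORD");
        if (string.IsNullOrEmpty(password))
        {
            Console.Error.WriteLine("Set PDF_PASSWORD before running.");
            return;
        }

        PdfDocument doc = new PdfDocument();
        try
        {
            // Same overload as above: file name plus password.
            doc.LoadFromFile("Encrypted.pdf", password);

            PdfTextExtractor extractor = new PdfTextExtractor(doc.Pages[0]);
            PdfTextExtractOptions options = new PdfTextExtractOptions
            {
                IsExtractAllText = true
            };
            File.WriteAllText("Extracted_Encrypted_Text.txt", extractor.ExtractText(options));
        }
        catch (Exception ex)
        {
            // The specific exception type for a bad password is not assumed.
            Console.Error.WriteLine($"Could not read the PDF: {ex.Message}");
        }
        finally
        {
            doc.Close();
        }
    }
}
```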
Extracting Text from All Pages
Loop through every page and concatenate the results:
// Requires: using System.Text; (for StringBuilder) and using System.IO; (for File)
StringBuilder allText = new StringBuilder();

foreach (PdfPageBase page in doc.Pages)
{
    // Each page needs its own extractor
    PdfTextExtractor extractor = new PdfTextExtractor(page);
    PdfTextExtractOptions options = new PdfTextExtractOptions
    {
        IsExtractAllText = true
    };
    allText.AppendLine(extractor.ExtractText(options));
}

File.WriteAllText("Extracted_All_Pages_Text.txt", allText.ToString());
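The same loop extends naturally to a whole folder of documents, which is the large-volume case mentioned in the introduction. The following is a rough sketch under stated assumptions: the folder names `pdfs` and `text` are placeholders, and a file that fails to load is logged and skipped rather than stopping the run.

```csharp
using System;
using System.IO;
using System.Text;
using Spire.Pdf;
using Spire.Pdf.Texts;

class Program
{
    static void Main(string[] args)
    {
        // Placeholder folder names; pass real ones as arguments.
        string inputDir = args.Length > 0 ? args[0] : "pdfs";
        string outputDir = args.Length > 1 ? args[1] : "text";

        if (!Directory.Exists(inputDir))
        {
            Console.Error.WriteLine($"Input folder not found: {inputDir}");
            return;
        }
        Directory.CreateDirectory(outputDir);

        foreach (string pdfPath in Directory.EnumerateFiles(inputDir, "*.pdf"))
        {
            PdfDocument doc = new PdfDocument();
            try
            {
                doc.LoadFromFile(pdfPath);

                // Concatenate the text of every page, as in the loop above.
                StringBuilder allText = new StringBuilder();
                foreach (PdfPageBase page in doc.Pages)
                {
                    PdfTextExtractor extractor = new PdfTextExtractor(page);
                    PdfTextExtractOptions options = new PdfTextExtractOptions
                    {
                        IsExtractAllText = true
                    };
                    allText.AppendLine(extractor.ExtractText(options));
                }

                // Write "name.pdf" to "<outputDir>/name.txt".
                string outPath = Path.Combine(
                    outputDir,
                    Path.GetFileNameWithoutExtension(pdfPath) + ".txt");
                File.WriteAllText(outPath, allText.ToString());
            }
            catch (Exception ex)
            {
                // Skip files that cannot be processed and keep going.
                Console.Error.WriteLine($"Skipped {pdfPath}: {ex.Message}");
            }
            finally
            {
                doc.Close();
            }
        }
    }
}
```

Keep the free version's 10-page limit in mind when running this over long documents.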
Extracting Text from a Specific Area
You can limit extraction to a rectangular region (coordinates in points, 1 point = 1/72 inch):
PdfTextExtractOptions options = new PdfTextExtractOptions();
options.ExtractArea = new System.Drawing.RectangleF(50, 100, 400, 300); // left, top, width, height
string areaText = textExtractor.ExtractText(options);
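Because the rectangle is given in points, measurements taken in millimeters or inches need converting first. The helper below is a small sketch of that arithmetic; `PdfUnits` and its method names are hypothetical and not part of Spire.PDF. It relies only on the 1 point = 1/72 inch relationship stated above and on 1 inch = 25.4 mm.

```csharp
using System;
using System.Drawing;

static class PdfUnits
{
    const float PointsPerInch = 72f;
    const float MillimetersPerInch = 25.4f;

    public static float InchesToPoints(float inches)
    {
        return inches * PointsPerInch;
    }

    public static float MillimetersToPoints(float millimeters)
    {
        // Convert mm to inches, then inches to points.
        return millimeters / MillimetersPerInch * PointsPerInch;
    }

    // Builds a rectangle in points from left, top, width, height in millimeters.
    public static RectangleF AreaFromMillimeters(float left, float top, float width, float height)
    {
        return new RectangleF(
            MillimetersToPoints(left),
            MillimetersToPoints(top),
            MillimetersToPoints(width),
            MillimetersToPoints(height));
    }
}

class Program
{
    static void Main()
    {
        // Example region: 20 mm from the left, 30 mm from the top, 170 mm x 100 mm.
        RectangleF area = PdfUnits.AreaFromMillimeters(20f, 30f, 170f, 100f);
        Console.WriteLine(area);
    }
}
```

The resulting `RectangleF` can be assigned to `options.ExtractArea` exactly as in the snippet above.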
Integrating with Other Features
In real‑world projects, text extraction is often just one piece of a larger data pipeline. Spire.PDF provides additional capabilities that can be combined seamlessly:
- Text + Format Extraction: Use PdfTextFinder to locate specific text on a page, such as titles or keywords, together with its position.
- Table Data Extraction: PdfTableExtractor retrieves structured table data that you can load into a DataTable or a two-dimensional array.
- OCR for Scanned PDFs: Pair with the Spire.OCR library to perform optical character recognition on image-based PDFs.
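As an illustration of the table case, the sketch below copies the first table on the first page into a two-dimensional string array. It assumes the `PdfTableExtractor` API in the `Spire.Pdf.Utilities` namespace (`ExtractTable`, `GetRowCount`, `GetColumnCount`, `GetText`); treat those names as assumptions and check them against the API reference for your installed version. The file name is a placeholder.

```csharp
using System;
using Spire.Pdf;
using Spire.Pdf.Utilities;

class Program
{
    static void Main()
    {
        PdfDocument doc = new PdfDocument();
        try
        {
            doc.LoadFromFile("Sample.pdf"); // placeholder path

            // Assumed API: extract the tables found on page index 0.
            PdfTableExtractor extractor = new PdfTableExtractor(doc);
            PdfTable[] tables = extractor.ExtractTable(0);

            if (tables == null || tables.Length == 0)
            {
                Console.WriteLine("No tables found on the first page.");
                return;
            }

            // Copy the first table cell by cell into a 2D array.
            PdfTable table = tables[0];
            int rows = table.GetRowCount();
            int cols = table.GetColumnCount();
            string[,] cells = new string[rows, cols];

            for (int r = 0; r < rows; r++)
            {
                for (int c = 0; c < cols; c++)
                {
                    cells[r, c] = table.GetText(r, c);
                }
            }

            Console.WriteLine($"Copied {rows} x {cols} cells.");
        }
        finally
        {
            doc.Close();
        }
    }
}
```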
Conclusion
Free Spire.PDF for .NET removes the common barriers to PDF text extraction: no external dependencies, simple API, and built‑in support for encrypted files and area‑specific extraction. With just a few lines of C# code, you can automate text retrieval from PDFs and extend it to handle tables, formatting, or even scanned documents. This makes it an excellent choice for .NET developers building efficient, maintainable document processing solutions.
