C# Extract Text from PDF: A Complete Guide

Extracting text from PDF files is a common requirement in both office automation and software development. Manually copying and pasting text is slow and impractical when handling large volumes of documents. Traditional automation approaches often rely on external software like Adobe Reader, which adds deployment complexity and struggles with encrypted or scanned PDFs.

This article demonstrates how to use Free Spire.PDF for .NET, a lightweight library that needs no separately installed PDF software, to extract text from PDFs. We’ll walk through environment setup, core code, and advanced techniques, all with ready‑to‑run examples.


Why Choose Spire.PDF?

Traditional Solutions                           | Spire.PDF Solution
Require installed software (Adobe Reader, etc.) | Fully self‑contained .NET assembly
No or limited support for encrypted PDFs        | Load password‑protected files with a single parameter
Complex COM interop, verbose code               | Pure .NET API, clean and intuitive
Sparse or scattered documentation               | Comprehensive API reference and examples

Step‑by‑Step Tutorial

1. Environment Setup

Create a new .NET Console Application (supports .NET Framework 4.6.1+ or .NET Core 3.1+). Then install the FreeSpire.PDF NuGet package:

Install-Package FreeSpire.PDF

Note: The free version processes up to 10 pages per document—ideal for personal or small‑scale projects.

2. Extract Text from a Single Page

The following example loads a PDF, extracts all text from a specific page, and saves it to a .txt file.

using System.IO;
using Spire.Pdf;
using Spire.Pdf.Texts;

namespace ExtractTextFromPage
{
    class Program
    {
        static void Main(string[] args)
        {
            // 1. Load the PDF document
            PdfDocument doc = new PdfDocument();
            doc.LoadFromFile("Sample.pdf");

            // 2. Get the desired page (0‑based index: Pages[1] = second page)
            PdfPageBase page = doc.Pages[1];

            // 3. Create a text extractor for the page
            PdfTextExtractor textExtractor = new PdfTextExtractor(page);

            // 4. Configure extraction options – here we extract all text
            PdfTextExtractOptions extractOptions = new PdfTextExtractOptions
            {
                IsExtractAllText = true
            };

            // 5. Perform extraction
            string text = textExtractor.ExtractText(extractOptions);

            // 6. Write the result to a file
            File.WriteAllText("Extracted_Single_Page_Text.txt", text);

            // 7. Clean up
            doc.Close();
        }
    }
}

Key points:

  • PdfTextExtractor is bound to a specific page.

  • PdfTextExtractOptions lets you control the extraction area (entire page or a rectangle).

  • ExtractText(options) returns the extracted text as a string; with IsExtractAllText = true, that is all text on the page.

3. Advanced Techniques

Handling Encrypted PDFs

If the PDF is password‑protected, supply the password when loading:

doc.LoadFromFile("Encrypted.pdf", "password");
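
In practice you will rarely want the password hard-coded. The following is a minimal sketch, assuming the password is supplied through an environment variable named PDF_PASSWORD (a hypothetical name chosen for illustration) and assuming that a wrong password or unreadable file surfaces as an exception. The exact exception type is not covered here, so the sketch catches broadly and reports the message.

```csharp
using System;
using Spire.Pdf;

class EncryptedLoadSketch
{
    static void Main()
    {
        // PDF_PASSWORD is a hypothetical variable name; substitute your own secret source.
        string password = Environment.GetEnvironmentVariable("PDF_PASSWORD");
        if (string.IsNullOrEmpty(password))
        {
            Console.Error.WriteLine("PDF_PASSWORD is not set.");
            return;
        }

        try
        {
            PdfDocument doc = new PdfDocument();

            // Same two-argument overload as shown above: file path, then password.
            doc.LoadFromFile("Encrypted.pdf", password);
            Console.WriteLine($"Loaded {doc.Pages.Count} page(s).");
            doc.Close();
        }
        catch (Exception ex)
        {
            // Assumption: load failures are reported as exceptions; the type is not assumed.
            Console.Error.WriteLine($"Could not open the PDF: {ex.Message}");
        }
    }
}
```

Once the document is loaded, extraction works the same way as for an unprotected file.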

Extracting Text from All Pages

Loop through every page and concatenate the results:

// Requires "using System.Text;" for StringBuilder; doc is the PdfDocument loaded earlier.
StringBuilder allText = new StringBuilder();
foreach (PdfPageBase page in doc.Pages)
{
    PdfTextExtractor extractor = new PdfTextExtractor(page);
    PdfTextExtractOptions options = new PdfTextExtractOptions
    {
        IsExtractAllText = true
    };
    allText.AppendLine(extractor.ExtractText(options));
}
File.WriteAllText("Extracted_All_Pages_Text.txt", allText.ToString());
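
If you need to trace extracted text back to its page, a variation on the loop can write each page as it is processed and label it. This is a sketch rather than a recommended pattern: it reuses only the calls shown in this article plus Pages.Count, and the "--- Page N ---" marker format is an arbitrary choice made for illustration.

```csharp
using System.IO;
using Spire.Pdf;
using Spire.Pdf.Texts;

class PageByPageSketch
{
    static void Main()
    {
        PdfDocument doc = new PdfDocument();
        doc.LoadFromFile("Sample.pdf");

        // One options object, reused for every page.
        PdfTextExtractOptions options = new PdfTextExtractOptions
        {
            IsExtractAllText = true
        };

        // Write page by page instead of building one large string in memory.
        using (StreamWriter writer = new StreamWriter("Extracted_All_Pages_Text.txt"))
        {
            for (int i = 0; i < doc.Pages.Count; i++)
            {
                PdfTextExtractor extractor = new PdfTextExtractor(doc.Pages[i]);

                // Pages are 0-based in code; the marker shows the 1-based page number.
                writer.WriteLine($"--- Page {i + 1} ---");
                writer.WriteLine(extractor.ExtractText(options));
            }
        }

        doc.Close();
    }
}
```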

Extracting Text from a Specific Area

You can limit extraction to a rectangular region (coordinates in points, 1 point = 1/72 inch):

PdfTextExtractOptions options = new PdfTextExtractOptions();
options.ExtractArea = new System.Drawing.RectangleF(50, 100, 400, 300); // left, top, width, height
string areaText = textExtractor.ExtractText(options);
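
Page layouts are often measured in millimeters or inches rather than points. The helper below is a small sketch of the conversion (72 points per inch, 25.4 mm per inch). PdfUnits and its method names are hypothetical, introduced only for this example, and are not part of Spire.PDF.

```csharp
using System;
using System.Drawing;

// Hypothetical helper class for converting physical units to PDF points.
static class PdfUnits
{
    public const float PointsPerInch = 72f;
    public const float MillimetersPerInch = 25.4f;

    public static float InchesToPoints(float inches) => inches * PointsPerInch;

    public static float MillimetersToPoints(float millimeters) =>
        millimeters / MillimetersPerInch * PointsPerInch;

    // Builds a rectangle in points from left, top, width, height given in millimeters.
    public static RectangleF AreaFromMillimeters(float left, float top, float width, float height) =>
        new RectangleF(
            MillimetersToPoints(left),
            MillimetersToPoints(top),
            MillimetersToPoints(width),
            MillimetersToPoints(height));
}
```

With this in place, the area can be assigned as options.ExtractArea = PdfUnits.AreaFromMillimeters(20f, 30f, 170f, 100f); the numbers here are placeholders for your own measurements.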

Integrating with Other Features

In real‑world projects, text extraction is often just one piece of a larger data pipeline. Spire.PDF provides additional capabilities that can be combined seamlessly:

  • Text + Format Extraction: Use PdfTextFinder to locate text based on style (font, color, size) – perfect for extracting titles or keywords.

  • Table Data Extraction: PdfTableExtractor retrieves structured table data as a DataTable or two‑dimensional array.

  • OCR for Scanned PDFs: Pair with the Spire.OCR library to perform optical character recognition on image‑based PDFs.
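
As an illustration of the table case, here is a minimal sketch that prints the cells of any tables found on the first page. It assumes PdfTableExtractor lives in the Spire.Pdf.Utilities namespace and exposes ExtractTable(pageIndex), with PdfTable offering GetRowCount(), GetColumnCount(), and GetText(row, column). Check these names against the API reference for the version you have installed.

```csharp
using System;
using Spire.Pdf;
using Spire.Pdf.Utilities;

class TableSketch
{
    static void Main()
    {
        PdfDocument doc = new PdfDocument();
        doc.LoadFromFile("Sample.pdf");

        // Assumption: the extractor is constructed from the whole document
        // and queried with a 0-based page index.
        PdfTableExtractor extractor = new PdfTableExtractor(doc);
        PdfTable[] tables = extractor.ExtractTable(0);

        if (tables != null)
        {
            foreach (PdfTable table in tables)
            {
                for (int row = 0; row < table.GetRowCount(); row++)
                {
                    for (int col = 0; col < table.GetColumnCount(); col++)
                    {
                        // Tab-separate the cells of a row.
                        Console.Write(table.GetText(row, col) + "\t");
                    }
                    Console.WriteLine();
                }
                Console.WriteLine();
            }
        }

        doc.Close();
    }
}
```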


Conclusion

Free Spire.PDF for .NET removes the common barriers to PDF text extraction: no external dependencies, simple API, and built‑in support for encrypted files and area‑specific extraction. With just a few lines of C# code, you can automate text retrieval from PDFs and extend it to handle tables, formatting, or even scanned documents. This makes it an excellent choice for .NET developers building efficient, maintainable document processing solutions.