How to Extract Text from PDFs in .NET

This guide covers extracting plain text, positioned text fragments, and regex matches from PDF pages using the absorber API.


Prerequisites

Requirement   Detail
Runtime       .NET 8.0 or later
Package       dotnet add package Aspose.Pdf.Foss

Extract all text from a page

TextAbsorber collects all text on a page into a single string. Create an instance, pass it to Page.Accept, and read the result from the Text property. This is the simplest approach when you need plain-text output without positional information.

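// pdfBytes holds the raw PDF content; "input.pdf" is a placeholder path.
byte[] pdfBytes = File.ReadAllBytes("input.pdf");

using var doc = Document.Open(pdfBytes);
var absorber = new TextAbsorber();
doc.Pages[1].Accept(absorber);   // visit the page and collect its text
Console.WriteLine(absorber.Text);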

Extract structured text fragments

When you need position, font, and size metadata alongside the text, use TextFragmentAbsorber instead. Each TextFragment exposes its coordinates through the Position property (XIndent and YIndent).

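// Reuses the doc instance opened in the previous example.
var absorber = new TextFragmentAbsorber();
doc.Pages[1].Accept(absorber);

// Print each fragment with the coordinates reported by its Position.
foreach (var fragment in absorber.TextFragments)
{
    Console.WriteLine($"{fragment.Text} at ({fragment.Position.XIndent}, {fragment.Position.YIndent})");
}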
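
Fragment positions can also be used to rebuild reading order when fragments do not arrive in visual sequence. The following is a minimal sketch under stated assumptions, not part of the library: Frag is a hypothetical record standing in for the values read from each fragment (Text, Position.XIndent, Position.YIndent), Y is assumed to grow upward from the bottom of the page as in the usual PDF coordinate system, and the tolerance of 2 units is an arbitrary choice.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the values copied out of a fragment
// (fragment.Text, fragment.Position.XIndent, fragment.Position.YIndent).
public record Frag(string Text, double X, double Y);

public static class ReadingOrder
{
    // Groups fragments into lines from top to bottom, then orders each line left to right.
    // Assumes Y increases upward, so a larger Y means higher on the page.
    public static List<string> ToLines(IEnumerable<Frag> frags, double tolerance = 2.0)
    {
        var lines = new List<List<Frag>>();
        foreach (var frag in frags.OrderByDescending(x => x.Y))
        {
            var current = lines.Count > 0 ? lines[^1] : null;
            // Start a new line when the vertical distance from the line's
            // first fragment exceeds the tolerance.
            if (current == null || Math.Abs(current[0].Y - frag.Y) > tolerance)
            {
                current = new List<Frag>();
                lines.Add(current);
            }
            current.Add(frag);
        }
        return lines
            .Select(line => string.Join(" ", line.OrderBy(x => x.X).Select(x => x.Text)))
            .ToList();
    }
}
```

For instance, the fragments "world" at (60, 700.5), "Hello" at (10, 700), and "Next" at (10, 680) come back as the two lines "Hello world" and "Next". Copying the values into a plain record keeps the ordering logic independent of the PDF library and easy to test.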

Search with a regular expression

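Pass a regex pattern to the TextFragmentAbsorber constructor and call Accept on the Pages collection to find matching text across all pages. This is useful for locating phone numbers, dates, or other structured data within a document. The pattern in this example matches digit groups of the form 123-45-6789.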

// Reuses the doc instance opened earlier.
var absorber = new TextFragmentAbsorber(@"\d{3}-\d{2}-\d{4}");
doc.Pages.Accept(absorber);   // search every page
foreach (var f in absorber.TextFragments)
{
    Console.WriteLine(f.Text);
}
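
Before running a pattern over a whole document, it can help to try it against sample strings with the standard .NET Regex class. This sketch assumes the absorber interprets patterns with the same syntax as System.Text.RegularExpressions, which this guide does not state. PatternCheck is a hypothetical helper and the sample text is made up.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for trying a pattern on sample text; not part of the PDF library.
public static class PatternCheck
{
    // Returns every match of the pattern in the sample, in order of appearance.
    public static string[] FindAll(string pattern, string sample) =>
        Regex.Matches(sample, pattern).Select(m => m.Value).ToArray();
}
```

Calling PatternCheck.FindAll(@"\d{3}-\d{2}-\d{4}", "ID 123-45-6789, phone 555-1234") returns a single match, 123-45-6789. The second number does not fit the 3-2-4 digit grouping. Since the pattern has no anchors, it can also match inside a longer digit run, so wrapping it in \b word boundaries may be worth considering.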

Key Classes

Class                  Purpose
TextAbsorber           Extract all text as a string
TextFragmentAbsorber   Structured text extraction
TextFragment           Text with position and font
FontRepository         Font lookup
