How to Extract Text from PDFs in .NET
This guide covers text extraction from PDF pages using the absorber API.
Prerequisites
| Requirement | Detail |
|---|---|
| Runtime | .NET 8.0 or later |
| Package | dotnet add package Aspose.Pdf.Foss |
Extract all text from a page
TextAbsorber collects all text on a page into a single string. Create an
instance, pass it to Page.Accept, and read the result from the Text property.
This is the simplest approach when you need plain-text output without positional
information.
using var doc = Document.Open(pdfBytes);
var absorber = new TextAbsorber();
doc.Pages[1].Accept(absorber);
Console.WriteLine(absorber.Text);

Extract structured text fragments
When you need position, font, and size metadata alongside the text,
use TextFragmentAbsorber instead. Each TextFragment exposes its
location on the page through the Position property (XIndent and YIndent).
var absorber = new TextFragmentAbsorber();
doc.Pages[1].Accept(absorber);
foreach (var fragment in absorber.TextFragments)
{
    Console.WriteLine($"{fragment.Text} at ({fragment.Position.XIndent}, {fragment.Position.YIndent})");
}

Search with a regular expression
Pass a regex pattern to the TextFragmentAbsorber constructor to find matching
text across all pages. This is useful for locating phone numbers, dates, or
other structured data within a document. The pattern below matches
hyphen-separated digit groups of the form 123-45-6789.
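Before searching a document, it can help to check the pattern against a plain string. The following is a minimal sketch that uses only System.Text.RegularExpressions from the .NET base library, not the PDF package. The sample string is hypothetical, and the sketch assumes the absorber applies standard .NET regex semantics, which this guide does not state.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Same pattern as in the document search: 3 digits, 2 digits, 4 digits,
// separated by hyphens.
var pattern = new Regex(@"\d{3}-\d{2}-\d{4}");

// Hypothetical sample text standing in for extracted page content.
const string sample = "Ref 123-45-6789 filed 2024-01-15; call 555-0100.";

// Collect every non-overlapping match in order of appearance.
string[] matches = pattern.Matches(sample)
    .Select(match => match.Value)
    .ToArray();

foreach (var hit in matches)
    Console.WriteLine(hit); // prints "123-45-6789"
```

Only the first number matches. The date and the short phone number have different digit groupings, so a pattern has to be written for the exact shape of the data you want to find. The same pattern string is what the TextFragmentAbsorber constructor receives in the document search: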
var absorber = new TextFragmentAbsorber(@"\d{3}-\d{2}-\d{4}");
doc.Pages.Accept(absorber);
foreach (var f in absorber.TextFragments)
    Console.WriteLine(f.Text);

Key Classes
| Class | Purpose |
|---|---|
| TextAbsorber | Extract all text as a string |
| TextFragmentAbsorber | Structured text extraction |
| TextFragment | Text with position and font |
| FontRepository | Font lookup |