How to Extract Text from PDFs in .NET

This guide covers text extraction from PDF pages using the absorber API.

Prerequisites

Requirement	Detail
Runtime	.NET 8.0 or later
Package	`dotnet add package Aspose.Pdf.Foss`

Extract all text from a page

TextAbsorber collects all text on a page into a single string. Create an instance, pass it to Page.Accept, and read the result from the Text property. This is the simplest approach when you need plain-text output without positional information.

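// pdfBytes holds the PDF file contents, loaded elsewhere (for example with File.ReadAllBytes)
using var doc = Document.Open(pdfBytes);
var absorber = new TextAbsorber();
// Visit the page so the absorber collects its text
doc.Pages[1].Accept(absorber);
Console.WriteLine(absorber.Text);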

Extract structured text fragments

When you need position, font, and size metadata alongside the text, use TextFragmentAbsorber instead. It returns a collection of TextFragment objects, and each one exposes its coordinates through the Position property (XIndent and YIndent).

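// Reuses the doc opened in the previous example
var absorber = new TextFragmentAbsorber();
doc.Pages[1].Accept(absorber);

// Print each fragment's text together with its position on the page
foreach (var fragment in absorber.TextFragments)
{
    Console.WriteLine($"{fragment.Text} at ({fragment.Position.XIndent}, {fragment.Position.YIndent})");
}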
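Position data is what makes it possible to put fragments back into reading order. The sketch below is one way to do that, not part of the library: PlacedText is a hypothetical record standing in for the text and coordinates copied out of each fragment, and the 2.0 tolerance is an arbitrary illustrative value. It assumes the usual PDF convention of a bottom-left origin, so larger Y values sit higher on the page.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the text and coordinates read from each fragment.
public record PlacedText(string Text, double X, double Y);

public static class ReadingOrder
{
    // Groups items into lines when their Y value is within lineTolerance of the
    // first item on the line, then orders lines top to bottom (descending Y,
    // assuming a bottom-left origin) and items left to right (ascending X).
    public static List<string> ToLines(IEnumerable<PlacedText> items, double lineTolerance = 2.0)
    {
        var lines = new List<List<PlacedText>>();
        foreach (var item in items.OrderByDescending(i => i.Y))
        {
            var current = lines.Count > 0 ? lines[^1] : null;
            if (current != null && Math.Abs(current[0].Y - item.Y) <= lineTolerance)
                current.Add(item);
            else
                lines.Add(new List<PlacedText> { item });
        }

        return lines
            .Select(line => string.Join(" ", line.OrderBy(i => i.X).Select(i => i.Text)))
            .ToList();
    }
}
```

To use it with the loop above, build one PlacedText per fragment from fragment.Text, fragment.Position.XIndent, and fragment.Position.YIndent, then pass the list to ReadingOrder.ToLines.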

Search with a regular expression

Pass a regex pattern to TextFragmentAbsorber to find matching text across all pages. This is useful for locating phone numbers, dates, or other structured data within a document. The pattern in this example matches digit groups of the form 123-45-6789.

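// Reuses the doc opened in the first example
var absorber = new TextFragmentAbsorber(@"\d{3}-\d{2}-\d{4}");
// Pages.Accept runs the search over every page instead of a single one
doc.Pages.Accept(absorber);
foreach (var f in absorber.TextFragments)
    Console.WriteLine(f.Text);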
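Before running a pattern against a document, it can help to check it against a plain string. The sketch below uses only System.Text.RegularExpressions from the .NET base library and the same pattern as the example above; PatternCheck and FindAll are illustrative names, not part of the PDF package. It shows what the .NET regex engine matches, which may differ in detail from how the absorber applies the pattern.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class PatternCheck
{
    // Same pattern as the absorber example: 3 digits, 2 digits, 4 digits, dash-separated.
    public const string Pattern = @"\d{3}-\d{2}-\d{4}";

    // Returns every substring of input that the pattern matches.
    public static string[] FindAll(string input) =>
        Regex.Matches(input, Pattern).Select(m => m.Value).ToArray();
}
```

Calling PatternCheck.FindAll("ID 123-45-6789, phone 555-123-4567") returns only "123-45-6789", because the phone number uses a 3-3-4 grouping. The pattern has no word boundaries, so it also matches inside longer digit runs; wrap it in \b...\b if that is not wanted.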

Key Classes

Class	Purpose
`TextAbsorber`	Extract all text as a string
`TextFragmentAbsorber`	Structured text extraction
`TextFragment`	Text with position and font
`FontRepository`	Font lookup
