How to Extract Document Structure with Parsers in Python

Aspose.Words FOSS for Python provides parser classes for extracting structured data from DOCX documents. This guide covers NumberingParser for list numbering and StyleParser for document styles.

Prerequisites

Install the library:

pip install "aspose-words-foss>=26.4.0"

Requires Python 3.10 or later.

Numbering Parser

NumberingParser reads list numbering definitions from a DOCX package. After calling parse_numbering_part(), you can query list properties:

NumberingParser.get_list_info() — retrieve information about a specific list by its ID
NumberingParser.is_ordered_list() — check whether a list level is ordered or bulleted
NumberingParser.get_start_value() — get the starting number for a list level
NumberingParser.get_delimiter() — get the delimiter string for a list level
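
To make the kind of data these methods expose more concrete, the following is a minimal standard-library sketch that reads the same list definitions directly from a DOCX package. It is not the library's implementation and does not call NumberingParser. The function name read_numbering and the returned dictionary layout are hypothetical, and the sketch assumes the numbering part sits at the conventional path word/numbering.xml. Level overrides (w:lvlOverride) are ignored for brevity.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by numbering.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def _val(element, child_name):
    """Return the w:val attribute of a child element, or None if absent."""
    child = element.find(f"w:{child_name}", NS)
    return None if child is None else child.get(f"{{{W}}}val")


def read_numbering(docx_file):
    """Hypothetical helper: map each numId to its abstract definition and levels."""
    with zipfile.ZipFile(docx_file) as package:
        # Assumption: the numbering part lives at its conventional path.
        if "word/numbering.xml" not in package.namelist():
            return {}  # the document defines no lists
        root = ET.fromstring(package.read("word/numbering.xml"))

    # Abstract definitions carry the per-level formatting.
    abstract = {}
    for abstract_num in root.findall("w:abstractNum", NS):
        levels = {}
        for lvl in abstract_num.findall("w:lvl", NS):
            start = _val(lvl, "start")
            levels[int(lvl.get(f"{{{W}}}ilvl"))] = {
                "format": _val(lvl, "numFmt"),   # e.g. "decimal" or "bullet"
                "start": None if start is None else int(start),
                "text": _val(lvl, "lvlText"),    # e.g. "%1."
            }
        abstract[abstract_num.get(f"{{{W}}}abstractNumId")] = levels

    # Concrete w:num entries point at one abstract definition each.
    result = {}
    for num in root.findall("w:num", NS):
        abstract_id = _val(num, "abstractNumId")
        result[num.get(f"{{{W}}}numId")] = {
            "abstract_num_id": abstract_id,
            "levels": abstract.get(abstract_id, {}),
        }
    return result
```

Calling read_numbering("report.docx") on a document with lists would return one entry per list ID, which is roughly the information the query methods above are described as exposing.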

Style Parser

StyleParser parses style names into structured ParsedStyle objects, identifying headings, blockquotes, code blocks, and list paragraphs:

StyleParser.parse() — parse a style name into a ParsedStyle object
StyleParser.get_style_chain() — parse a chain of style names for inherited styles
StyleParser.is_setext_heading() — check if a style is a Setext-style heading
StyleParser.extract_all_styles() — extract individual style names from a comma-separated chain

Numbering Data Model

Parsed numbering data is stored in structured objects:

| Class | Key Properties |
| --- | --- |
| `NumberingInfo` | `num_id`, `abstract_num_id`, `levels` |
| `NumberingLevel` | `format`, `start`, `text` |
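
As a rough picture of how these two classes relate, the sketch below models them as plain dataclasses. These are stand-ins written for illustration, not the library's definitions: the field types, the use of a dictionary keyed by level index, and the is_ordered helper are all assumptions. The helper treats the OOXML number formats "bullet" and "none" as unordered and everything else as ordered.

```python
from dataclasses import dataclass, field


@dataclass
class NumberingLevel:
    """Stand-in for one list level (field types are assumed)."""
    format: str   # OOXML number format, e.g. "decimal" or "bullet"
    start: int    # first number used at this level
    text: str     # label template, e.g. "%1."


@dataclass
class NumberingInfo:
    """Stand-in for one list definition (field types are assumed)."""
    num_id: int
    abstract_num_id: int
    levels: dict[int, NumberingLevel] = field(default_factory=dict)


def is_ordered(level: NumberingLevel) -> bool:
    """Hypothetical helper: a level is ordered unless it is a bullet or unnumbered."""
    return level.format not in {"bullet", "none"}


# Usage: a list whose top level is numbered and whose second level is bulleted.
info = NumberingInfo(
    num_id=1,
    abstract_num_id=0,
    levels={
        0: NumberingLevel(format="decimal", start=1, text="%1."),
        1: NumberingLevel(format="bullet", start=1, text="\u2022"),
    },
)
ordered_flags = {index: is_ordered(level) for index, level in info.levels.items()}
```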

Summary

| Parser | Purpose |
| --- | --- |
| `NumberingParser` | Extract list numbering definitions |
| `StyleParser` | Parse style names into structured information |
