How to Extract Document Structure with Parsers in Python

Aspose.Words FOSS for Python provides parser classes for extracting structured data from DOCX documents. This guide covers NumberingParser for list numbering and StyleParser for document styles.

Prerequisites

Install the library:

pip install "aspose-words-foss>=26.4.0"

Requires Python 3.10 or later.

Numbering Parser

NumberingParser reads list numbering definitions from a DOCX package. After calling parse_numbering_part(), you can query list properties:

  • NumberingParser.get_list_info() — retrieve information about a specific list by its ID
  • NumberingParser.is_ordered_list() — check whether a list level is ordered or bulleted
  • NumberingParser.get_start_value() — get the starting number for a list level
  • NumberingParser.get_delimiter() — get the delimiter string for a list level
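To make the queried properties concrete, the sketch below reads a small, hand-written numbering part using only the Python standard library. It does not call NumberingParser and is not the library's implementation; the names read_numbering, is_ordered, and delimiter are hypothetical helpers, and the rule that a delimiter is the level text with its %N placeholders removed is an assumption made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/numbering.xml in a DOCX package.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hand-written sample standing in for a real numbering part.
SAMPLE = f"""<w:numbering xmlns:w="{W}">
  <w:abstractNum w:abstractNumId="0">
    <w:lvl w:ilvl="0">
      <w:start w:val="3"/>
      <w:numFmt w:val="decimal"/>
      <w:lvlText w:val="%1."/>
    </w:lvl>
    <w:lvl w:ilvl="1">
      <w:start w:val="1"/>
      <w:numFmt w:val="bullet"/>
      <w:lvlText w:val="-"/>
    </w:lvl>
  </w:abstractNum>
  <w:num w:numId="1">
    <w:abstractNumId w:val="0"/>
  </w:num>
</w:numbering>"""


def _val(parent, tag):
    """Return the w:val attribute of a child element, or None if absent."""
    child = parent.find(f"w:{tag}", NS)
    return None if child is None else child.get(f"{{{W}}}val")


def read_numbering(xml_text):
    """Map each numId to {level index: {"format", "start", "text"}}."""
    root = ET.fromstring(xml_text)
    abstract = {}
    for a in root.findall("w:abstractNum", NS):
        levels = {}
        for lvl in a.findall("w:lvl", NS):
            ilvl = int(lvl.get(f"{{{W}}}ilvl"))
            levels[ilvl] = {
                "format": _val(lvl, "numFmt"),
                "start": int(_val(lvl, "start") or 1),
                "text": _val(lvl, "lvlText") or "",
            }
        abstract[a.get(f"{{{W}}}abstractNumId")] = levels
    # Each concrete list (w:num) points at one abstract definition.
    return {
        num.get(f"{{{W}}}numId"): abstract.get(_val(num, "abstractNumId"), {})
        for num in root.findall("w:num", NS)
    }


def is_ordered(level):
    """Treat every format except 'bullet' and 'none' as ordered."""
    return level["format"] not in ("bullet", "none")


def delimiter(level):
    """Assumption: the delimiter is the level text minus %N placeholders."""
    return re.sub(r"%\d", "", level["text"])


lists = read_numbering(SAMPLE)
top = lists["1"][0]
print(top["start"], is_ordered(top), repr(delimiter(top)))
```

For a real document, the same XML can be read from the package with zipfile.ZipFile(path).read("word/numbering.xml"), provided the document contains a numbering part.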

Style Parser

StyleParser parses style names into structured ParsedStyle objects, identifying headings, blockquotes, code blocks, and list paragraphs:

  • StyleParser.parse() — parse a style name into a ParsedStyle object
  • StyleParser.get_style_chain() — parse a chain of style names for inherited styles
  • StyleParser.is_setext_heading() — check if a style is a Setext-style heading
  • StyleParser.extract_all_styles() — extract individual style names from a comma-separated chain
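The idea of turning style names into structured records can be sketched without the library. The code below is a standalone illustration, not StyleParser itself: StyleSketch, split_chain, and classify are hypothetical names, and the matching rules (for example, that any style name containing "quote" is a blockquote) are assumptions chosen to keep the example short.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StyleSketch:
    """Hypothetical stand-in for a parsed style record."""
    name: str
    kind: str                      # "heading", "blockquote", "code", "list", "paragraph"
    heading_level: Optional[int] = None


_HEADING = re.compile(r"^heading\s*(\d)$", re.IGNORECASE)


def split_chain(chain: str) -> List[str]:
    """Split a comma-separated style chain into trimmed, non-empty names."""
    return [part.strip() for part in chain.split(",") if part.strip()]


def classify(name: str) -> StyleSketch:
    """Classify one style name using simple, assumed naming rules."""
    cleaned = name.strip()
    match = _HEADING.match(cleaned)
    if match:
        return StyleSketch(cleaned, "heading", int(match.group(1)))
    lowered = cleaned.lower()
    if "quote" in lowered:
        return StyleSketch(cleaned, "blockquote")
    if "code" in lowered:
        return StyleSketch(cleaned, "code")
    if lowered.startswith("list"):
        return StyleSketch(cleaned, "list")
    return StyleSketch(cleaned, "paragraph")


for style_name in split_chain("Heading 2, Quote, List Paragraph, Normal"):
    print(classify(style_name))
```

A real parser would likely need locale-aware and custom style names; the fixed keyword checks here only show the shape of the result.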

Numbering Data Model

Parsed numbering data is stored in structured objects:

Class            Key Properties
NumberingInfo    num_id, abstract_num_id, levels
NumberingLevel   format, start, text
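As a rough picture of how these objects relate, the sketch below models them as dataclasses. Only the property names come from the table above; the class names carry a Sketch suffix to mark them as hypothetical, and the field types, the dictionary keyed by level index, and the first_marker helper are all assumptions rather than the library's definitions.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class NumberingLevelSketch:
    """One list level: numbering format, starting value, and level text."""
    format: str   # assumed to hold values such as "decimal" or "bullet"
    start: int
    text: str     # assumed to hold a pattern such as "%1."


@dataclass
class NumberingInfoSketch:
    """One list definition, with its levels keyed by level index."""
    num_id: int
    abstract_num_id: int
    levels: Dict[int, NumberingLevelSketch] = field(default_factory=dict)


def first_marker(info: NumberingInfoSketch, ilvl: int) -> str:
    """Render the first marker of a decimal level by filling its own placeholder."""
    level = info.levels[ilvl]
    return level.text.replace(f"%{ilvl + 1}", str(level.start))


info = NumberingInfoSketch(
    num_id=1,
    abstract_num_id=0,
    levels={0: NumberingLevelSketch(format="decimal", start=5, text="%1.")},
)
print(first_marker(info, 0))
```

The helper only handles decimal levels that reference their own placeholder; other formats such as roman numerals or letters would need a conversion step.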

Summary

Parser            Purpose
NumberingParser   Extract list numbering definitions
StyleParser       Parse style names into structured information
