How to Extract Document Structure with Parsers in Python
Aspose.Words FOSS for Python provides parser classes for extracting structured data from DOCX documents. This guide covers NumberingParser for list numbering and StyleParser for document styles.
Prerequisites
Install the library:
pip install "aspose-words-foss>=26.4.0"
Requires Python 3.10 or later. The quotes keep the shell from treating >= as an output redirection.
Numbering Parser
NumberingParser reads list numbering definitions from a DOCX package. After calling parse_numbering_part(), you can query list properties:
- NumberingParser.get_list_info(): retrieve information about a specific list by its ID
- NumberingParser.is_ordered_list(): check whether a list level is ordered or bulleted
- NumberingParser.get_start_value(): get the starting number for a list level
- NumberingParser.get_delimiter(): get the delimiter string for a list level
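To make the data behind these queries concrete, the following is a minimal sketch that reads list definitions straight from WordprocessingML using only the standard library. It does not call NumberingParser, and it is not the library's implementation; the names `read_numbering`, `is_ordered`, and the sample XML are hypothetical and exist only for illustration. It assumes the usual layout in which each `w:num` element points to a `w:abstractNum` that carries the per-level format, start value, and level text.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the numbering part of a DOCX package.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

# Hand-written sample (an assumption for this sketch). A real document usually
# stores this part as word/numbering.xml inside the DOCX zip archive.
SAMPLE_NUMBERING_XML = f"""<w:numbering xmlns:w="{W_NS}">
  <w:abstractNum w:abstractNumId="0">
    <w:lvl w:ilvl="0">
      <w:start w:val="1"/>
      <w:numFmt w:val="decimal"/>
      <w:lvlText w:val="%1."/>
    </w:lvl>
    <w:lvl w:ilvl="1">
      <w:start w:val="1"/>
      <w:numFmt w:val="bullet"/>
      <w:lvlText w:val="-"/>
    </w:lvl>
  </w:abstractNum>
  <w:num w:numId="1">
    <w:abstractNumId w:val="0"/>
  </w:num>
</w:numbering>"""


def _val(parent, tag):
    """Return the w:val attribute of a direct child element, or None if absent."""
    child = parent.find(f"w:{tag}", NS)
    return None if child is None else child.get(f"{{{W_NS}}}val")


def read_numbering(xml_text):
    """Map each list ID (w:numId) to its abstract definition and levels."""
    root = ET.fromstring(xml_text)

    # First pass: collect the level definitions of every abstract numbering.
    abstract_levels = {}
    for abstract in root.findall("w:abstractNum", NS):
        levels = {}
        for lvl in abstract.findall("w:lvl", NS):
            start = _val(lvl, "start")
            levels[int(lvl.get(f"{{{W_NS}}}ilvl"))] = {
                "format": _val(lvl, "numFmt"),
                "start": None if start is None else int(start),
                "text": _val(lvl, "lvlText"),
            }
        abstract_levels[abstract.get(f"{{{W_NS}}}abstractNumId")] = levels

    # Second pass: resolve each concrete list to its abstract definition.
    lists = {}
    for num in root.findall("w:num", NS):
        abstract_id = _val(num, "abstractNumId")
        lists[num.get(f"{{{W_NS}}}numId")] = {
            "abstract_num_id": abstract_id,
            "levels": abstract_levels.get(abstract_id, {}),
        }
    return lists


def is_ordered(level):
    """Treat every format except 'bullet' as ordered (a simplification)."""
    return level["format"] != "bullet"


lists = read_numbering(SAMPLE_NUMBERING_XML)
for ilvl, level in lists["1"]["levels"].items():
    print(ilvl, level["format"], level["start"], level["text"], is_ordered(level))
```

The dictionary keys mirror the property names listed in the data model (format, start, text), so the sketch can serve as a mental model for what the parser's query methods return, even though the actual return types may differ.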
Style Parser
StyleParser parses style names into structured ParsedStyle objects, identifying headings, blockquotes, code blocks, and list paragraphs:
- StyleParser.parse(): parse a style name into a ParsedStyle object
- StyleParser.get_style_chain(): parse a chain of style names for inherited styles
- StyleParser.is_setext_heading(): check if a style is a Setext-style heading
- StyleParser.extract_all_styles(): extract individual style names from a comma-separated chain
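The idea of turning style names into structured information can be illustrated with a small standalone sketch. It does not use StyleParser, and the library's actual rules may differ; the helpers `split_style_chain`, `classify_style`, and `could_be_setext` are hypothetical. The sketch assumes Word's built-in English style names ("Heading 1", "Quote", "List Paragraph") and relies on the Markdown rule that Setext headings exist only for levels 1 and 2.

```python
import re

# Matches built-in heading style names such as "Heading 1" or "heading 3".
HEADING_RE = re.compile(r"^heading\s*(\d)$", re.IGNORECASE)


def split_style_chain(chain):
    """Split a comma-separated style chain into trimmed, non-empty names."""
    return [name.strip() for name in chain.split(",") if name.strip()]


def classify_style(name):
    """Return a (kind, heading_level) tuple for a single style name."""
    cleaned = name.strip()
    match = HEADING_RE.match(cleaned)
    if match:
        return ("heading", int(match.group(1)))
    lowered = cleaned.lower()
    if lowered in ("quote", "intense quote"):
        return ("blockquote", None)
    if lowered == "list paragraph":
        return ("list", None)
    # Anything unrecognized is treated as a plain paragraph in this sketch.
    return ("paragraph", None)


def could_be_setext(kind, level):
    """Setext syntax in Markdown only covers heading levels 1 and 2."""
    return kind == "heading" and level in (1, 2)


for style in split_style_chain("Heading 2, List Paragraph, Normal"):
    kind, level = classify_style(style)
    print(style, kind, level, could_be_setext(kind, level))
```

Returning a small tuple keeps the sketch short; a structured result object such as ParsedStyle would typically carry the same information as named fields.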
Numbering Data Model
Parsed numbering data is stored in structured objects:
| Class | Key Properties |
|---|---|
| NumberingInfo | num_id, abstract_num_id, levels |
| NumberingLevel | format, start, text |
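As a rough picture of how these properties fit together, the sketch below models them with dataclasses. The class names `ExampleNumberingLevel` and `ExampleNumberingInfo` are deliberately different from the library's classes, and the field types (integers for IDs, a dictionary keyed by level index) are assumptions; only the property names come from the table above.

```python
from dataclasses import dataclass, field


@dataclass
class ExampleNumberingLevel:
    # Property names follow the table; the types are assumptions.
    format: str  # e.g. "decimal" or "bullet"
    start: int   # first number used at this level
    text: str    # level text pattern, e.g. "%1."


@dataclass
class ExampleNumberingInfo:
    num_id: int
    abstract_num_id: int
    # Assumed to be keyed by zero-based level index.
    levels: dict[int, ExampleNumberingLevel] = field(default_factory=dict)


info = ExampleNumberingInfo(
    num_id=1,
    abstract_num_id=0,
    levels={0: ExampleNumberingLevel(format="decimal", start=1, text="%1.")},
)
top_level = info.levels[0]
print(info.num_id, top_level.format, top_level.start, top_level.text)
```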
Summary
| Parser | Purpose |
|---|---|
| NumberingParser | Extract list numbering definitions |
| StyleParser | Parse style names into structured information |