How to Traverse the OneNote Document DOM in Python

Aspose.Note FOSS for Python represents a OneNote section file as a tree of typed Python objects. Understanding how to traverse this tree efficiently is the foundation for all content extraction tasks. This guide covers the three traversal approaches: GetChildNodes, direct iteration, and DocumentVisitor.

The Document Object Model

The OneNote DOM is a strict tree:

Document
  ├── Page
  │     ├── Title
  │     │     ├── TitleText (RichText)
  │     │     ├── TitleDate (RichText)
  │     │     └── TitleTime (RichText)
  │     └── Outline
  │           └── OutlineElement
  │                 ├── RichText
  │                 ├── Image
  │                 ├── AttachedFile
  │                 └── Table
  │                       └── TableRow
  │                             └── TableCell
  │                                   └── RichText / Image
  └── Page  (next page ...)

Every node inherits from Node. Nodes that can contain children inherit from CompositeNode.
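
To make the two base classes concrete, the sketch below models them with stand-in classes. MiniNode and MiniComposite are hypothetical names invented for this illustration, not the library's classes; they mirror only the documented idea that every node knows its parent and that only composite nodes hold children.

```python
class MiniNode:
    """Stand-in for Node: a leaf that knows its parent."""

    def __init__(self, name):
        self.name = name
        self.ParentNode = None  # set when the node is attached to a parent


class MiniComposite(MiniNode):
    """Stand-in for CompositeNode: a node that also owns children."""

    def __init__(self, name):
        super().__init__(name)
        self._children = []

    def AppendChildLast(self, child):
        child.ParentNode = self
        self._children.append(child)
        return child  # returning the child is a convenience of this sketch

    def __iter__(self):
        # Iterating a composite yields its direct children only.
        return iter(self._children)


# Build one branch of the tree shown above.
doc = MiniComposite("Document")
page = doc.AppendChildLast(MiniComposite("Page"))
outline = page.AppendChildLast(MiniComposite("Outline"))
element = outline.AppendChildLast(MiniComposite("OutlineElement"))
text = element.AppendChildLast(MiniNode("RichText"))

print([child.name for child in doc])  # ['Page']
print(text.ParentNode.name)           # OutlineElement
```

The leaf has no child list at all, which is why only composite nodes can be iterated or searched.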

Method 1: GetChildNodes (recursive, filtered by type)

CompositeNode.GetChildNodes(Type) performs a recursive depth-first search of the entire subtree and returns a flat list of every node that matches the given type. It is the most convenient approach for content extraction:

from aspose.note import Document, RichText, Image, Table, AttachedFile

doc = Document("MyNotes.one")

# All RichText nodes anywhere in the document
texts = doc.GetChildNodes(RichText)
print(f"RichText nodes: {len(texts)}")

# All images
images = doc.GetChildNodes(Image)
print(f"Image nodes: {len(images)}")

# All tables
tables = doc.GetChildNodes(Table)
print(f"Table nodes: {len(tables)}")

# All attachments
attachments = doc.GetChildNodes(AttachedFile)
print(f"AttachedFile nodes: {len(attachments)}")
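
For intuition about what a recursive, type-filtered, depth-first search does, here is a minimal sketch over stand-in classes. The collect function and the Box, Text, and Picture classes are hypothetical and are not how the library implements GetChildNodes; in particular, whether the real method ever includes the starting node itself is not stated in this guide, and this sketch excludes it.

```python
def collect(node, node_type, found=None):
    """Return every descendant of `node` that is an instance of `node_type`,
    in depth-first order. The starting node itself is not included."""
    if found is None:
        found = []
    for child in getattr(node, "children", []):
        if isinstance(child, node_type):
            found.append(child)
        collect(child, node_type, found)  # descend before moving to the next sibling
    return found


class Text:
    def __init__(self, text):
        self.text = text


class Picture:
    pass


class Box:
    def __init__(self, *children):
        self.children = list(children)


tree = Box(Text("a"), Box(Picture(), Text("b")), Text("c"))
print([t.text for t in collect(tree, Text)])  # ['a', 'b', 'c']
```

The nested "b" appears before "c" because the search finishes each subtree before visiting the next sibling.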

To restrict the search to a single page, call GetChildNodes on a Page instead of the Document:

from aspose.note import Document, Page, RichText

doc = Document("MyNotes.one")
for page in doc.GetChildNodes(Page):
    page_texts = page.GetChildNodes(RichText)
    print(f"  Page has {len(page_texts)} text nodes")
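
A per-page search is a natural basis for extracting plain text page by page. The helper below is a sketch that relies only on GetChildNodes and the Text property used in this guide; page_text is a hypothetical name, and FakeText and FakePage are stand-ins so the example runs without a .one file.

```python
def page_text(page, rich_text_type, separator="\n"):
    """Join the non-empty text of every rich-text node found under `page`."""
    parts = []
    for rt in page.GetChildNodes(rich_text_type):
        text = (rt.Text or "").strip()
        if text:  # skip nodes that hold only whitespace
            parts.append(text)
    return separator.join(parts)


# Stand-ins for demonstration only.
class FakeText:
    def __init__(self, text):
        self.Text = text


class FakePage:
    def __init__(self, nodes):
        self._nodes = nodes

    def GetChildNodes(self, node_type):
        return [n for n in self._nodes if isinstance(n, node_type)]


demo = FakePage([FakeText(" Hello "), FakeText("   "), FakeText("World")])
print(page_text(demo, FakeText))
```

With the library itself, the equivalent call would presumably be page_text(page, RichText) for each page returned by doc.GetChildNodes(Page).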

Method 2: Direct child iteration

A for child in node loop iterates over only the direct children of a CompositeNode. Use it when you need one specific level of the hierarchy:

from aspose.note import Document

doc = Document("MyNotes.one")

# Direct children of Document are Pages
for page in doc:
    title = (
        page.Title.TitleText.Text
        if page.Title and page.Title.TitleText
        else "(untitled)"
    )
    print(f"Page: {title}")
    # Direct children of Page are Outlines (and optionally Title)
    for child in page:
        print(f"  {type(child).__name__}")
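
Direct iteration can also be nested by hand when the hierarchy itself matters, not just a flat list of matches. The sketch below records an indented outline of a tree; dump_tree is a hypothetical helper, and it assumes that leaf nodes are not iterable, which this guide does not confirm for the library. Branch and Leaf are stand-ins for demonstration.

```python
def dump_tree(node, depth=0, lines=None):
    """Return one indented line per node, preserving parent/child nesting."""
    if lines is None:
        lines = []
    lines.append("  " * depth + type(node).__name__)
    try:
        children = list(node)  # composite nodes yield their direct children
    except TypeError:
        children = []          # assumption: leaf nodes are not iterable
    for child in children:
        dump_tree(child, depth + 1, lines)
    return lines


# Stand-ins for demonstration only.
class Leaf:
    pass


class Branch:
    def __init__(self, *kids):
        self._kids = kids

    def __iter__(self):
        return iter(self._kids)


print("\n".join(dump_tree(Branch(Leaf(), Branch(Leaf())))))
```

Unlike a type-filtered search, this keeps the depth of every node, at the cost of writing the recursion yourself.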

Method 3: DocumentVisitor

DocumentVisitor implements the visitor pattern for structured traversal. Override only the VisitXxxStart/End methods you need. The visitor is dispatched by calling doc.Accept(visitor):

from aspose.note import (
    Document, DocumentVisitor, Page, Title,
    Outline, OutlineElement, RichText, Image,
)

class StructurePrinter(DocumentVisitor):
    def __init__(self):
        self._depth = 0

    def _indent(self):
        return "  " * self._depth

    def VisitPageStart(self, page: Page) -> None:
        t = page.Title.TitleText.Text if page.Title and page.Title.TitleText else "(untitled)"
        print(f"{self._indent()}Page: {t!r}")
        self._depth += 1

    def VisitPageEnd(self, page: Page) -> None:
        self._depth -= 1

    def VisitOutlineStart(self, outline: Outline) -> None:
        self._depth += 1

    def VisitOutlineEnd(self, outline: Outline) -> None:
        self._depth -= 1

    def VisitRichTextStart(self, rt: RichText) -> None:
        if rt.Text.strip():
            print(f"{self._indent()}Text: {rt.Text.strip()!r}")

    def VisitImageStart(self, img: Image) -> None:
        print(f"{self._indent()}Image: {img.FileName!r} ({img.Width}x{img.Height}pts)")

doc = Document("MyNotes.one")
doc.Accept(StructurePrinter())
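
The depth counter above works because each Start call is paired with an End call after all of the node's children have been visited. The sketch below shows that ordering with stand-in classes; MiniVisitor, Box, Leaf, and Trace are hypothetical and illustrate the general visitor pattern, not the library's internal dispatch.

```python
class MiniVisitor:
    """Base visitor: every hook is a no-op, so subclasses override only what they need."""

    def VisitBoxStart(self, box): pass
    def VisitBoxEnd(self, box): pass
    def VisitLeafStart(self, leaf): pass
    def VisitLeafEnd(self, leaf): pass


class Leaf:
    def __init__(self, name):
        self.name = name

    def Accept(self, visitor):
        visitor.VisitLeafStart(self)
        visitor.VisitLeafEnd(self)


class Box:
    def __init__(self, name, *children):
        self.name = name
        self.children = children

    def Accept(self, visitor):
        visitor.VisitBoxStart(self)   # before any child
        for child in self.children:
            child.Accept(visitor)     # each child dispatches to its own hooks
        visitor.VisitBoxEnd(self)     # after the last child


class Trace(MiniVisitor):
    def __init__(self):
        self.events = []

    def VisitBoxStart(self, box):
        self.events.append(f"start {box.name}")

    def VisitBoxEnd(self, box):
        self.events.append(f"end {box.name}")

    def VisitLeafStart(self, leaf):
        self.events.append(f"leaf {leaf.name}")


trace = Trace()
Box("page", Leaf("a"), Box("outline", Leaf("b"))).Accept(trace)
print(trace.events)
```

Because Start and End calls nest like brackets, incrementing a counter in Start and decrementing it in End always yields the current depth.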

Available Visitor Methods

Method pair	Node type
`VisitDocumentStart/End`	`Document`
`VisitPageStart/End`	`Page`
`VisitTitleStart/End`	`Title`
`VisitOutlineStart/End`	`Outline`
`VisitOutlineElementStart/End`	`OutlineElement`
`VisitRichTextStart/End`	`RichText`
`VisitImageStart/End`	`Image`

Navigating Up the Tree

Every node exposes ParentNode and Document properties for upward navigation:

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
for rt in doc.GetChildNodes(RichText):
    parent = rt.ParentNode   # OutlineElement, TableCell, Title, etc.
    root = rt.Document       # always the Document root
    print(f"  {rt.Text.strip()!r} parent={type(parent).__name__}")
    break
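
Following ParentNode repeatedly gives the full chain of ancestors for any node. The helpers below are a sketch; ancestors and path_from_root are hypothetical names, and the loop assumes that the root's ParentNode is None, which this guide does not state explicitly. The Demo classes are stand-ins so the example runs on its own.

```python
def ancestors(node):
    """Yield the parent, grandparent, and so on, nearest first."""
    current = getattr(node, "ParentNode", None)
    while current is not None:  # assumption: the root has ParentNode None
        yield current
        current = getattr(current, "ParentNode", None)


def path_from_root(node):
    """Describe a node's position as 'Root > ... > Node' using type names."""
    names = [type(n).__name__ for n in ancestors(node)]
    names.reverse()
    names.append(type(node).__name__)
    return " > ".join(names)


# Stand-ins for demonstration only.
class DemoNode:
    def __init__(self, parent=None):
        self.ParentNode = parent


class DemoDocument(DemoNode): pass
class DemoPage(DemoNode): pass
class DemoText(DemoNode): pass


root = DemoDocument()
leaf = DemoText(DemoPage(root))
print(path_from_root(leaf))  # DemoDocument > DemoPage > DemoText
```

A path like this is handy in log output when you need to tell a RichText inside a Title from one inside a TableCell.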

Child Node Management Methods

CompositeNode also exposes in-memory child management (useful for building documents programmatically, although writing back to .one is not supported):

Member	Description
`node.FirstChild`	First direct child, or `None`
`node.LastChild`	Last direct child, or `None`
`node.AppendChildLast(child)`	Append a child at the end
`node.AppendChildFirst(child)`	Add a child at the beginning
`node.InsertChild(index, child)`	Insert a child at the given position
`node.RemoveChild(child)`	Remove a child
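
These members can be combined into small in-memory edits. The sketch below moves a child to the front of its parent using only RemoveChild and AppendChildFirst from the table; move_to_front is a hypothetical helper, and whether the library allows a removed node to be re-attached this way is an assumption. DemoParent is a list-backed stand-in, not the library's CompositeNode.

```python
def move_to_front(parent, child):
    """Detach `child` from `parent` and re-attach it as the first child."""
    parent.RemoveChild(child)
    parent.AppendChildFirst(child)


# List-backed stand-in for demonstration only.
class DemoParent:
    def __init__(self, *children):
        self.items = list(children)

    def RemoveChild(self, child):
        self.items.remove(child)

    def AppendChildFirst(self, child):
        self.items.insert(0, child)

    @property
    def FirstChild(self):
        return self.items[0] if self.items else None

    @property
    def LastChild(self):
        return self.items[-1] if self.items else None


demo = DemoParent("a", "b", "c")
move_to_front(demo, "c")
print(demo.items)                       # ['c', 'a', 'b']
print(demo.FirstChild, demo.LastChild)  # c b
```

Any such change stays in memory only, since writing back to .one is not supported.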

Counting Nodes with a Visitor

from aspose.note import Document, DocumentVisitor, Page, RichText, Image

class Counter(DocumentVisitor):
    def __init__(self):
        self.pages = self.texts = self.images = 0

    def VisitPageStart(self, page: Page) -> None:
        self.pages += 1

    def VisitRichTextStart(self, rt: RichText) -> None:
        self.texts += 1

    def VisitImageStart(self, img: Image) -> None:
        self.images += 1

doc = Document("MyNotes.one")
c = Counter()
doc.Accept(c)
print(f"Pages={c.pages}  RichText={c.texts}  Images={c.images}")

Choosing the Right Traversal Method

Scenario	Best approach
Find every node of one type (for example, all RichText nodes)	`GetChildNodes(RichText)`
Iterate over direct children only	`for child in node`
Walk the tree with context (depth, parent state)	`DocumentVisitor`
Navigate from content up to its parent node or the root	`node.ParentNode` / `node.Document`

Related resources: