
Use this skill any time a .pptx file is involved in any way, as input, output, or both. This includes: creating slide decks, pitch decks, or presentations; reading, parsing, or extracting text from any .pptx file (even if the extracted content will be used elsewhere, such as in an email or summary); editing, modifying, or updating existing presentations; combining or splitting slide files; working with templates, layouts, speaker notes, or comments. Trigger whenever the user mentions "deck", "slides", or "presentation", or references a .pptx filename, regardless of what they plan to do with the content afterward. If a .pptx file needs to be opened, created, or edited, use this skill.

Source: Content adapted from anthropics/skills (MIT).

Quick Reference

Task	Guide
Read/analyze content	`python -m markitdown presentation.pptx`
Edit or create from template	Read editing.md
Create from scratch	Read pptxgenjs.md

Reading Content

# Text extraction
python -m markitdown presentation.pptx

# Visual overview
python scripts/thumbnail.py presentation.pptx

# Raw XML
python scripts/office/unpack.py presentation.pptx unpacked/

Editing Workflow

See editing.md for full details.

Analyze the template with thumbnail.py
Unpack -> manipulate slides -> edit content -> clean -> pack

Creating from Scratch

Read pptxgenjs.md for full details.

Use this option when no template or reference presentation is available.

Design Ideas

Don't create boring slides. Plain bullets on a white background won't impress anyone. For each slide, consider ideas from this list.

Before Starting

Pick a bold, content-informed color palette: The palette should feel designed for THIS topic. If swapping your colors into a completely different presentation would still "work," you haven't made specific enough choices.
Dominance over equality: One color should dominate (60-70% visual weight), with 1-2 supporting tones and one sharp accent. Never give all colors equal weight.
Dark/light contrast: Dark backgrounds for title and conclusion slides, light for content (a "sandwich" structure). Or commit to dark throughout for a premium feel.
Commit to a visual motif: Pick ONE distinctive element and repeat it, such as rounded image frames, icons in colored circles, or thick single-side borders. Carry it across every slide.

Color Palettes

Choose colors that match your topic; don't default to generic blue. Use these palettes as inspiration:

Theme	Primary	Secondary	Accent
Midnight Executive	`1E2761` (navy)	`CADCFC` (ice blue)	`FFFFFF` (white)
Forest & Moss	`2C5F2D` (forest)	`97BC62` (moss)	`F5F5F5` (cream)
Coral Energy	`F96167` (coral)	`F9E795` (gold)	`2F3C7E` (navy)
Warm Terracotta	`B85042` (terracotta)	`E7E8D1` (sand)	`A7BEAE` (sage)
Ocean Gradient	`065A82` (deep blue)	`1C7293` (teal)	`21295C` (midnight)
Charcoal Minimal	`36454F` (charcoal)	`F2F2F2` (off-white)	`212121` (black)
Teal Trust	`028090` (teal)	`00A896` (seafoam)	`02C39A` (mint)
Berry & Cream	`6D2E46` (berry)	`A26769` (dusty rose)	`ECE2D0` (cream)
Sage Calm	`84B59F` (sage)	`69A297` (eucalyptus)	`50808E` (slate)
Cherry Bold	`990011` (cherry)	`FCF6F5` (off-white)	`2F3C7E` (navy)

For Each Slide

Every slide needs a visual element: an image, chart, icon, or shape. Text-only slides are forgettable.

Layout options:

Two-column (text left, illustration right)
Icon + text rows (icon in a colored circle, bold header, description below)
2x2 or 2x3 grid (image on one side, grid of content blocks on the other)
Half-bleed image (full left or right side) with content overlay

Data display:

Large stat callouts (big numbers at 60-72pt with small labels below)
Comparison columns (before/after, pros/cons, side-by-side options)
Timeline or process flow (numbered steps, arrows)

Visual polish:

Icons in small colored circles next to section headers
Italic accent text for key stats or taglines

Typography

Choose an interesting font pairing; don't default to Arial. Pick a header font with personality and pair it with a clean body font.

Header Font	Body Font
Georgia	Calibri
Arial Black	Arial
Calibri	Calibri Light
Cambria	Calibri
Trebuchet MS	Calibri
Impact	Arial
Palatino	Garamond
Consolas	Calibri

Element	Size
Slide title	36-44pt bold
Section header	20-24pt bold
Body text	14-16pt
Captions	10-12pt muted

Spacing

0.5" minimum margins
0.3-0.5" between content blocks
Leave breathing room; don't fill every inch
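
The spacing rules turn into simple arithmetic when positioning columns. The sketch below is illustrative only: `column_boxes` and `inches_to_emu` are hypothetical helpers, the 10-inch slide width in the example is an assumption (the text does not fix a slide size), and the defaults are the 0.5" margin and 0.3" gap from the list above. The EMU conversion (914,400 per inch) matters only when editing raw XML.

```python
EMU_PER_INCH = 914400  # English Metric Units used in Office Open XML


def column_boxes(slide_width: float, columns: int,
                 margin: float = 0.5, gap: float = 0.3) -> list[tuple[float, float]]:
    """Return (x, width) in inches for equally wide columns."""
    if columns < 1:
        raise ValueError("columns must be >= 1")
    usable = slide_width - 2 * margin - gap * (columns - 1)
    if usable <= 0:
        raise ValueError("margins and gaps leave no room for content")
    width = usable / columns
    # Round to suppress floating-point noise in the reported positions.
    return [(round(margin + i * (width + gap), 4), round(width, 4))
            for i in range(columns)]


def inches_to_emu(inches: float) -> int:
    """Convert inches to EMU for use in raw slide XML."""
    return round(inches * EMU_PER_INCH)


print(column_boxes(10.0, 2))   # [(0.5, 4.35), (5.15, 4.35)]
print(inches_to_emu(0.5))      # 457200
```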

Avoid (Common Mistakes)

Don't repeat the same layout: vary columns, cards, and callouts across slides
Don't center body text: left-align paragraphs and lists; center only titles
Don't skimp on size contrast: titles need 36pt+ to stand out from 14-16pt body text
Don't default to blue: pick colors that reflect the specific topic
Don't mix spacing randomly: choose 0.3" or 0.5" gaps and use them consistently
Don't style one slide and leave the rest plain: commit fully or keep it simple throughout
Don't create text-only slides: add images, icons, charts, or other visual elements; avoid plain title + bullets
Don't forget text box padding: when aligning lines or shapes with text edges, set margin: 0 on the text box or offset the shape to account for the padding
Don't use low-contrast elements: icons AND text need strong contrast against the background; avoid light text on light backgrounds or dark text on dark backgrounds
NEVER use accent lines under titles: these are a hallmark of AI-generated slides; use whitespace or background color instead
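
The low-contrast rule can be checked numerically before rendering. The sketch below uses the WCAG relative-luminance and contrast-ratio formulas, which are an outside technique brought in for illustration and not part of this guide; the function names are hypothetical. A commonly cited WCAG threshold for body text is 4.5:1.

```python
def _linear(channel: int) -> float:
    """Convert an 8-bit sRGB channel to linear light."""
    c = channel / 255
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a 6-digit hex color such as '1E2761'."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linear(r) + 0.7152 * _linear(g) + 0.0722 * _linear(b)


def contrast_ratio(fg: str, bg: str) -> float:
    """Contrast ratio between two colors, from 1 (none) to 21 (max)."""
    lighter, darker = sorted(
        (relative_luminance(fg), relative_luminance(bg)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)


# White text on the "Midnight Executive" navy passes comfortably;
# ice blue on white does not.
print(round(contrast_ratio("FFFFFF", "1E2761"), 1))
print(round(contrast_ratio("CADCFC", "FFFFFF"), 1))
```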

QA (Required)

Assume there are problems. Your job is to find them.

Your first render is almost never correct. Approach QA as a bug hunt, not a confirmation step. If you found zero issues on first inspection, you weren't looking hard enough.

Content QA

python -m markitdown output.pptx

Check for missing content, typos, and wrong order.

When using templates, check for leftover placeholder text:

python -m markitdown output.pptx | grep -iE "xxxx|lorem|ipsum|this.*(page|slide).*layout"

If grep returns results, fix them before declaring success.
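
Where grep is unavailable, the same placeholder check can be done in Python. This is a minimal sketch using the identical pattern with case-insensitive matching; `placeholder_lines` is a hypothetical helper, and the sample text stands in for real markitdown output.

```python
import re

# Same pattern as the grep command, matched case-insensitively.
PLACEHOLDER_RE = re.compile(
    r"xxxx|lorem|ipsum|this.*(page|slide).*layout", re.IGNORECASE
)


def placeholder_lines(text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) for every line that looks like placeholder text."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if PLACEHOLDER_RE.search(line)
    ]


sample = (
    "Quarterly Review\n"
    "Lorem ipsum dolor sit amet\n"
    "This slide layout has two columns\n"
    "Revenue grew 12%"
)
for number, line in placeholder_lines(sample):
    print(f"{number}: {line}")
```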

Visual QA

USE SUBAGENTS, even for 2-3 slides. You've been staring at the code and will see what you expect, not what's there. Subagents have fresh eyes.

Convert slides to images (see Converting to Images), then use this prompt:

Visually inspect these slides. Assume there are issues - find them.

Look for:
- Overlapping elements (text through shapes, lines through words, stacked elements)
- Text overflow or cut off at edges/box boundaries
- Decorative lines positioned for single-line text but title wrapped to two lines
- Source citations or footers colliding with content above
- Elements too close (< 0.3" gaps) or cards/sections nearly touching
- Uneven gaps (large empty area in one place, cramped in another)
- Insufficient margin from slide edges (< 0.5")
- Columns or similar elements not aligned consistently
- Low-contrast text (e.g., light gray text on cream-colored background)
- Low-contrast icons (e.g., dark icons on dark backgrounds without a contrasting circle)
- Text boxes too narrow causing excessive wrapping
- Leftover placeholder content

For each slide, list issues or areas of concern, even if minor.

Read and analyze these images:
1. /path/to/slide-01.jpg (Expected: [brief description])
2. /path/to/slide-02.jpg (Expected: [brief description])

Report ALL issues found, including minor ones.

Verification Loop

Generate slides -> Convert to images -> Inspect
List the issues found (if none are found, look again more critically)
Fix the issues
Re-verify affected slides; one fix often creates another problem
Repeat until a full pass reveals no new issues

Do not declare success until you've completed at least one fix-and-verify cycle.
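
The control flow of this loop can be sketched as follows. Both callables are hypothetical stand-ins (one for rendering and inspecting the deck, one for applying fixes), and `max_rounds` is an added safety limit that the text does not mention. A return value of 0 means the first inspection found nothing, which by the rule above is a cue to inspect again more critically, not a success.

```python
from typing import Callable


def verification_loop(
    inspect: Callable[[], list[str]],
    fix: Callable[[list[str]], None],
    max_rounds: int = 5,
) -> int:
    """Alternate inspect and fix until an inspection comes back clean.

    Returns the number of fix rounds performed.
    """
    rounds = 0
    while True:
        issues = inspect()
        if not issues:
            return rounds
        if rounds == max_rounds:
            raise RuntimeError(f"still failing after {max_rounds} fix rounds: {issues}")
        fix(issues)
        rounds += 1


# Stubbed example: two inspections report issues, the third is clean.
reports = [["title overflows"], ["footer collides with chart"], []]
fixed: list[str] = []
rounds = verification_loop(lambda: reports.pop(0), fixed.extend)
print(rounds, fixed)
```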

Converting to Images

Convert presentations to individual slide images for visual inspection:

python scripts/office/soffice.py --headless --convert-to pdf output.pptx
pdftoppm -jpeg -r 150 output.pdf slide

This creates slide-01.jpg, slide-02.jpg, etc.

To re-render specific slides after fixes:

pdftoppm -jpeg -r 150 -f N -l N output.pdf slide-fixed

Dependencies

pip install "markitdown[pptx]" - text extraction
pip install Pillow - thumbnail grids
npm install -g pptxgenjs - creating from scratch
LibreOffice (soffice) - PDF conversion (auto-configured for sandboxed environments via scripts/office/soffice.py)
Poppler (pdftoppm) - PDF to images

Resource Files

LICENSE.txt

editing.md

pptxgenjs.md

scripts/__init__.py

scripts/add_slide.py

scripts/clean.py

scripts/office/helpers/__init__.py

scripts/office/helpers/merge_runs.py

"""Merge adjacent runs with identical formatting in DOCX.

Merges adjacent <w:r> elements that have identical <w:rPr> properties.
Works on runs in paragraphs and inside tracked changes (<w:ins>, <w:del>).

Also:
- Removes rsid attributes from runs (revision metadata that doesn't affect rendering)
- Removes proofErr elements (spell/grammar markers that block merging)
"""

from pathlib import Path

import defusedxml.minidom


def merge_runs(input_dir: str) -> tuple[int, str]:
    doc_xml = Path(input_dir) / "word" / "document.xml"

    if not doc_xml.exists():
        return 0, f"Error: {doc_xml} not found"

    try:
        dom = defusedxml.minidom.parseString(doc_xml.read_text(encoding="utf-8"))
        root = dom.documentElement

        _remove_elements(root, "proofErr")
        _strip_run_rsid_attrs(root)

        containers = {run.parentNode for run in _find_elements(root, "r")}

        merge_count = 0
        for container in containers:
            merge_count += _merge_runs_in(container)

        doc_xml.write_bytes(dom.toxml(encoding="UTF-8"))
        return merge_count, f"Merged {merge_count} runs"

    except Exception as e:
        return 0, f"Error: {e}"




def _find_elements(root, tag: str) -> list:
    results = []

    def traverse(node):
        if node.nodeType == node.ELEMENT_NODE:
            name = node.localName or node.tagName
            if name == tag or name.endswith(f":{tag}"):
                results.append(node)
            for child in node.childNodes:
                traverse(child)

    traverse(root)
    return results


def _get_child(parent, tag: str):
    for child in parent.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            name = child.localName or child.tagName
            if name == tag or name.endswith(f":{tag}"):
                return child
    return None


def _get_children(parent, tag: str) -> list:
    results = []
    for child in parent.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            name = child.localName or child.tagName
            if name == tag or name.endswith(f":{tag}"):
                results.append(child)
    return results


def _is_adjacent(elem1, elem2) -> bool:
    node = elem1.nextSibling
    while node:
        if node == elem2:
            return True
        if node.nodeType == node.ELEMENT_NODE:
            return False
        if node.nodeType == node.TEXT_NODE and node.data.strip():
            return False
        node = node.nextSibling
    return False




def _remove_elements(root, tag: str):
    for elem in _find_elements(root, tag):
        if elem.parentNode:
            elem.parentNode.removeChild(elem)


def _strip_run_rsid_attrs(root):
    for run in _find_elements(root, "r"):
        for attr in list(run.attributes.values()):
            if "rsid" in attr.name.lower():
                run.removeAttribute(attr.name)




def _merge_runs_in(container) -> int:
    merge_count = 0
    run = _first_child_run(container)

    while run:
        while True:
            next_elem = _next_element_sibling(run)
            if next_elem and _is_run(next_elem) and _can_merge(run, next_elem):
                _merge_run_content(run, next_elem)
                container.removeChild(next_elem)
                merge_count += 1
            else:
                break

        _consolidate_text(run)
        run = _next_sibling_run(run)

    return merge_count


def _first_child_run(container):
    for child in container.childNodes:
        if child.nodeType == child.ELEMENT_NODE and _is_run(child):
            return child
    return None


def _next_element_sibling(node):
    sibling = node.nextSibling
    while sibling:
        if sibling.nodeType == sibling.ELEMENT_NODE:
            return sibling
        sibling = sibling.nextSibling
    return None


def _next_sibling_run(node):
    sibling = node.nextSibling
    while sibling:
        if sibling.nodeType == sibling.ELEMENT_NODE:
            if _is_run(sibling):
                return sibling
        sibling = sibling.nextSibling
    return None


def _is_run(node) -> bool:
    name = node.localName or node.tagName
    return name == "r" or name.endswith(":r")


def _can_merge(run1, run2) -> bool:
    rpr1 = _get_child(run1, "rPr")
    rpr2 = _get_child(run2, "rPr")

    if (rpr1 is None) != (rpr2 is None):
        return False
    if rpr1 is None:
        return True
    return rpr1.toxml() == rpr2.toxml()  


def _merge_run_content(target, source):
    for child in list(source.childNodes):
        if child.nodeType == child.ELEMENT_NODE:
            name = child.localName or child.tagName
            if name != "rPr" and not name.endswith(":rPr"):
                target.appendChild(child)


def _consolidate_text(run):
    t_elements = _get_children(run, "t")

    for i in range(len(t_elements) - 1, 0, -1):
        curr, prev = t_elements[i], t_elements[i - 1]

        if _is_adjacent(prev, curr):
            prev_text = prev.firstChild.data if prev.firstChild else ""
            curr_text = curr.firstChild.data if curr.firstChild else ""
            merged = prev_text + curr_text

            if prev.firstChild:
                prev.firstChild.data = merged
            else:
                prev.appendChild(run.ownerDocument.createTextNode(merged))

            if merged.startswith(" ") or merged.endswith(" "):
                prev.setAttribute("xml:space", "preserve")
            elif prev.hasAttribute("xml:space"):
                prev.removeAttribute("xml:space")

            run.removeChild(curr)
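
The merge criterion used above (two neighboring runs are mergeable when their serialized `<w:rPr>` is identical, or when both lack one) can be seen in isolation in this small sketch. It is not the module itself: `rpr_xml` and `mergeable_pairs` are hypothetical names, it uses the standard-library minidom on a trusted literal where the module uses defusedxml, and it assumes the paragraph contains nothing but runs.

```python
from xml.dom import minidom

XML = (
    '<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:r><w:rPr><w:b/></w:rPr><w:t>Hel</w:t></w:r>"
    "<w:r><w:rPr><w:b/></w:rPr><w:t>lo</w:t></w:r>"
    "<w:r><w:t> world</w:t></w:r>"
    "</w:p>"
)


def rpr_xml(run) -> str:
    """Serialized <w:rPr> of a run, or '' when the run has none."""
    for child in run.childNodes:
        if child.nodeType == child.ELEMENT_NODE and child.tagName == "w:rPr":
            return child.toxml()
    return ""


def mergeable_pairs(paragraph) -> list[tuple[int, int]]:
    """Index pairs of consecutive runs whose formatting is identical."""
    runs = [
        node for node in paragraph.childNodes
        if node.nodeType == node.ELEMENT_NODE and node.tagName == "w:r"
    ]
    return [
        (i, i + 1)
        for i in range(len(runs) - 1)
        if rpr_xml(runs[i]) == rpr_xml(runs[i + 1])
    ]


paragraph = minidom.parseString(XML).documentElement
print(mergeable_pairs(paragraph))  # [(0, 1)]
```

The two bold runs qualify; the third run has no `<w:rPr>`, so it stays separate.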

scripts/office/helpers/simplify_redlines.py

"""Simplify tracked changes by merging adjacent w:ins or w:del elements.

Merges adjacent <w:ins> elements from the same author into a single element.
Same for <w:del> elements. This makes heavily-redlined documents easier to
work with by reducing the number of tracked change wrappers.

Rules:
- Only merges w:ins with w:ins, w:del with w:del (same element type)
- Only merges if same author (ignores timestamp differences)
- Only merges if truly adjacent (only whitespace between them)
"""

import xml.etree.ElementTree as ET
import zipfile
from pathlib import Path

import defusedxml.minidom

WORD_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def simplify_redlines(input_dir: str) -> tuple[int, str]:
    doc_xml = Path(input_dir) / "word" / "document.xml"

    if not doc_xml.exists():
        return 0, f"Error: {doc_xml} not found"

    try:
        dom = defusedxml.minidom.parseString(doc_xml.read_text(encoding="utf-8"))
        root = dom.documentElement

        merge_count = 0

        containers = _find_elements(root, "p") + _find_elements(root, "tc")

        for container in containers:
            merge_count += _merge_tracked_changes_in(container, "ins")
            merge_count += _merge_tracked_changes_in(container, "del")

        doc_xml.write_bytes(dom.toxml(encoding="UTF-8"))
        return merge_count, f"Simplified {merge_count} tracked changes"

    except Exception as e:
        return 0, f"Error: {e}"


def _merge_tracked_changes_in(container, tag: str) -> int:
    merge_count = 0

    tracked = [
        child
        for child in container.childNodes
        if child.nodeType == child.ELEMENT_NODE and _is_element(child, tag)
    ]

    if len(tracked) < 2:
        return 0

    i = 0
    while i < len(tracked) - 1:
        curr = tracked[i]
        next_elem = tracked[i + 1]

        if _can_merge_tracked(curr, next_elem):
            _merge_tracked_content(curr, next_elem)
            container.removeChild(next_elem)
            tracked.pop(i + 1)
            merge_count += 1
        else:
            i += 1

    return merge_count


def _is_element(node, tag: str) -> bool:
    name = node.localName or node.tagName
    return name == tag or name.endswith(f":{tag}")


def _get_author(elem) -> str:
    author = elem.getAttribute("w:author")
    if not author:
        for attr in elem.attributes.values():
            if attr.localName == "author" or attr.name.endswith(":author"):
                return attr.value
    return author


def _can_merge_tracked(elem1, elem2) -> bool:
    if _get_author(elem1) != _get_author(elem2):
        return False

    node = elem1.nextSibling
    while node and node != elem2:
        if node.nodeType == node.ELEMENT_NODE:
            return False
        if node.nodeType == node.TEXT_NODE and node.data.strip():
            return False
        node = node.nextSibling

    return True


def _merge_tracked_content(target, source):
    while source.firstChild:
        child = source.firstChild
        source.removeChild(child)
        target.appendChild(child)


def _find_elements(root, tag: str) -> list:
    results = []

    def traverse(node):
        if node.nodeType == node.ELEMENT_NODE:
            name = node.localName or node.tagName
            if name == tag or name.endswith(f":{tag}"):
                results.append(node)
            for child in node.childNodes:
                traverse(child)

    traverse(root)
    return results


def get_tracked_change_authors(doc_xml_path: Path) -> dict[str, int]:
    if not doc_xml_path.exists():
        return {}

    try:
        tree = ET.parse(doc_xml_path)
        root = tree.getroot()
    except ET.ParseError:
        return {}

    namespaces = {"w": WORD_NS}
    author_attr = f"{{{WORD_NS}}}author"

    authors: dict[str, int] = {}
    for tag in ["ins", "del"]:
        for elem in root.findall(f".//w:{tag}", namespaces):
            author = elem.get(author_attr)
            if author:
                authors[author] = authors.get(author, 0) + 1

    return authors


def _get_authors_from_docx(docx_path: Path) -> dict[str, int]:
    try:
        with zipfile.ZipFile(docx_path, "r") as zf:
            if "word/document.xml" not in zf.namelist():
                return {}
            with zf.open("word/document.xml") as f:
                tree = ET.parse(f)
                root = tree.getroot()

                namespaces = {"w": WORD_NS}
                author_attr = f"{{{WORD_NS}}}author"

                authors: dict[str, int] = {}
                for tag in ["ins", "del"]:
                    for elem in root.findall(f".//w:{tag}", namespaces):
                        author = elem.get(author_attr)
                        if author:
                            authors[author] = authors.get(author, 0) + 1
                return authors
    except (zipfile.BadZipFile, ET.ParseError):
        return {}


def infer_author(modified_dir: Path, original_docx: Path, default: str = "Claude") -> str:
    modified_xml = modified_dir / "word" / "document.xml"
    modified_authors = get_tracked_change_authors(modified_xml)

    if not modified_authors:
        return default

    original_authors = _get_authors_from_docx(original_docx)

    new_changes: dict[str, int] = {}
    for author, count in modified_authors.items():
        original_count = original_authors.get(author, 0)
        diff = count - original_count
        if diff > 0:
            new_changes[author] = diff

    if not new_changes:
        return default

    if len(new_changes) == 1:
        return next(iter(new_changes))

    raise ValueError(
        f"Multiple authors added new changes: {new_changes}. "
        "Cannot infer which author to validate."
    )
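
The counting step inside `infer_author` reduces to a per-author difference between the two documents. As a minimal illustration with made-up author names and counts (`new_change_counts` is a hypothetical helper, not part of the module):

```python
def new_change_counts(original: dict[str, int], modified: dict[str, int]) -> dict[str, int]:
    """Authors whose tracked-change count grew, mapped to the increase."""
    return {
        author: count - original.get(author, 0)
        for author, count in modified.items()
        if count - original.get(author, 0) > 0
    }


# Alice's three changes were already present; only Claude added new ones.
print(new_change_counts({"Alice": 3}, {"Alice": 3, "Claude": 2}))  # {'Claude': 2}
```

With exactly one author in the result, that author is inferred; with none the default applies, and with several the function above raises `ValueError`.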

scripts/office/pack.py

"""Pack a directory into a DOCX, PPTX, or XLSX file.

Validates with auto-repair, condenses XML formatting, and creates the Office file.

Usage:
    python pack.py <input_directory> <output_file> [--original <file>] [--validate true|false]

Examples:
    python pack.py unpacked/ output.docx --original input.docx
    python pack.py unpacked/ output.pptx --validate false
"""

import argparse
import sys
import shutil
import tempfile
import zipfile
from pathlib import Path

import defusedxml.minidom

from validators import DOCXSchemaValidator, PPTXSchemaValidator, RedliningValidator

def pack(
    input_directory: str,
    output_file: str,
    original_file: str | None = None,
    validate: bool = True,
    infer_author_func=None,
) -> tuple[None, str]:
    input_dir = Path(input_directory)
    output_path = Path(output_file)
    suffix = output_path.suffix.lower()

    if not input_dir.is_dir():
        return None, f"Error: {input_dir} is not a directory"

    if suffix not in {".docx", ".pptx", ".xlsx"}:
        return None, f"Error: {output_file} must be a .docx, .pptx, or .xlsx file"

    if validate and original_file:
        original_path = Path(original_file)
        if original_path.exists():
            success, output = _run_validation(
                input_dir, original_path, suffix, infer_author_func
            )
            if output:
                print(output)
            if not success:
                return None, f"Error: Validation failed for {input_dir}"

    with tempfile.TemporaryDirectory() as temp_dir:
        temp_content_dir = Path(temp_dir) / "content"
        shutil.copytree(input_dir, temp_content_dir)

        for pattern in ["*.xml", "*.rels"]:
            for xml_file in temp_content_dir.rglob(pattern):
                _condense_xml(xml_file)

        output_path.parent.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(output_path, "w", zipfile.ZIP_DEFLATED) as zf:
            for f in temp_content_dir.rglob("*"):
                if f.is_file():
                    zf.write(f, f.relative_to(temp_content_dir))

    return None, f"Successfully packed {input_dir} to {output_file}"


def _run_validation(
    unpacked_dir: Path,
    original_file: Path,
    suffix: str,
    infer_author_func=None,
) -> tuple[bool, str | None]:
    output_lines = []
    validators = []

    if suffix == ".docx":
        author = "Claude"
        if infer_author_func:
            try:
                author = infer_author_func(unpacked_dir, original_file)
            except ValueError as e:
                print(f"Warning: {e} Using default author 'Claude'.", file=sys.stderr)

        validators = [
            DOCXSchemaValidator(unpacked_dir, original_file),
            RedliningValidator(unpacked_dir, original_file, author=author),
        ]
    elif suffix == ".pptx":
        validators = [PPTXSchemaValidator(unpacked_dir, original_file)]

    if not validators:
        return True, None

    total_repairs = sum(v.repair() for v in validators)
    if total_repairs:
        output_lines.append(f"Auto-repaired {total_repairs} issue(s)")

    success = all(v.validate() for v in validators)

    if success:
        output_lines.append("All validations PASSED!")

    return success, "\n".join(output_lines) if output_lines else None


def _condense_xml(xml_file: Path) -> None:
    try:
        with open(xml_file, encoding="utf-8") as f:
            dom = defusedxml.minidom.parse(f)

        for element in dom.getElementsByTagName("*"):
            if element.tagName.endswith(":t"):
                continue

            for child in list(element.childNodes):
                if (
                    child.nodeType == child.TEXT_NODE
                    and child.nodeValue
                    and child.nodeValue.strip() == ""
                ) or child.nodeType == child.COMMENT_NODE:
                    element.removeChild(child)

        xml_file.write_bytes(dom.toxml(encoding="UTF-8"))
    except Exception as e:
        print(f"ERROR: Failed to parse {xml_file.name}: {e}", file=sys.stderr)
        raise


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Pack a directory into a DOCX, PPTX, or XLSX file"
    )
    parser.add_argument("input_directory", help="Unpacked Office document directory")
    parser.add_argument("output_file", help="Output Office file (.docx/.pptx/.xlsx)")
    parser.add_argument(
        "--original",
        help="Original file for validation comparison",
    )
    parser.add_argument(
        "--validate",
        type=lambda x: x.lower() == "true",
        default=True,
        metavar="true|false",
        help="Run validation with auto-repair (default: true)",
    )
    args = parser.parse_args()

    _, message = pack(
        args.input_directory,
        args.output_file,
        original_file=args.original,
        validate=args.validate,
    )
    print(message)

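    # Every failure message begins with "Error:"; a prefix check avoids a
    # false exit when a successfully packed path merely contains "Error".
    if message.startswith("Error"):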
        sys.exit(1)

scripts/office/schemas/ISO-IEC29500-4_2016/dml-chart.xsd
scripts/office/schemas/ISO-IEC29500-4_2016/dml-chartDrawing.xsd
scripts/office/schemas/ISO-IEC29500-4_2016/dml-diagram.xsd
scripts/office/schemas/ISO-IEC29500-4_2016/dml-lockedCanvas.xsd
scripts/office/schemas/ISO-IEC29500-4_2016/dml-main.xsd
scripts/office/schemas/ISO-IEC29500-4_2016/dml-picture.xsd
scripts/office/schemas/ISO-IEC29500-4_2016/dml-spreadsheetDrawing.xsd
scripts/office/schemas/ISO-IEC29500-4_2016/dml-wordprocessingDrawing.xsd
scripts/office/schemas/ISO-IEC29500-4_2016/pml.xsd
scripts/office/schemas/ISO-IEC29500-4_2016/shared-additionalCharacteristics.xsd
scripts/office/schemas/ISO-IEC29500-4_2016/shared-bibliography.xsd
scripts/office/schemas/ISO-IEC29500-4_2016/shared-commonSimpleTypes.xsd
scripts/office/schemas/ISO-IEC29500-4_2016/shared-customXmlDataProperties.xsd
scripts/office/schemas/ISO-IEC29500-4_2016/shared-customXmlSchemaProperties.xsd
scripts/office/schemas/ISO-IEC29500-4_2016/shared-documentPropertiesCustom.xsd
scripts/office/schemas/ISO-IEC29500-4_2016/shared-documentPropertiesExtended.xsd
scripts/office/schemas/ISO-IEC29500-4_2016/shared-documentPropertiesVariantTypes.xsd
scripts/office/schemas/ISO-IEC29500-4_2016/shared-math.xsd
scripts/office/schemas/ISO-IEC29500-4_2016/shared-relationshipReference.xsd
scripts/office/schemas/ISO-IEC29500-4_2016/sml.xsd
scripts/office/schemas/ISO-IEC29500-4_2016/vml-main.xsd
scripts/office/schemas/ISO-IEC29500-4_2016/vml-officeDrawing.xsd
scripts/office/schemas/ISO-IEC29500-4_2016/vml-presentationDrawing.xsd
scripts/office/schemas/ISO-IEC29500-4_2016/vml-spreadsheetDrawing.xsd
scripts/office/schemas/ISO-IEC29500-4_2016/vml-wordprocessingDrawing.xsd
scripts/office/schemas/ISO-IEC29500-4_2016/wml.xsd
scripts/office/schemas/ISO-IEC29500-4_2016/xml.xsd
scripts/office/schemas/ecma/fouth-edition/opc-contentTypes.xsd
scripts/office/schemas/ecma/fouth-edition/opc-coreProperties.xsd
scripts/office/schemas/ecma/fouth-edition/opc-digSig.xsd
scripts/office/schemas/ecma/fouth-edition/opc-relationships.xsd
scripts/office/schemas/mce/mc.xsd
scripts/office/schemas/microsoft/wml-2010.xsd
scripts/office/schemas/microsoft/wml-2012.xsd
scripts/office/schemas/microsoft/wml-2018.xsd
scripts/office/schemas/microsoft/wml-cex-2018.xsd
scripts/office/schemas/microsoft/wml-cid-2016.xsd
scripts/office/schemas/microsoft/wml-sdtdatahash-2020.xsd
scripts/office/schemas/microsoft/wml-symex-2015.xsd

scripts/office/soffice.py

"""
Helper for running LibreOffice (soffice) in environments where AF_UNIX
sockets may be blocked (e.g., sandboxed VMs).  Detects the restriction
at runtime and applies an LD_PRELOAD shim if needed.

Usage:
    from office.soffice import run_soffice, get_soffice_env

    # Option 1 – run soffice directly
    result = run_soffice(["--headless", "--convert-to", "pdf", "input.docx"])

    # Option 2 – get env dict for your own subprocess calls
    env = get_soffice_env()
    subprocess.run(["soffice", ...], env=env)
"""

import os
import socket
import subprocess
import tempfile
from pathlib import Path


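def get_soffice_env() -> dict:
    """Return a copy of the environment prepared for headless soffice runs."""
    env = os.environ.copy()
    # Select the "svp" VCL backend so LibreOffice runs without a display.
    env["SAL_USE_VCLPLUGIN"] = "svp"

    # Preload the socket shim only where AF_UNIX sockets are blocked.
    if _needs_shim():
        shim = _ensure_shim()
        env["LD_PRELOAD"] = str(shim)

    return env


def run_soffice(args: list[str], **kwargs) -> subprocess.CompletedProcess:
    """Run soffice with the prepared environment; kwargs go to subprocess.run."""
    env = get_soffice_env()
    return subprocess.run(["soffice"] + args, env=env, **kwargs)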



_SHIM_SO = Path(tempfile.gettempdir()) / "lo_socket_shim.so"


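def _needs_shim() -> bool:
    """Return True when creating an AF_UNIX socket fails in this environment."""
    try:
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        s.close()
        return False
    except OSError:
        return True


def _ensure_shim() -> Path:
    """Compile the C shim into the temp directory once and return its path."""
    if _SHIM_SO.exists():
        return _SHIM_SO

    # Write the C source next to the target, build it with gcc, then clean up.
    src = Path(tempfile.gettempdir()) / "lo_socket_shim.c"
    src.write_text(_SHIM_SOURCE)
    subprocess.run(
        ["gcc", "-shared", "-fPIC", "-o", str(_SHIM_SO), str(src), "-ldl"],
        check=True,
        capture_output=True,
    )
    src.unlink()
    return _SHIM_SO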



_SHIM_SOURCE = r"""
#define _GNU_SOURCE
#include <dlfcn.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

static int (*real_socket)(int, int, int);
static int (*real_socketpair)(int, int, int, int[2]);
static int (*real_listen)(int, int);
static int (*real_accept)(int, struct sockaddr *, socklen_t *);
static int (*real_close)(int);
static int (*real_read)(int, void *, size_t);

/* Per-FD bookkeeping (FDs >= 1024 are passed through unshimmed). */
static int is_shimmed[1024];
static int peer_of[1024];
static int wake_r[1024];            /* accept() blocks reading this */
static int wake_w[1024];            /* close()  writes to this      */
static int listener_fd = -1;        /* FD that received listen()    */

__attribute__((constructor))
static void init(void) {
    real_socket     = dlsym(RTLD_NEXT, "socket");
    real_socketpair = dlsym(RTLD_NEXT, "socketpair");
    real_listen     = dlsym(RTLD_NEXT, "listen");
    real_accept     = dlsym(RTLD_NEXT, "accept");
    real_close      = dlsym(RTLD_NEXT, "close");
    real_read       = dlsym(RTLD_NEXT, "read");
    for (int i = 0; i < 1024; i++) {
        peer_of[i] = -1;
        wake_r[i]  = -1;
        wake_w[i]  = -1;
    }
}

/* ---- socket ---------------------------------------------------------- */
int socket(int domain, int type, int protocol) {
    if (domain == AF_UNIX) {
        int fd = real_socket(domain, type, protocol);
        if (fd >= 0) return fd;
        /* socket(AF_UNIX) blocked – fall back to socketpair(). */
        int sv[2];
        if (real_socketpair(domain, type, protocol, sv) == 0) {
            if (sv[0] >= 0 && sv[0] < 1024) {
                is_shimmed[sv[0]] = 1;
                peer_of[sv[0]]    = sv[1];
                int wp[2];
                if (pipe(wp) == 0) {
                    wake_r[sv[0]] = wp[0];
                    wake_w[sv[0]] = wp[1];
                }
            }
            return sv[0];
        }
        errno = EPERM;
        return -1;
    }
    return real_socket(domain, type, protocol);
}

/* ---- listen ---------------------------------------------------------- */
int listen(int sockfd, int backlog) {
    if (sockfd >= 0 && sockfd < 1024 && is_shimmed[sockfd]) {
        listener_fd = sockfd;
        return 0;
    }
    return real_listen(sockfd, backlog);
}

/* ---- accept ---------------------------------------------------------- */
int accept(int sockfd, struct sockaddr *addr, socklen_t *addrlen) {
    if (sockfd >= 0 && sockfd < 1024 && is_shimmed[sockfd]) {
        /* Block until close() writes to the wake pipe. */
        if (wake_r[sockfd] >= 0) {
            char buf;
            real_read(wake_r[sockfd], &buf, 1);
        }
        errno = ECONNABORTED;
        return -1;
    }
    return real_accept(sockfd, addr, addrlen);
}

/* ---- close ----------------------------------------------------------- */
int close(int fd) {
    if (fd >= 0 && fd < 1024 && is_shimmed[fd]) {
        int was_listener = (fd == listener_fd);
        is_shimmed[fd] = 0;

        if (wake_w[fd] >= 0) {              /* unblock accept() */
            char c = 0;
            write(wake_w[fd], &c, 1);
            real_close(wake_w[fd]);
            wake_w[fd] = -1;
        }
        if (wake_r[fd] >= 0) { real_close(wake_r[fd]); wake_r[fd]  = -1; }
        if (peer_of[fd] >= 0) { real_close(peer_of[fd]); peer_of[fd] = -1; }

        if (was_listener)
            _exit(0);                        /* conversion done – exit */
    }
    return real_close(fd);
}
"""



if __name__ == "__main__":
    import sys
    result = run_soffice(sys.argv[1:])
    sys.exit(result.returncode)

scripts/office/unpack.py

"""Unpack Office files (DOCX, PPTX, XLSX) for editing.

Extracts the ZIP archive, pretty-prints XML files, and optionally:
- Merges adjacent runs with identical formatting (DOCX only)
- Simplifies adjacent tracked changes from same author (DOCX only)

Usage:
    python unpack.py <office_file> <output_dir> [options]

Examples:
    python unpack.py document.docx unpacked/
    python unpack.py presentation.pptx unpacked/
    python unpack.py document.docx unpacked/ --merge-runs false
"""

import argparse
import sys
import zipfile
from pathlib import Path

import defusedxml.minidom

from helpers.merge_runs import merge_runs as do_merge_runs
from helpers.simplify_redlines import simplify_redlines as do_simplify_redlines

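SMART_QUOTE_REPLACEMENTS = {
    "\u201c": "&#x201C;",  # left double quotation mark
    "\u201d": "&#x201D;",  # right double quotation mark
    "\u2018": "&#x2018;",  # left single quotation mark
    "\u2019": "&#x2019;",  # right single quotation mark
}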


def unpack(
    input_file: str,
    output_directory: str,
    merge_runs: bool = True,
    simplify_redlines: bool = True,
) -> tuple[None, str]:
    input_path = Path(input_file)
    output_path = Path(output_directory)
    suffix = input_path.suffix.lower()

    if not input_path.exists():
        return None, f"Error: {input_file} does not exist"

    if suffix not in {".docx", ".pptx", ".xlsx"}:
        return None, f"Error: {input_file} must be a .docx, .pptx, or .xlsx file"

    try:
        output_path.mkdir(parents=True, exist_ok=True)

        with zipfile.ZipFile(input_path, "r") as zf:
            zf.extractall(output_path)

        xml_files = list(output_path.rglob("*.xml")) + list(output_path.rglob("*.rels"))
        for xml_file in xml_files:
            _pretty_print_xml(xml_file)

        message = f"Unpacked {input_file} ({len(xml_files)} XML files)"

        if suffix == ".docx":
            if simplify_redlines:
                simplify_count, _ = do_simplify_redlines(str(output_path))
                message += f", simplified {simplify_count} tracked changes"

            if merge_runs:
                merge_count, _ = do_merge_runs(str(output_path))
                message += f", merged {merge_count} runs"

        for xml_file in xml_files:
            _escape_smart_quotes(xml_file)

        return None, message

    except zipfile.BadZipFile:
        return None, f"Error: {input_file} is not a valid Office file"
    except Exception as e:
        return None, f"Error unpacking: {e}"


def _pretty_print_xml(xml_file: Path) -> None:
    try:
        content = xml_file.read_text(encoding="utf-8")
        dom = defusedxml.minidom.parseString(content)
        xml_file.write_bytes(dom.toprettyxml(indent="  ", encoding="utf-8"))
    except Exception:
        pass  # leave parts that fail to parse (or are not UTF-8) unchanged


def _escape_smart_quotes(xml_file: Path) -> None:
    try:
        content = xml_file.read_text(encoding="utf-8")
        for char, entity in SMART_QUOTE_REPLACEMENTS.items():
            content = content.replace(char, entity)
        xml_file.write_text(content, encoding="utf-8")
    except Exception:
        pass


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Unpack an Office file (DOCX, PPTX, XLSX) for editing"
    )
    parser.add_argument("input_file", help="Office file to unpack")
    parser.add_argument("output_directory", help="Output directory")
    parser.add_argument(
        "--merge-runs",
        type=lambda x: x.lower() == "true",
        default=True,
        metavar="true|false",
        help="Merge adjacent runs with identical formatting (DOCX only, default: true)",
    )
    parser.add_argument(
        "--simplify-redlines",
        type=lambda x: x.lower() == "true",
        default=True,
        metavar="true|false",
        help="Merge adjacent tracked changes from same author (DOCX only, default: true)",
    )
    args = parser.parse_args()

    _, message = unpack(
        args.input_file,
        args.output_directory,
        merge_runs=args.merge_runs,
        simplify_redlines=args.simplify_redlines,
    )
    print(message)

    if message.startswith("Error"):
        sys.exit(1)
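
The unpack flow above (extract the ZIP, pretty-print each XML part, escape smart quotes) can be sketched in isolation. This is a minimal sketch under stated assumptions, not the script itself: it uses the standard library's `xml.dom.minidom` in place of `defusedxml`, skips the DOCX-only merge steps, and `unpack_sketch` and `demo` are hypothetical names. The demo archive is a plain ZIP with one XML part, not a valid PPTX.

```python
import io
import tempfile
import zipfile
from pathlib import Path
from xml.dom import minidom  # stdlib stand-in; unpack.py itself uses defusedxml

SMART_QUOTES = {"\u201c": "&#x201C;", "\u201d": "&#x201D;"}


def unpack_sketch(archive_bytes: bytes, out_dir: Path) -> list[Path]:
    """Extract a ZIP archive, pretty-print its XML parts, escape smart quotes."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        zf.extractall(out_dir)

    parts = sorted(out_dir.rglob("*.xml")) + sorted(out_dir.rglob("*.rels"))
    for part in parts:
        dom = minidom.parseString(part.read_text(encoding="utf-8"))
        text = dom.toprettyxml(indent="  ", encoding="utf-8").decode("utf-8")
        # Replace curly quotes with numeric character references.
        for char, entity in SMART_QUOTES.items():
            text = text.replace(char, entity)
        part.write_text(text, encoding="utf-8")
    return parts


def demo() -> str:
    """Build a tiny in-memory archive, unpack it, and return the rewritten part."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("ppt/slides/slide1.xml", "<p><t>\u201cHi\u201d</t></p>")
    with tempfile.TemporaryDirectory() as tmp:
        parts = unpack_sketch(buf.getvalue(), Path(tmp))
        return parts[0].read_text(encoding="utf-8")
```

Escaping the quotes after pretty-printing keeps them as visible ASCII entities in the unpacked files, which is presumably easier to edit reliably than the raw Unicode characters.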

scripts/office/validate.py

"""
Command line tool to validate Office document XML files against XSD schemas and tracked changes.

Usage:
    python validate.py <path> [--original <original_file>] [--auto-repair] [--author NAME]

The first argument can be either:
- An unpacked directory containing the Office document XML files
- A packed Office file (.docx/.pptx/.xlsx) which will be unpacked to a temp directory

Auto-repair fixes:
- paraId/durableId values that exceed OOXML limits
- Missing xml:space="preserve" on w:t elements with whitespace
"""

import argparse
import sys
import tempfile
import zipfile
from pathlib import Path

from validators import DOCXSchemaValidator, PPTXSchemaValidator, RedliningValidator


def main():
    parser = argparse.ArgumentParser(description="Validate Office document XML files")
    parser.add_argument(
        "path",
        help="Path to unpacked directory or packed Office file (.docx/.pptx/.xlsx)",
    )
    parser.add_argument(
        "--original",
        required=False,
        default=None,
        help="Path to original file (.docx/.pptx/.xlsx). If omitted, all XSD errors are reported and redlining validation is skipped.",
    )
    parser.add_argument(
        "-v",
        "--verbose",
        action="store_true",
        help="Enable verbose output",
    )
    parser.add_argument(
        "--auto-repair",
        action="store_true",
        help="Automatically repair common issues (hex IDs, whitespace preservation)",
    )
    parser.add_argument(
        "--author",
        default="Claude",
        help="Author name for redlining validation (default: Claude)",
    )
    args = parser.parse_args()

    path = Path(args.path)
    assert path.exists(), f"Error: {path} does not exist"

    original_file = None
    if args.original:
        original_file = Path(args.original)
        assert original_file.is_file(), f"Error: {original_file} is not a file"
        assert original_file.suffix.lower() in [".docx", ".pptx", ".xlsx"], (
            f"Error: {original_file} must be a .docx, .pptx, or .xlsx file"
        )

    file_extension = (original_file or path).suffix.lower()
    assert file_extension in [".docx", ".pptx", ".xlsx"], (
        f"Error: Cannot determine file type from {path}. Use --original or provide a .docx/.pptx/.xlsx file."
    )

    if path.is_file() and path.suffix.lower() in [".docx", ".pptx", ".xlsx"]:
        temp_dir = tempfile.mkdtemp()
        with zipfile.ZipFile(path, "r") as zf:
            zf.extractall(temp_dir)
        unpacked_dir = Path(temp_dir)
    else:
        assert path.is_dir(), f"Error: {path} is not a directory or Office file"
        unpacked_dir = path

    match file_extension:
        case ".docx":
            validators = [
                DOCXSchemaValidator(unpacked_dir, original_file, verbose=args.verbose),
            ]
            if original_file:
                validators.append(
                    RedliningValidator(unpacked_dir, original_file, verbose=args.verbose, author=args.author)  
                )
        case ".pptx":
            validators = [
                PPTXSchemaValidator(unpacked_dir, original_file, verbose=args.verbose),
            ]
        case _:
            print(f"Error: Validation not supported for file type {file_extension}")
            sys.exit(1)

    if args.auto_repair:
        total_repairs = sum(v.repair() for v in validators)
        if total_repairs:
            print(f"Auto-repaired {total_repairs} issue(s)")

    success = all(v.validate() for v in validators)

    if success:
        print("All validations PASSED!")

    sys.exit(0 if success else 1)


if __name__ == "__main__":
    main()
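
The docstring above lists "Missing xml:space="preserve" on w:t elements with whitespace" as one auto-repair. The validators' repair code is not shown here, so the following is only a sketch of what such a repair could look like, assuming the standard library's `xml.etree.ElementTree`; `add_space_preserve` is a hypothetical helper, not part of the `validators` package.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
# ElementTree addresses xml:space through the reserved XML namespace URI.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"


def add_space_preserve(xml_text: str) -> tuple[str, int]:
    """Mark w:t elements with leading/trailing whitespace as xml:space="preserve".

    Returns the rewritten XML and the number of elements repaired.
    """
    ET.register_namespace("w", W_NS)  # keep the conventional "w:" prefix on output
    root = ET.fromstring(xml_text)
    repairs = 0
    for t in root.iter(f"{{{W_NS}}}t"):
        text = t.text or ""
        if text != text.strip() and t.get(XML_SPACE) != "preserve":
            t.set(XML_SPACE, "preserve")
            repairs += 1
    return ET.tostring(root, encoding="unicode"), repairs
```

Running the function a second time on its own output should report zero repairs, which makes it safe to call before every validation pass.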

scripts/office/validators/__init__.py

"""
Validation modules for Word document processing.
"""

from .base import BaseSchemaValidator
from .docx import DOCXSchemaValidator
from .pptx import PPTXSchemaValidator
from .redlining import RedliningValidator

__all__ = [
    "BaseSchemaValidator",
    "DOCXSchemaValidator",
    "PPTXSchemaValidator",
    "RedliningValidator",
]

scripts/office/validators/base.py

Binary resource

scripts/office/validators/docx.py

Binary resource

scripts/office/validators/pptx.py

Binary resource

scripts/office/validators/redlining.py

Binary resource

scripts/thumbnail.py

Binary resource

Source: Content adapted from Anthropics/Skills (MIT).

Quick reference

Task	Guide
Read/analyze content	`python -m markitdown presentation.pptx`
Edit or create from a template	Read editing.md
Create from scratch	Read pptxgenjs.md

Reading content

# Text extraction
python -m markitdown presentation.pptx

# Visual overview
python scripts/thumbnail.py presentation.pptx

# Raw XML
python scripts/office/unpack.py presentation.pptx unpacked/

Editing workflow

Read editing.md for full details.

Analyze the template with thumbnail.py
Unpack -> manipulate slides -> edit content -> clean -> pack

Creating from scratch

Read pptxgenjs.md for full details.

Use this route when no template or reference presentation is available.

Design ideas

Don't create boring slides. Plain bullets on a white background won't impress anyone. Consider ideas from this list for each slide.

Before starting

Pick a bold, content-informed color palette: The palette should feel designed for THIS topic. If swapping your colors into a completely different presentation would still "work", you haven't made specific enough choices.
Dominance over equality: One color should dominate (60-70% of the visual weight), with 1-2 supporting tones and one sharp accent. Never give all colors equal weight.
Dark/light contrast: Use dark backgrounds for title and conclusion slides and light backgrounds for content (a "sandwich" structure). Or commit to dark throughout for a premium feel.
Commit to a visual motif: Pick ONE distinctive element and repeat it – rounded image frames, icons in colored circles, thick single-side borders. Carry it across every slide.

Color palettes

Choose colors that match your topic – don't default to blue. Use these palettes as inspiration:

Theme	Primary	Secondary	Accent
Midnight Executive	`1E2761` (navy)	`CADCFC` (ice blue)	`FFFFFF` (white)
Forest & Moss	`2C5F2D` (forest)	`97BC62` (moss)	`F5F5F5` (cream)
Coral Energy	`F96167` (coral)	`F9E795` (gold)	`2F3C7E` (navy)
Warm Terracotta	`B85042` (terracotta)	`E7E8D1` (sand)	`A7BEAE` (sage)
Ocean Gradient	`065A82` (deep blue)	`1C7293` (teal)	`21295C` (midnight)
Charcoal Minimal	`36454F` (charcoal)	`F2F2F2` (off-white)	`212121` (black)
Teal Trust	`028090` (teal)	`00A896` (seafoam)	`02C39A` (mint)
Berry & Cream	`6D2E46` (berry)	`A26769` (dusty rose)	`ECE2D0` (cream)
Sage Calm	`84B59F` (sage)	`69A297` (eucalyptus)	`50808E` (slate)
Cherry Bold	`990011` (cherry)	`FCF6F5` (off-white)	`2F3C7E` (navy)

For each slide

Every slide needs a visual element – image, chart, icon, or shape. Text-only slides are forgettable.

Layout options:

Two-column (text left, illustration right)
Icon + text rows (icon in a colored circle, bold header, description below)
2x2 or 2x3 grid (image on one side, grid of content blocks on the other)
Half-bleed image (full left or right side) with content overlay

Data display:

Large stat callouts (big numbers at 60-72pt with small labels below)
Comparison columns (before/after, pros/cons, side-by-side options)
Timeline or process flow (numbered steps, arrows)

Visual polish:

Icons in small colored circles next to section headers
Italic accent text for key stats or taglines

Typography

Header font	Body font
Georgia	Calibri
Arial Black	Arial
Calibri	Calibri Light
Cambria	Calibri
Trebuchet MS	Calibri
Impact	Arial
Palatino	Garamond
Consolas	Calibri

Element	Size
Slide title	36-44pt bold
Section header	20-24pt bold
Body text	14-16pt
Captions	10-12pt muted

Spacing

0.5" minimum margins
0.3-0.5" between content blocks
Leave breathing room – don't fill every inch
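
The spacing rules above can be checked mechanically before rendering. The following is a rough sketch, assuming element boxes are known as (x, y, width, height) in inches and assuming a 10 x 5.625 inch slide; the `Box` type, the `spacing_issues` helper, and the slide size are illustrative assumptions, not part of the skill's scripts.

```python
from dataclasses import dataclass

MIN_MARGIN_IN = 0.5  # minimum distance to any slide edge, from the rules above
MIN_GAP_IN = 0.3     # minimum distance between content blocks, from the rules above
EPS = 1e-9           # tolerance for floating-point rounding


@dataclass(frozen=True)
class Box:
    name: str
    x: float
    y: float
    w: float
    h: float


def spacing_issues(boxes: list[Box], slide_w: float = 10.0, slide_h: float = 5.625) -> list[str]:
    """Return human-readable violations of the margin and gap rules."""
    issues = []
    for b in boxes:
        edge = min(b.x, b.y, slide_w - (b.x + b.w), slide_h - (b.y + b.h))
        if edge < MIN_MARGIN_IN - EPS:
            issues.append(f"{b.name}: {edge:.2f} in. from a slide edge")

    for i, a in enumerate(boxes):
        for b in boxes[i + 1:]:
            # Separation along each axis; negative on both axes means overlap.
            gap_x = max(b.x - (a.x + a.w), a.x - (b.x + b.w))
            gap_y = max(b.y - (a.y + a.h), a.y - (b.y + b.h))
            gap = max(gap_x, gap_y)
            if gap < 0:
                issues.append(f"{a.name}/{b.name}: boxes overlap")
            elif gap < MIN_GAP_IN - EPS:
                issues.append(f"{a.name}/{b.name}: gap of {gap:.2f} in.")
    return issues
```

A check like this only covers geometry that is known up front; wrapped or overflowing text still needs the visual inspection described in the QA section.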

Avoid (common mistakes)

Don't repeat the same layout – vary columns, cards, and callouts across slides
Don't center body text – left-align paragraphs and lists; center only titles
Don't skimp on size contrast – titles need 36pt+ to stand out from 14-16pt body text
Don't default to blue – pick colors that reflect the specific topic
Don't mix spacing randomly – choose 0.3" or 0.5" gaps and use them consistently
Don't style one slide and leave the rest plain – commit fully or keep it simple throughout
Don't create text-only slides – add images, icons, charts, or visual elements; avoid plain title + bullets
Don't forget text box padding – when aligning lines or shapes with text edges, set margin: 0 on the text box or offset the shape to account for the padding
Don't use low-contrast elements – icons AND text need strong contrast against the background; avoid light text on light backgrounds or dark text on dark backgrounds
NEVER use accent lines under titles – these are a hallmark of AI-generated slides; use whitespace or background color instead
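
The low-contrast rule in the list above can be turned into a quick numeric check. The guidance does not name a metric, so this sketch assumes the WCAG 2.x contrast ratio based on sRGB relative luminance; the function names are illustrative, and the hex values come from the palette table.

```python
def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a 6-digit hex color such as "1E2761"."""
    channels = [int(hex_color[i:i + 2], 16) / 255 for i in (0, 2, 4)]
    # Undo the sRGB transfer function to get linear-light channel values.
    r, g, b = [
        c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4
        for c in channels
    ]
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(foreground: str, background: str) -> float:
    """Contrast ratio between two colors, from 1 (identical) to 21 (black/white)."""
    lum_a = relative_luminance(foreground)
    lum_b = relative_luminance(background)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)
```

With this measure, white text on the Midnight Executive navy (`1E2761`) scores far higher than ice blue (`CADCFC`) text on a white background, which matches the intuition behind the rule.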

QA (required)

Assume there are problems. Your job is to find them.

Content QA

python -m markitdown output.pptx

Check for missing content, typos, and wrong order.

When using templates, check for leftover placeholder text:

python -m markitdown output.pptx | grep -iE "xxxx|lorem|ipsum|this.*(page|slide).*layout"

If grep returns results, fix them before declaring success.
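
Where the extracted text is already held in a Python string, the same placeholder check can be run without a shell pipeline. This is a small sketch that reuses the grep pattern above with `re.IGNORECASE`; `find_placeholder_lines` is a hypothetical helper name.

```python
import re

# Same alternation as the grep -iE pattern above, matched case-insensitively.
PLACEHOLDER_RE = re.compile(
    r"xxxx|lorem|ipsum|this.*(page|slide).*layout", re.IGNORECASE
)


def find_placeholder_lines(text: str) -> list[tuple[int, str]]:
    """Return (line number, line) pairs that look like leftover placeholder text."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if PLACEHOLDER_RE.search(line)
    ]
```

An empty result corresponds to grep printing nothing; any returned line should be fixed before the deck is reported as done.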

Visual QA

USE SUBAGENTS – even for 2-3 slides. You have been staring at the code and will see what you expect, not what is there. Subagents have fresh eyes.

Convert slides to images (see Converting to images), then use this prompt:

Visually inspect these slides. Assume there are issues - find them.

Look for:
- Overlapping elements (text through shapes, lines through words, stacked elements)
- Text overflow or cut off at edges/box boundaries
- Decorative lines positioned for single-line text but title wrapped to two lines
- Source citations or footers colliding with content above
- Elements too close (< 0.3" gaps) or cards/sections nearly touching
- Uneven gaps (large empty area in one place, cramped in another)
- Insufficient margin from slide edges (< 0.5")
- Columns or similar elements not aligned consistently
- Low-contrast text (e.g., light gray text on cream-colored background)
- Low-contrast icons (e.g., dark icons on dark backgrounds without a contrasting circle)
- Text boxes too narrow causing excessive wrapping
- Leftover placeholder content

For each slide, list issues or areas of concern, even if minor.

Read and analyze these images:
1. /path/to/slide-01.jpg (Expected: [brief description])
2. /path/to/slide-02.jpg (Expected: [brief description])

Report ALL issues found, including minor ones.

Verification loop

Generate slides -> convert to images -> inspect
List the issues found (if none are found, look again more critically)
Fix the issues
Re-verify the affected slides – one fix often creates another problem
Repeat until a full pass reveals no new issues

Do not declare success until you have completed at least one fix-and-verify cycle.

Converting to images

Convert presentations to individual slide images for visual inspection:

python scripts/office/soffice.py --headless --convert-to pdf output.pptx
pdftoppm -jpeg -r 150 output.pdf slide

This creates slide-01.jpg, slide-02.jpg, etc.

To re-render specific slides after fixes:

pdftoppm -jpeg -r 150 -f N -l N output.pdf slide-fixed
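
When the conversion is driven from Python rather than typed by hand, the two commands above can be assembled as argument lists. This is a sketch only: the helper names are hypothetical, the flags are exactly the ones shown above, and nothing is executed until the lists are handed to `subprocess.run`.

```python
from typing import Optional


def to_pdf_command(pptx_path: str) -> list[str]:
    """Argument list for the PPTX-to-PDF step via the soffice.py wrapper."""
    return [
        "python", "scripts/office/soffice.py",
        "--headless", "--convert-to", "pdf", pptx_path,
    ]


def to_jpeg_command(
    pdf_path: str,
    prefix: str = "slide",
    dpi: int = 150,
    page: Optional[int] = None,
) -> list[str]:
    """Argument list for pdftoppm; pass `page` to re-render a single slide."""
    cmd = ["pdftoppm", "-jpeg", "-r", str(dpi)]
    if page is not None:
        # -f and -l select the first and last page to render.
        cmd += ["-f", str(page), "-l", str(page)]
    return cmd + [pdf_path, prefix]
```

Each list can be run with `subprocess.run(cmd, check=True)`; using lists instead of a shell string avoids quoting problems with file names that contain spaces.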

Dependencies

pip install "markitdown[pptx]" – text extraction
pip install Pillow – thumbnail grids
npm install -g pptxgenjs – creating from scratch
LibreOffice (soffice) – PDF conversion (auto-configured for sandboxed environments via scripts/office/soffice.py)
Poppler (pdftoppm) – PDF to images

scripts/office/helpers/merge_runs.py

"""Merge adjacent runs with identical formatting in DOCX.

Merges adjacent <w:r> elements that have identical <w:rPr> properties.
Works on runs in paragraphs and inside tracked changes (<w:ins>, <w:del>).

Also:
- Removes rsid attributes from runs (revision metadata that doesn't affect rendering)
- Removes proofErr elements (spell/grammar markers that block merging)
"""

from pathlib import Path

import defusedxml.minidom


def merge_runs(input_dir: str) -> tuple[int, str]:
    doc_xml = Path(input_dir) / "word" / "document.xml"

    if not doc_xml.exists():
        return 0, f"Error: {doc_xml} not found"

    try:
        dom = defusedxml.minidom.parseString(doc_xml.read_text(encoding="utf-8"))
        root = dom.documentElement

        _remove_elements(root, "proofErr")
        _strip_run_rsid_attrs(root)

        containers = {run.parentNode for run in _find_elements(root, "r")}

        merge_count = 0
        for container in containers:
            merge_count += _merge_runs_in(container)

        doc_xml.write_bytes(dom.toxml(encoding="UTF-8"))
        return merge_count, f"Merged {merge_count} runs"

    except Exception as e:
        return 0, f"Error: {e}"




def _find_elements(root, tag: str) -> list:
    results = []

    def traverse(node):
        if node.nodeType == node.ELEMENT_NODE:
            name = node.localName or node.tagName
            if name == tag or name.endswith(f":{tag}"):
                results.append(node)
            for child in node.childNodes:
                traverse(child)

    traverse(root)
    return results


def _get_child(parent, tag: str):
    for child in parent.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            name = child.localName or child.tagName
            if name == tag or name.endswith(f":{tag}"):
                return child
    return None


def _get_children(parent, tag: str) -> list:
    results = []
    for child in parent.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            name = child.localName or child.tagName
            if name == tag or name.endswith(f":{tag}"):
                results.append(child)
    return results


def _is_adjacent(elem1, elem2) -> bool:
    node = elem1.nextSibling
    while node:
        if node == elem2:
            return True
        if node.nodeType == node.ELEMENT_NODE:
            return False
        if node.nodeType == node.TEXT_NODE and node.data.strip():
            return False
        node = node.nextSibling
    return False




def _remove_elements(root, tag: str):
    for elem in _find_elements(root, tag):
        if elem.parentNode:
            elem.parentNode.removeChild(elem)


def _strip_run_rsid_attrs(root):
    for run in _find_elements(root, "r"):
        for attr in list(run.attributes.values()):
            if "rsid" in attr.name.lower():
                run.removeAttribute(attr.name)




def _merge_runs_in(container) -> int:
    merge_count = 0
    run = _first_child_run(container)

    while run:
        while True:
            next_elem = _next_element_sibling(run)
            if next_elem and _is_run(next_elem) and _can_merge(run, next_elem):
                _merge_run_content(run, next_elem)
                container.removeChild(next_elem)
                merge_count += 1
            else:
                break

        _consolidate_text(run)
        run = _next_sibling_run(run)

    return merge_count


def _first_child_run(container):
    for child in container.childNodes:
        if child.nodeType == child.ELEMENT_NODE and _is_run(child):
            return child
    return None


def _next_element_sibling(node):
    sibling = node.nextSibling
    while sibling:
        if sibling.nodeType == sibling.ELEMENT_NODE:
            return sibling
        sibling = sibling.nextSibling
    return None


def _next_sibling_run(node):
    sibling = node.nextSibling
    while sibling:
        if sibling.nodeType == sibling.ELEMENT_NODE:
            if _is_run(sibling):
                return sibling
        sibling = sibling.nextSibling
    return None


def _is_run(node) -> bool:
    name = node.localName or node.tagName
    return name == "r" or name.endswith(":r")


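def _can_merge(run1, run2) -> bool:
    """Runs can merge when both lack <w:rPr> or their <w:rPr> XML is identical."""
    rpr1 = _get_child(run1, "rPr")
    rpr2 = _get_child(run2, "rPr")

    if (rpr1 is None) != (rpr2 is None):
        return False
    if rpr1 is None:
        return True
    # Compare the serialized formatting properties byte for byte.
    return rpr1.toxml() == rpr2.toxml()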


def _merge_run_content(target, source):
    for child in list(source.childNodes):
        if child.nodeType == child.ELEMENT_NODE:
            name = child.localName or child.tagName
            if name != "rPr" and not name.endswith(":rPr"):
                target.appendChild(child)


def _consolidate_text(run):
    t_elements = _get_children(run, "t")

    for i in range(len(t_elements) - 1, 0, -1):
        curr, prev = t_elements[i], t_elements[i - 1]

        if _is_adjacent(prev, curr):
            prev_text = prev.firstChild.data if prev.firstChild else ""
            curr_text = curr.firstChild.data if curr.firstChild else ""
            merged = prev_text + curr_text

            if prev.firstChild:
                prev.firstChild.data = merged
            else:
                prev.appendChild(run.ownerDocument.createTextNode(merged))

            if merged.startswith(" ") or merged.endswith(" "):
                prev.setAttribute("xml:space", "preserve")
            elif prev.hasAttribute("xml:space"):
                prev.removeAttribute("xml:space")

            run.removeChild(curr)
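
The core idea of the module above, merging neighboring runs whose formatting is identical, can be shown on a single paragraph. This is a simplified sketch, not the module's algorithm: it assumes the `w:` prefix is used literally, handles only runs with one `<w:t>` child, and uses the standard library's `xml.dom.minidom` instead of `defusedxml`. `merge_adjacent_runs` and its helpers are hypothetical names.

```python
from xml.dom import minidom  # stdlib stand-in; merge_runs.py itself uses defusedxml


def _child(run, tag_name):
    """First direct child element with the given qualified name, or None."""
    for child in run.childNodes:
        if child.nodeType == child.ELEMENT_NODE and child.tagName == tag_name:
            return child
    return None


def _formatting(run) -> str:
    """Serialized <w:rPr> of a run, or an empty string when it has none."""
    rpr = _child(run, "w:rPr")
    return rpr.toxml() if rpr is not None else ""


def _text(t_elem) -> str:
    return "".join(n.data for n in t_elem.childNodes if n.nodeType == n.TEXT_NODE)


def merge_adjacent_runs(paragraph_xml: str) -> tuple[str, int]:
    """Merge neighboring <w:r> elements with identical <w:rPr>; return XML and count."""
    dom = minidom.parseString(paragraph_xml)
    para = dom.documentElement
    merged = 0
    prev = None  # the run that the next run may be folded into
    for node in list(para.childNodes):
        if node.nodeType != node.ELEMENT_NODE:
            continue  # whitespace between elements does not break adjacency
        is_run = node.tagName == "w:r"
        if is_run and prev is not None and _formatting(prev) == _formatting(node):
            t_prev, t_node = _child(prev, "w:t"), _child(node, "w:t")
            if t_prev is not None and t_node is not None:
                combined = _text(t_prev) + _text(t_node)
                while t_prev.firstChild:
                    t_prev.removeChild(t_prev.firstChild)
                t_prev.appendChild(dom.createTextNode(combined))
                para.removeChild(node)
                merged += 1
                continue  # keep prev so a third identical run also merges
        prev = node if is_run else None
    return para.toxml(), merged
```

Unlike this sketch, the module also strips rsid attributes and proofErr elements first, since those would otherwise keep visually identical runs from comparing equal or sitting next to each other.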

scripts/office/helpers/simplify_redlines.py

"""Simplify tracked changes by merging adjacent w:ins or w:del elements.

Merges adjacent <w:ins> elements from the same author into a single element.
Same for <w:del> elements. This makes heavily-redlined documents easier to
work with by reducing the number of tracked change wrappers.

Rules:
- Only merges w:ins with w:ins, w:del with w:del (same element type)
- Only merges if same author (ignores timestamp differences)
- Only merges if truly adjacent (only whitespace between them)
"""

import xml.etree.ElementTree as ET
import zipfile
from pathlib import Path

import defusedxml.minidom

WORD_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def simplify_redlines(input_dir: str) -> tuple[int, str]:
    doc_xml = Path(input_dir) / "word" / "document.xml"

    if not doc_xml.exists():
        return 0, f"Error: {doc_xml} not found"

    try:
        dom = defusedxml.minidom.parseString(doc_xml.read_text(encoding="utf-8"))
        root = dom.documentElement

        merge_count = 0

        containers = _find_elements(root, "p") + _find_elements(root, "tc")

        for container in containers:
            merge_count += _merge_tracked_changes_in(container, "ins")
            merge_count += _merge_tracked_changes_in(container, "del")

        doc_xml.write_bytes(dom.toxml(encoding="UTF-8"))
        return merge_count, f"Simplified {merge_count} tracked changes"

    except Exception as e:
        return 0, f"Error: {e}"


def _merge_tracked_changes_in(container, tag: str) -> int:
    merge_count = 0

    tracked = [
        child
        for child in container.childNodes
        if child.nodeType == child.ELEMENT_NODE and _is_element(child, tag)
    ]

    if len(tracked) < 2:
        return 0

    i = 0
    while i < len(tracked) - 1:
        curr = tracked[i]
        next_elem = tracked[i + 1]

        if _can_merge_tracked(curr, next_elem):
            _merge_tracked_content(curr, next_elem)
            container.removeChild(next_elem)
            tracked.pop(i + 1)
            merge_count += 1
        else:
            i += 1

    return merge_count


def _is_element(node, tag: str) -> bool:
    name = node.localName or node.tagName
    return name == tag or name.endswith(f":{tag}")


def _get_author(elem) -> str:
    author = elem.getAttribute("w:author")
    if not author:
        for attr in elem.attributes.values():
            if attr.localName == "author" or attr.name.endswith(":author"):
                return attr.value
    return author


def _can_merge_tracked(elem1, elem2) -> bool:
    if _get_author(elem1) != _get_author(elem2):
        return False

    node = elem1.nextSibling
    while node and node != elem2:
        if node.nodeType == node.ELEMENT_NODE:
            return False
        if node.nodeType == node.TEXT_NODE and node.data.strip():
            return False
        node = node.nextSibling

    return True


def _merge_tracked_content(target, source):
    while source.firstChild:
        child = source.firstChild
        source.removeChild(child)
        target.appendChild(child)


def _find_elements(root, tag: str) -> list:
    results = []

    def traverse(node):
        if node.nodeType == node.ELEMENT_NODE:
            name = node.localName or node.tagName
            if name == tag or name.endswith(f":{tag}"):
                results.append(node)
            for child in node.childNodes:
                traverse(child)

    traverse(root)
    return results


def get_tracked_change_authors(doc_xml_path: Path) -> dict[str, int]:
    if not doc_xml_path.exists():
        return {}

    try:
        tree = ET.parse(doc_xml_path)
        root = tree.getroot()
    except ET.ParseError:
        return {}

    namespaces = {"w": WORD_NS}
    author_attr = f"{{{WORD_NS}}}author"

    authors: dict[str, int] = {}
    for tag in ["ins", "del"]:
        for elem in root.findall(f".//w:{tag}", namespaces):
            author = elem.get(author_attr)
            if author:
                authors[author] = authors.get(author, 0) + 1

    return authors


def _get_authors_from_docx(docx_path: Path) -> dict[str, int]:
    try:
        with zipfile.ZipFile(docx_path, "r") as zf:
            if "word/document.xml" not in zf.namelist():
                return {}
            with zf.open("word/document.xml") as f:
                tree = ET.parse(f)
                root = tree.getroot()

                namespaces = {"w": WORD_NS}
                author_attr = f"{{{WORD_NS}}}author"

                authors: dict[str, int] = {}
                for tag in ["ins", "del"]:
                    for elem in root.findall(f".//w:{tag}", namespaces):
                        author = elem.get(author_attr)
                        if author:
                            authors[author] = authors.get(author, 0) + 1
                return authors
    except (zipfile.BadZipFile, ET.ParseError):
        return {}


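def infer_author(modified_dir: Path, original_docx: Path, default: str = "Claude") -> str:
    """Infer which author added tracked changes relative to the original DOCX.

    Returns `default` when no author has more changes than in the original,
    and raises ValueError when more than one author does.
    """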
    modified_xml = modified_dir / "word" / "document.xml"
    modified_authors = get_tracked_change_authors(modified_xml)

    if not modified_authors:
        return default

    original_authors = _get_authors_from_docx(original_docx)

    new_changes: dict[str, int] = {}
    for author, count in modified_authors.items():
        original_count = original_authors.get(author, 0)
        diff = count - original_count
        if diff > 0:
            new_changes[author] = diff

    if not new_changes:
        return default

    if len(new_changes) == 1:
        return next(iter(new_changes))

    raise ValueError(
        f"Multiple authors added new changes: {new_changes}. "
        "Cannot infer which author to validate."
    )
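
The merge step in simplify_redlines can be illustrated on a tiny in-memory paragraph. This is a minimal sketch, not the helper itself: it assumes the standard-library `xml.dom.minidom` instead of `defusedxml`, uses a hypothetical `SAMPLE` paragraph and a hypothetical `merge_adjacent` function, and only merges wrappers that are direct neighbours (the helper above also tolerates whitespace between them).

```python
from xml.dom import minidom

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical paragraph: two adjacent insertions by Alice, then one by Bob.
SAMPLE = (
    f'<w:p xmlns:w="{W}">'
    '<w:ins w:id="1" w:author="Alice"><w:r><w:t>Hello </w:t></w:r></w:ins>'
    '<w:ins w:id="2" w:author="Alice"><w:r><w:t>world</w:t></w:r></w:ins>'
    '<w:ins w:id="3" w:author="Bob"><w:r><w:t>!</w:t></w:r></w:ins>'
    "</w:p>"
)


def merge_adjacent(paragraph, tag: str) -> int:
    """Fold each same-author wrapper into the wrapper directly before it."""
    merged = 0
    node = paragraph.firstChild
    while node is not None and node.nextSibling is not None:
        nxt = node.nextSibling
        same_kind = (
            node.nodeType == node.ELEMENT_NODE
            and nxt.nodeType == nxt.ELEMENT_NODE
            and node.tagName == tag
            and nxt.tagName == tag
        )
        if same_kind and node.getAttribute("w:author") == nxt.getAttribute("w:author"):
            # Move every child of the second wrapper into the first one.
            while nxt.firstChild:
                node.appendChild(nxt.firstChild)
            paragraph.removeChild(nxt)
            merged += 1
        else:
            node = nxt
    return merged


dom = minidom.parseString(SAMPLE)
count = merge_adjacent(dom.documentElement, "w:ins")
wrappers = dom.getElementsByTagName("w:ins")
print(count, [w.getAttribute("w:author") for w in wrappers])
```

After the call, Alice's two insertions share one wrapper and Bob's stays separate, because the author check fails at that boundary.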

scripts/office/pack.py

Download scripts/office/pack.py

"""Pack a directory into a DOCX, PPTX, or XLSX file.

Validates with auto-repair (only when --original is given), condenses XML formatting, and creates the Office file.

Usage:
    python pack.py <input_directory> <output_file> [--original <file>] [--validate true|false]

Examples:
    python pack.py unpacked/ output.docx --original input.docx
    python pack.py unpacked/ output.pptx --validate false
"""

import argparse
import sys
import shutil
import tempfile
import zipfile
from pathlib import Path

import defusedxml.minidom

from validators import DOCXSchemaValidator, PPTXSchemaValidator, RedliningValidator

def pack(
    input_directory: str,
    output_file: str,
    original_file: str | None = None,
    validate: bool = True,
    infer_author_func=None,
) -> tuple[None, str]:
    input_dir = Path(input_directory)
    output_path = Path(output_file)
    suffix = output_path.suffix.lower()

    if not input_dir.is_dir():
        return None, f"Error: {input_dir} is not a directory"

    if suffix not in {".docx", ".pptx", ".xlsx"}:
        return None, f"Error: {output_file} must be a .docx, .pptx, or .xlsx file"

    if validate and original_file:
        original_path = Path(original_file)
        if original_path.exists():
            success, output = _run_validation(
                input_dir, original_path, suffix, infer_author_func
            )
            if output:
                print(output)
            if not success:
                return None, f"Error: Validation failed for {input_dir}"

    with tempfile.TemporaryDirectory() as temp_dir:
        temp_content_dir = Path(temp_dir) / "content"
        shutil.copytree(input_dir, temp_content_dir)

        for pattern in ["*.xml", "*.rels"]:
            for xml_file in temp_content_dir.rglob(pattern):
                _condense_xml(xml_file)

        output_path.parent.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(output_path, "w", zipfile.ZIP_DEFLATED) as zf:
            for f in temp_content_dir.rglob("*"):
                if f.is_file():
                    zf.write(f, f.relative_to(temp_content_dir))

    return None, f"Successfully packed {input_dir} to {output_file}"


def _run_validation(
    unpacked_dir: Path,
    original_file: Path,
    suffix: str,
    infer_author_func=None,
) -> tuple[bool, str | None]:
    output_lines = []
    validators = []

    if suffix == ".docx":
        author = "Claude"
        if infer_author_func:
            try:
                author = infer_author_func(unpacked_dir, original_file)
            except ValueError as e:
                print(f"Warning: {e} Using default author 'Claude'.", file=sys.stderr)

        validators = [
            DOCXSchemaValidator(unpacked_dir, original_file),
            RedliningValidator(unpacked_dir, original_file, author=author),
        ]
    elif suffix == ".pptx":
        validators = [PPTXSchemaValidator(unpacked_dir, original_file)]

    if not validators:
        return True, None

    total_repairs = sum(v.repair() for v in validators)
    if total_repairs:
        output_lines.append(f"Auto-repaired {total_repairs} issue(s)")

    success = all(v.validate() for v in validators)

    if success:
        output_lines.append("All validations PASSED!")

    return success, "\n".join(output_lines) if output_lines else None


def _condense_xml(xml_file: Path) -> None:
    try:
        with open(xml_file, encoding="utf-8") as f:
            dom = defusedxml.minidom.parse(f)

        for element in dom.getElementsByTagName("*"):
            if element.tagName.endswith(":t"):
                continue

            for child in list(element.childNodes):
                if (
                    child.nodeType == child.TEXT_NODE
                    and child.nodeValue
                    and child.nodeValue.strip() == ""
                ) or child.nodeType == child.COMMENT_NODE:
                    element.removeChild(child)

        xml_file.write_bytes(dom.toxml(encoding="UTF-8"))
    except Exception as e:
        print(f"ERROR: Failed to parse {xml_file.name}: {e}", file=sys.stderr)
        raise


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Pack a directory into a DOCX, PPTX, or XLSX file"
    )
    parser.add_argument("input_directory", help="Unpacked Office document directory")
    parser.add_argument("output_file", help="Output Office file (.docx/.pptx/.xlsx)")
    parser.add_argument(
        "--original",
        help="Original file for validation comparison",
    )
    parser.add_argument(
        "--validate",
        type=lambda x: x.lower() == "true",
        default=True,
        metavar="true|false",
        help="Run validation with auto-repair (default: true)",
    )
    args = parser.parse_args()

    _, message = pack(
        args.input_directory,
        args.output_file,
        original_file=args.original,
        validate=args.validate,
    )
    print(message)

    if "Error" in message:
        sys.exit(1)
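
What the condensing pass does to a pretty-printed part can be seen on a string instead of a file. The following is a sketch under assumptions: it uses the standard-library `xml.dom.minidom` rather than `defusedxml`, and the names `PRETTY` and `condense` are hypothetical. It mirrors the rule in `_condense_xml`: drop whitespace-only text nodes and comments, but leave anything inside a `:t` element alone.

```python
from xml.dom import minidom

# Hypothetical pretty-printed fragment, as it might look after unpacking.
PRETTY = """<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <!-- editor note -->
  <w:r>
    <w:t xml:space="preserve"> keep me </w:t>
  </w:r>
</w:p>"""


def condense(xml_text: str) -> str:
    """Strip indentation-only text nodes and comments outside text runs."""
    dom = minidom.parseString(xml_text)
    for element in dom.getElementsByTagName("*"):
        if element.tagName.endswith(":t"):
            continue  # inside a text run, whitespace is real content
        for child in list(element.childNodes):
            is_blank_text = (
                child.nodeType == child.TEXT_NODE
                and child.nodeValue.strip() == ""
            )
            if is_blank_text or child.nodeType == child.COMMENT_NODE:
                element.removeChild(child)
    return dom.documentElement.toxml()


result = condense(PRETTY)
print(result)
```

The spaces around "keep me" survive, while the indentation and the comment are gone. That is why the pass skips `:t` elements rather than stripping whitespace everywhere.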

scripts/office/soffice.py

"""
Helper for running LibreOffice (soffice) in environments where AF_UNIX
sockets may be blocked (e.g., sandboxed VMs).  Detects the restriction
at runtime and applies an LD_PRELOAD shim if needed.

Usage:
    from office.soffice import run_soffice, get_soffice_env

    # Option 1 – run soffice directly
    result = run_soffice(["--headless", "--convert-to", "pdf", "input.docx"])

    # Option 2 – get env dict for your own subprocess calls
    env = get_soffice_env()
    subprocess.run(["soffice", ...], env=env)
"""

import os
import socket
import subprocess
import tempfile
from pathlib import Path


def get_soffice_env() -> dict:
    env = os.environ.copy()
    env["SAL_USE_VCLPLUGIN"] = "svp"

    if _needs_shim():
        shim = _ensure_shim()
        env["LD_PRELOAD"] = str(shim)

    return env


def run_soffice(args: list[str], **kwargs) -> subprocess.CompletedProcess:
    env = get_soffice_env()
    return subprocess.run(["soffice"] + args, env=env, **kwargs)



_SHIM_SO = Path(tempfile.gettempdir()) / "lo_socket_shim.so"


def _needs_shim() -> bool:
    try:
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        s.close()
        return False
    except OSError:
        return True


def _ensure_shim() -> Path:
    if _SHIM_SO.exists():
        return _SHIM_SO

    src = Path(tempfile.gettempdir()) / "lo_socket_shim.c"
    src.write_text(_SHIM_SOURCE)
    subprocess.run(
        ["gcc", "-shared", "-fPIC", "-o", str(_SHIM_SO), str(src), "-ldl"],
        check=True,
        capture_output=True,
    )
    src.unlink()
    return _SHIM_SO



_SHIM_SOURCE = r"""
#define _GNU_SOURCE
#include <dlfcn.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

static int (*real_socket)(int, int, int);
static int (*real_socketpair)(int, int, int, int[2]);
static int (*real_listen)(int, int);
static int (*real_accept)(int, struct sockaddr *, socklen_t *);
static int (*real_close)(int);
static int (*real_read)(int, void *, size_t);

/* Per-FD bookkeeping (FDs >= 1024 are passed through unshimmed). */
static int is_shimmed[1024];
static int peer_of[1024];
static int wake_r[1024];            /* accept() blocks reading this */
static int wake_w[1024];            /* close()  writes to this      */
static int listener_fd = -1;        /* FD that received listen()    */

__attribute__((constructor))
static void init(void) {
    real_socket     = dlsym(RTLD_NEXT, "socket");
    real_socketpair = dlsym(RTLD_NEXT, "socketpair");
    real_listen     = dlsym(RTLD_NEXT, "listen");
    real_accept     = dlsym(RTLD_NEXT, "accept");
    real_close      = dlsym(RTLD_NEXT, "close");
    real_read       = dlsym(RTLD_NEXT, "read");
    for (int i = 0; i < 1024; i++) {
        peer_of[i] = -1;
        wake_r[i]  = -1;
        wake_w[i]  = -1;
    }
}

/* ---- socket ---------------------------------------------------------- */
int socket(int domain, int type, int protocol) {
    if (domain == AF_UNIX) {
        int fd = real_socket(domain, type, protocol);
        if (fd >= 0) return fd;
        /* socket(AF_UNIX) blocked – fall back to socketpair(). */
        int sv[2];
        if (real_socketpair(domain, type, protocol, sv) == 0) {
            if (sv[0] >= 0 && sv[0] < 1024) {
                is_shimmed[sv[0]] = 1;
                peer_of[sv[0]]    = sv[1];
                int wp[2];
                if (pipe(wp) == 0) {
                    wake_r[sv[0]] = wp[0];
                    wake_w[sv[0]] = wp[1];
                }
            }
            return sv[0];
        }
        errno = EPERM;
        return -1;
    }
    return real_socket(domain, type, protocol);
}

/* ---- listen ---------------------------------------------------------- */
int listen(int sockfd, int backlog) {
    if (sockfd >= 0 && sockfd < 1024 && is_shimmed[sockfd]) {
        listener_fd = sockfd;
        return 0;
    }
    return real_listen(sockfd, backlog);
}

/* ---- accept ---------------------------------------------------------- */
int accept(int sockfd, struct sockaddr *addr, socklen_t *addrlen) {
    if (sockfd >= 0 && sockfd < 1024 && is_shimmed[sockfd]) {
        /* Block until close() writes to the wake pipe. */
        if (wake_r[sockfd] >= 0) {
            char buf;
            real_read(wake_r[sockfd], &buf, 1);
        }
        errno = ECONNABORTED;
        return -1;
    }
    return real_accept(sockfd, addr, addrlen);
}

/* ---- close ----------------------------------------------------------- */
int close(int fd) {
    if (fd >= 0 && fd < 1024 && is_shimmed[fd]) {
        int was_listener = (fd == listener_fd);
        is_shimmed[fd] = 0;

        if (wake_w[fd] >= 0) {              /* unblock accept() */
            char c = 0;
            write(wake_w[fd], &c, 1);
            real_close(wake_w[fd]);
            wake_w[fd] = -1;
        }
        if (wake_r[fd] >= 0) { real_close(wake_r[fd]); wake_r[fd]  = -1; }
        if (peer_of[fd] >= 0) { real_close(peer_of[fd]); peer_of[fd] = -1; }

        if (was_listener)
            _exit(0);                        /* conversion done – exit */
    }
    return real_close(fd);
}
"""



if __name__ == "__main__":
    import sys
    result = run_soffice(sys.argv[1:])
    sys.exit(result.returncode)

scripts/office/unpack.py

Download scripts/office/unpack.py

"""Unpack Office files (DOCX, PPTX, XLSX) for editing.

Extracts the ZIP archive, pretty-prints XML files, and optionally:
- Merges adjacent runs with identical formatting (DOCX only)
- Simplifies adjacent tracked changes from same author (DOCX only)

Usage:
    python unpack.py <office_file> <output_dir> [options]

Examples:
    python unpack.py document.docx unpacked/
    python unpack.py presentation.pptx unpacked/
    python unpack.py document.docx unpacked/ --merge-runs false
"""

import argparse
import sys
import zipfile
from pathlib import Path

import defusedxml.minidom

from helpers.merge_runs import merge_runs as do_merge_runs
from helpers.simplify_redlines import simplify_redlines as do_simplify_redlines

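SMART_QUOTE_REPLACEMENTS = {
    "\u201c": "&#x201C;",  # left double quotation mark
    "\u201d": "&#x201D;",  # right double quotation mark
    "\u2018": "&#x2018;",  # left single quotation mark
    "\u2019": "&#x2019;",  # right single quotation mark
}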


def unpack(
    input_file: str,
    output_directory: str,
    merge_runs: bool = True,
    simplify_redlines: bool = True,
) -> tuple[None, str]:
    input_path = Path(input_file)
    output_path = Path(output_directory)
    suffix = input_path.suffix.lower()

    if not input_path.exists():
        return None, f"Error: {input_file} does not exist"

    if suffix not in {".docx", ".pptx", ".xlsx"}:
        return None, f"Error: {input_file} must be a .docx, .pptx, or .xlsx file"

    try:
        output_path.mkdir(parents=True, exist_ok=True)

        with zipfile.ZipFile(input_path, "r") as zf:
            zf.extractall(output_path)

        xml_files = list(output_path.rglob("*.xml")) + list(output_path.rglob("*.rels"))
        for xml_file in xml_files:
            _pretty_print_xml(xml_file)

        message = f"Unpacked {input_file} ({len(xml_files)} XML files)"

        if suffix == ".docx":
            if simplify_redlines:
                simplify_count, _ = do_simplify_redlines(str(output_path))
                message += f", simplified {simplify_count} tracked changes"

            if merge_runs:
                merge_count, _ = do_merge_runs(str(output_path))
                message += f", merged {merge_count} runs"

        for xml_file in xml_files:
            _escape_smart_quotes(xml_file)

        return None, message

    except zipfile.BadZipFile:
        return None, f"Error: {input_file} is not a valid Office file"
    except Exception as e:
        return None, f"Error unpacking: {e}"


def _pretty_print_xml(xml_file: Path) -> None:
    try:
        content = xml_file.read_text(encoding="utf-8")
        dom = defusedxml.minidom.parseString(content)
        xml_file.write_bytes(dom.toprettyxml(indent="  ", encoding="utf-8"))
    except Exception:
        pass  # leave files that cannot be parsed exactly as extracted


def _escape_smart_quotes(xml_file: Path) -> None:
    try:
        content = xml_file.read_text(encoding="utf-8")
        for char, entity in SMART_QUOTE_REPLACEMENTS.items():
            content = content.replace(char, entity)
        xml_file.write_text(content, encoding="utf-8")
    except Exception:
        pass


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Unpack an Office file (DOCX, PPTX, XLSX) for editing"
    )
    parser.add_argument("input_file", help="Office file to unpack")
    parser.add_argument("output_directory", help="Output directory")
    parser.add_argument(
        "--merge-runs",
        type=lambda x: x.lower() == "true",
        default=True,
        metavar="true|false",
        help="Merge adjacent runs with identical formatting (DOCX only, default: true)",
    )
    parser.add_argument(
        "--simplify-redlines",
        type=lambda x: x.lower() == "true",
        default=True,
        metavar="true|false",
        help="Merge adjacent tracked changes from same author (DOCX only, default: true)",
    )
    args = parser.parse_args()

    _, message = unpack(
        args.input_file,
        args.output_directory,
        merge_runs=args.merge_runs,
        simplify_redlines=args.simplify_redlines,
    )
    print(message)

    if "Error" in message:
        sys.exit(1)
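
The smart-quote escaping step is easy to check in isolation. The sketch below reuses the mapping from `SMART_QUOTE_REPLACEMENTS` but works on a string; `escape_smart_quotes` and the sample element `<t>` are hypothetical. It also parses the result with `xml.etree.ElementTree` to show that the character references decode back to the original characters.

```python
import xml.etree.ElementTree as ET

# Same mapping as SMART_QUOTE_REPLACEMENTS in unpack.py.
SMART_QUOTES = {
    "\u201c": "&#x201C;",
    "\u201d": "&#x201D;",
    "\u2018": "&#x2018;",
    "\u2019": "&#x2019;",
}


def escape_smart_quotes(text: str) -> str:
    """Replace curly quotes with numeric character references."""
    for char, entity in SMART_QUOTES.items():
        text = text.replace(char, entity)
    return text


original = "\u201cIt\u2019s fine\u201d"
escaped = escape_smart_quotes(f"<t>{original}</t>")

# An XML parser resolves the references, so the content is unchanged.
round_tripped = ET.fromstring(escaped).text

print(escaped)
print(round_tripped == original)
```

The escaped file is pure ASCII where the quotes were, so a plain-text edit cannot silently swap a curly quote for a straight one, yet the repacked document still contains the same characters.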

scripts/office/validate.py

Download scripts/office/validate.py

"""
Command line tool to validate Office document XML files against XSD schemas and tracked changes.

Usage:
    python validate.py <path> [--original <original_file>] [-v] [--auto-repair] [--author NAME]

The first argument can be either:
- An unpacked directory containing the Office document XML files
- A packed Office file (.docx/.pptx/.xlsx) which will be unpacked to a temp directory

Auto-repair fixes:
- paraId/durableId values that exceed OOXML limits
- Missing xml:space="preserve" on w:t elements with whitespace
"""

import argparse
import sys
import tempfile
import zipfile
from pathlib import Path

from validators import DOCXSchemaValidator, PPTXSchemaValidator, RedliningValidator


def main():
    parser = argparse.ArgumentParser(description="Validate Office document XML files")
    parser.add_argument(
        "path",
        help="Path to unpacked directory or packed Office file (.docx/.pptx/.xlsx)",
    )
    parser.add_argument(
        "--original",
        required=False,
        default=None,
        help="Path to original file (.docx/.pptx/.xlsx). If omitted, all XSD errors are reported and redlining validation is skipped.",
    )
    parser.add_argument(
        "-v",
        "--verbose",
        action="store_true",
        help="Enable verbose output",
    )
    parser.add_argument(
        "--auto-repair",
        action="store_true",
        help="Automatically repair common issues (hex IDs, whitespace preservation)",
    )
    parser.add_argument(
        "--author",
        default="Claude",
        help="Author name for redlining validation (default: Claude)",
    )
    args = parser.parse_args()

    path = Path(args.path)
    assert path.exists(), f"Error: {path} does not exist"

    original_file = None
    if args.original:
        original_file = Path(args.original)
        assert original_file.is_file(), f"Error: {original_file} is not a file"
        assert original_file.suffix.lower() in [".docx", ".pptx", ".xlsx"], (
            f"Error: {original_file} must be a .docx, .pptx, or .xlsx file"
        )

    file_extension = (original_file or path).suffix.lower()
    assert file_extension in [".docx", ".pptx", ".xlsx"], (
        f"Error: Cannot determine file type from {path}. Use --original or provide a .docx/.pptx/.xlsx file."
    )

    if path.is_file() and path.suffix.lower() in [".docx", ".pptx", ".xlsx"]:
        temp_dir = tempfile.mkdtemp()
        with zipfile.ZipFile(path, "r") as zf:
            zf.extractall(temp_dir)
        unpacked_dir = Path(temp_dir)
    else:
        assert path.is_dir(), f"Error: {path} is not a directory or Office file"
        unpacked_dir = path

    match file_extension:
        case ".docx":
            validators = [
                DOCXSchemaValidator(unpacked_dir, original_file, verbose=args.verbose),
            ]
            if original_file:
                validators.append(
                    RedliningValidator(unpacked_dir, original_file, verbose=args.verbose, author=args.author)  
                )
        case ".pptx":
            validators = [
                PPTXSchemaValidator(unpacked_dir, original_file, verbose=args.verbose),
            ]
        case _:
            print(f"Error: Validation not supported for file type {file_extension}")
            sys.exit(1)

    if args.auto_repair:
        total_repairs = sum(v.repair() for v in validators)
        if total_repairs:
            print(f"Auto-repaired {total_repairs} issue(s)")

    success = all(v.validate() for v in validators)

    if success:
        print("All validations PASSED!")

    sys.exit(0 if success else 1)


if __name__ == "__main__":
    main()
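
The docstring lists missing `xml:space="preserve"` on `w:t` elements as one of the auto-repair fixes. The validator code that performs it is not shown in this section, so the following is only an illustration of that kind of repair, with a hypothetical `add_space_preserve` function and `SAMPLE` paragraph, written against the standard-library `xml.dom.minidom`.

```python
from xml.dom import minidom

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical paragraph: the first run starts with a space, the second does not.
SAMPLE = (
    f'<w:p xmlns:w="{W}">'
    "<w:r><w:t> leading space</w:t></w:r>"
    "<w:r><w:t>plain</w:t></w:r>"
    "</w:p>"
)


def add_space_preserve(dom) -> int:
    """Mark w:t elements whose text has leading or trailing whitespace."""
    repairs = 0
    for t in dom.getElementsByTagName("w:t"):
        text = "".join(
            n.data for n in t.childNodes if n.nodeType == n.TEXT_NODE
        )
        needs_flag = text != text.strip()
        if needs_flag and t.getAttribute("xml:space") != "preserve":
            t.setAttribute("xml:space", "preserve")
            repairs += 1
    return repairs


dom = minidom.parseString(SAMPLE)
repairs = add_space_preserve(dom)
texts = dom.getElementsByTagName("w:t")
print(repairs, texts[0].toxml())
```

Running the function a second time on the same DOM reports no further repairs, which matches how a repair count like "Auto-repaired N issue(s)" is expected to behave.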

scripts/office/validators/__init__.py

Download scripts/office/validators/__init__.py

"""
Validation modules for Office document processing (DOCX and PPTX).
"""

from .base import BaseSchemaValidator
from .docx import DOCXSchemaValidator
from .pptx import PPTXSchemaValidator
from .redlining import RedliningValidator

__all__ = [
    "BaseSchemaValidator",
    "DOCXSchemaValidator",
    "PPTXSchemaValidator",
    "RedliningValidator",
]
