사용자가 PDF 파일로 무엇이든 하고 싶을 때마다 이 기술을 사용하세요. 여기에는 PDF에서 텍스트/표 읽기 또는 추출, 여러 PDF를 하나로 결합 또는 병합, PDF 분할, 페이지 회전, 워터마크 추가, 새 PDF 생성, PDF 양식 채우기, PDF 암호화/암호 해독, 이미지 추출 및 스캔한 PDF에서 OCR을 포함하여 검색 가능하게 만듭니다. 사용자가.pdf 파일을 언급하거나 파일 생성을 요청하는 경우 이 기술을 사용하세요.
출처: 인류학/기술(MIT)에서 채택한 콘텐츠.
개요
이 가이드에서는 Python 라이브러리와 명령줄 도구를 사용한 필수 PDF 처리 작업을 다룹니다. 고급 기능, JavaScript 라이브러리 및 자세한 예는 REFERENCE.md를 참조하세요. PDF 양식을 작성해야 하는 경우 FORMS.md를 읽고 해당 지침을 따르십시오.
빠른 시작
from pypdf import PdfReader, PdfWriter
# Read a PDF
reader = PdfReader("document.pdf")
print(f"Pages: {len(reader.pages)}")
# Extract text
text = ""
for page in reader.pages:
text += page.extract_text()Python 라이브러리
pypdf - 기본 작업
PDF 병합
from pypdf import PdfWriter, PdfReader
writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
reader = PdfReader(pdf_file)
for page in reader.pages:
writer.add_page(page)
with open("merged.pdf", "wb") as output:
writer.write(output)PDF 분할
reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
writer = PdfWriter()
writer.add_page(page)
with open(f"page_{i+1}.pdf", "wb") as output:
writer.write(output)메타데이터 추출
reader = PdfReader("document.pdf")
meta = reader.metadata
print(f"Title: {meta.title}")
print(f"Author: {meta.author}")
print(f"Subject: {meta.subject}")
print(f"Creator: {meta.creator}")페이지 회전
reader = PdfReader("input.pdf")
writer = PdfWriter()
page = reader.pages[0]
page.rotate(90) # Rotate 90 degrees clockwise
writer.add_page(page)
with open("rotated.pdf", "wb") as output:
writer.write(output)pdfplumumber - 텍스트 및 테이블 추출
레이아웃으로 텍스트 추출
import pdfplumber
with pdfplumber.open("document.pdf") as pdf:
for page in pdf.pages:
text = page.extract_text()
print(text)테이블 추출
with pdfplumber.open("document.pdf") as pdf:
for i, page in enumerate(pdf.pages):
tables = page.extract_tables()
for j, table in enumerate(tables):
print(f"Table {j+1} on page {i+1}:")
for row in table:
print(row)고급 테이블 추출
import pandas as pd
with pdfplumber.open("document.pdf") as pdf:
all_tables = []
for page in pdf.pages:
tables = page.extract_tables()
for table in tables:
if table: # Check if table is not empty
df = pd.DataFrame(table[1:], columns=table[0])
all_tables.append(df)
# Combine all tables
if all_tables:
combined_df = pd.concat(all_tables, ignore_index=True)
combined_df.to_excel("extracted_tables.xlsx", index=False)Reportlab - PDF 만들기
기본 PDF 생성
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas
c = canvas.Canvas("hello.pdf", pagesize=letter)
width, height = letter
# Add text
c.drawString(100, height - 100, "Hello World!")
c.drawString(100, height - 120, "This is a PDF created with reportlab")
# Add a line
c.line(100, height - 140, 400, height - 140)
# Save
c.save()여러 페이지로 PDF 만들기
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet
doc = SimpleDocTemplate("report.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []
# Add content
title = Paragraph("Report Title", styles['Title'])
story.append(title)
story.append(Spacer(1, 12))
body = Paragraph("This is the body of the report. " * 20, styles['Normal'])
story.append(body)
story.append(PageBreak())
# Page 2
story.append(Paragraph("Page 2", styles['Heading1']))
story.append(Paragraph("Content for page 2", styles['Normal']))
# Build PDF
doc.build(story)아래 첨자와 위 첨자
중요: ReportLab PDF에는 유니코드 아래 첨자/위 첨자 문자(0123456789, 0123456789)를 사용하지 마십시오. 내장 글꼴에는 이러한 문자 모양이 포함되어 있지 않으므로 단색 검정 상자로 렌더링됩니다.
대신 Paragraph 개체에 ReportLab의 XML 마크업 태그를 사용하세요.
from reportlab.platypus import Paragraph
from reportlab.lib.styles import getSampleStyleSheet
styles = getSampleStyleSheet()
# Subscripts: use <sub> tag
chemical = Paragraph("H<sub>2</sub>O", styles['Normal'])
# Superscripts: use <super> tag
squared = Paragraph("x<super>2</super> + y<super>2</super>", styles['Normal'])캔버스로 그린 텍스트(단락 개체 아님)의 경우 유니코드 아래 첨자/위 첨자를 사용하는 대신 글꼴 크기와 위치를 수동으로 조정합니다.
명령줄 도구
pdftotext (poppler-utils)
# Extract text
pdftotext input.pdf output.txt
# Extract text preserving layout
pdftotext -layout input.pdf output.txt
# Extract specific pages
pdftotext -f 1 -l 5 input.pdf output.txt # Pages 1-5qpdf
# Merge PDFs
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf
# Split pages
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
qpdf input.pdf --pages . 6-10 -- pages6-10.pdf
# Rotate pages
qpdf input.pdf output.pdf --rotate=+90:1 # Rotate page 1 by 90 degrees
# Remove password
qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdfpdftk(사용 가능한 경우)
# Merge
pdftk file1.pdf file2.pdf cat output merged.pdf
# Split
pdftk input.pdf burst
# Rotate
pdftk input.pdf rotate 1east output rotated.pdf일반적인 작업
스캔한 PDF에서 텍스트 추출
# Requires: pip install pytesseract pdf2image
import pytesseract
from pdf2image import convert_from_path
# Convert PDF to images
images = convert_from_path('scanned.pdf')
# OCR each page
text = ""
for i, image in enumerate(images):
text += f"Page {i+1}:\n"
text += pytesseract.image_to_string(image)
text += "\n\n"
print(text)워터마크 추가
from pypdf import PdfReader, PdfWriter
# Create watermark (or load existing)
watermark = PdfReader("watermark.pdf").pages[0]
# Apply to all pages
reader = PdfReader("document.pdf")
writer = PdfWriter()
for page in reader.pages:
page.merge_page(watermark)
writer.add_page(page)
with open("watermarked.pdf", "wb") as output:
writer.write(output)이미지 추출
# Using pdfimages (poppler-utils)
pdfimages -j input.pdf output_prefix
# This extracts all images as output_prefix-000.jpg, output_prefix-001.jpg, etc.비밀번호 보호
from pypdf import PdfReader, PdfWriter
reader = PdfReader("input.pdf")
writer = PdfWriter()
for page in reader.pages:
writer.add_page(page)
# Add password
writer.encrypt("userpassword", "ownerpassword")
with open("encrypted.pdf", "wb") as output:
writer.write(output)빠른 참조
| 작업 | 최고의 도구 | 명령/코드 |
|---|---|---|
| PDF 병합 | pypdf | writer.add_page(page) |
| PDF 분할 | pypdf | 파일당 한 페이지 |
| 텍스트 추출 | pdf배관공 | page.extract_text() |
| 테이블 추출 | pdf배관공 | page.extract_tables() |
| PDF 작성 | 보고서 연구실 | 캔버스 또는 오리너구리 |
| 명령줄 병합 | qpdf | qpdf --empty --pages... |
| OCR 스캔 PDF | 피테서랙트 | 먼저 이미지로 변환 |
| PDF 양식 채우기 | pdf-lib 또는 pypdf(FORMS.md 참조) | FORMS.md |
다음 단계
- 고급 pypdfium2 사용법은 REFERENCE.md를 참조하세요.
- JavaScript 라이브러리(pdf-lib)의 경우 REFERENCE.md를 참조하세요.
- PDF 양식을 작성해야 하는 경우 FORMS.md의 지침을 따르세요.
- 문제 해결 가이드는 REFERENCE.md를 참조하세요.
리소스 파일
라이센스.txt
바이너리 리소스
form.md
바이너리 리소스
참조.md
바이너리 리소스
스크립트/check_bounding_boxes.py
스크립트 다운로드/check_bounding_boxes.py
from dataclasses import dataclass
import json
import sys
@dataclass
class RectAndField:
rect: list[float]
rect_type: str
field: dict
def get_bounding_box_messages(fields_json_stream) -> list[str]:
messages = []
fields = json.load(fields_json_stream)
messages.append(f"Read {len(fields['form_fields'])} fields")
def rects_intersect(r1, r2):
disjoint_horizontal = r1[0] >= r2[2] or r1[2] <= r2[0]
disjoint_vertical = r1[1] >= r2[3] or r1[3] <= r2[1]
return not (disjoint_horizontal or disjoint_vertical)
rects_and_fields = []
for f in fields["form_fields"]:
rects_and_fields.append(RectAndField(f["label_bounding_box"], "label", f))
rects_and_fields.append(RectAndField(f["entry_bounding_box"], "entry", f))
has_error = False
for i, ri in enumerate(rects_and_fields):
for j in range(i + 1, len(rects_and_fields)):
rj = rects_and_fields[j]
if ri.field["page_number"] == rj.field["page_number"] and rects_intersect(ri.rect, rj.rect):
has_error = True
if ri.field is rj.field:
messages.append(f"FAILURE: intersection between label and entry bounding boxes for `{ri.field['description']}` ({ri.rect}, {rj.rect})")
else:
messages.append(f"FAILURE: intersection between {ri.rect_type} bounding box for `{ri.field['description']}` ({ri.rect}) and {rj.rect_type} bounding box for `{rj.field['description']}` ({rj.rect})")
if len(messages) >= 20:
messages.append("Aborting further checks; fix bounding boxes and try again")
return messages
if ri.rect_type == "entry":
if "entry_text" in ri.field:
font_size = ri.field["entry_text"].get("font_size", 14)
entry_height = ri.rect[3] - ri.rect[1]
if entry_height < font_size:
has_error = True
messages.append(f"FAILURE: entry bounding box height ({entry_height}) for `{ri.field['description']}` is too short for the text content (font size: {font_size}). Increase the box height or decrease the font size.")
if len(messages) >= 20:
messages.append("Aborting further checks; fix bounding boxes and try again")
return messages
if not has_error:
messages.append("SUCCESS: All bounding boxes are valid")
return messages
if __name__ == "__main__":
if len(sys.argv) != 2:
print("Usage: check_bounding_boxes.py [fields.json]")
sys.exit(1)
with open(sys.argv[1]) as f:
messages = get_bounding_box_messages(f)
for msg in messages:
print(msg)스크립트/check_fillable_fields.py
스크립트 다운로드/check_fillable_fields.py
import sys
from pypdf import PdfReader
reader = PdfReader(sys.argv[1])
if (reader.get_fields()):
print("This PDF has fillable form fields")
else:
print("This PDF does not have fillable form fields; you will need to visually determine where to enter data")스크립트/convert_pdf_to_images.py
스크립트 다운로드/convert_pdf_to_images.py
import os
import sys
from pdf2image import convert_from_path
def convert(pdf_path, output_dir, max_dim=1000):
images = convert_from_path(pdf_path, dpi=200)
for i, image in enumerate(images):
width, height = image.size
if width > max_dim or height > max_dim:
scale_factor = min(max_dim / width, max_dim / height)
new_width = int(width * scale_factor)
new_height = int(height * scale_factor)
image = image.resize((new_width, new_height))
image_path = os.path.join(output_dir, f"page_{i+1}.png")
image.save(image_path)
print(f"Saved page {i+1} as {image_path} (size: {image.size})")
print(f"Converted {len(images)} pages to PNG images")
if __name__ == "__main__":
if len(sys.argv) != 3:
print("Usage: convert_pdf_to_images.py [input pdf] [output directory]")
sys.exit(1)
pdf_path = sys.argv[1]
output_directory = sys.argv[2]
convert(pdf_path, output_directory)스크립트/create_validation_image.py
스크립트 다운로드/create_validation_image.py
import json
import sys
from PIL import Image, ImageDraw
def create_validation_image(page_number, fields_json_path, input_path, output_path):
with open(fields_json_path, 'r') as f:
data = json.load(f)
img = Image.open(input_path)
draw = ImageDraw.Draw(img)
num_boxes = 0
for field in data["form_fields"]:
if field["page_number"] == page_number:
entry_box = field['entry_bounding_box']
label_box = field['label_bounding_box']
draw.rectangle(entry_box, outline='red', width=2)
draw.rectangle(label_box, outline='blue', width=2)
num_boxes += 2
img.save(output_path)
print(f"Created validation image at {output_path} with {num_boxes} bounding boxes")
if __name__ == "__main__":
if len(sys.argv) != 5:
print("Usage: create_validation_image.py [page number] [fields.json file] [input image path] [output image path]")
sys.exit(1)
page_number = int(sys.argv[1])
fields_json_path = sys.argv[2]
input_image_path = sys.argv[3]
output_image_path = sys.argv[4]
create_validation_image(page_number, fields_json_path, input_image_path, output_image_path)스크립트/extract_form_field_info.py
스크립트 다운로드/extract_form_field_info.py
import json
import sys
from pypdf import PdfReader
def get_full_annotation_field_id(annotation):
components = []
while annotation:
field_name = annotation.get('/T')
if field_name:
components.append(field_name)
annotation = annotation.get('/Parent')
return ".".join(reversed(components)) if components else None
def make_field_dict(field, field_id):
field_dict = {"field_id": field_id}
ft = field.get('/FT')
if ft == "/Tx":
field_dict["type"] = "text"
elif ft == "/Btn":
field_dict["type"] = "checkbox"
states = field.get("/_States_", [])
if len(states) == 2:
if "/Off" in states:
field_dict["checked_value"] = states[0] if states[0] != "/Off" else states[1]
field_dict["unchecked_value"] = "/Off"
else:
print(f"Unexpected state values for checkbox `${field_id}`. Its checked and unchecked values may not be correct; if you're trying to check it, visually verify the results.")
field_dict["checked_value"] = states[0]
field_dict["unchecked_value"] = states[1]
elif ft == "/Ch":
field_dict["type"] = "choice"
states = field.get("/_States_", [])
field_dict["choice_options"] = [{
"value": state[0],
"text": state[1],
} for state in states]
else:
field_dict["type"] = f"unknown ({ft})"
return field_dict
def get_field_info(reader: PdfReader):
fields = reader.get_fields()
field_info_by_id = {}
possible_radio_names = set()
for field_id, field in fields.items():
if field.get("/Kids"):
if field.get("/FT") == "/Btn":
possible_radio_names.add(field_id)
continue
field_info_by_id[field_id] = make_field_dict(field, field_id)
radio_fields_by_id = {}
for page_index, page in enumerate(reader.pages):
annotations = page.get('/Annots', [])
for ann in annotations:
field_id = get_full_annotation_field_id(ann)
if field_id in field_info_by_id:
field_info_by_id[field_id]["page"] = page_index + 1
field_info_by_id[field_id]["rect"] = ann.get('/Rect')
elif field_id in possible_radio_names:
try:
on_values = [v for v in ann["/AP"]["/N"] if v != "/Off"]
except KeyError:
continue
if len(on_values) == 1:
rect = ann.get("/Rect")
if field_id not in radio_fields_by_id:
radio_fields_by_id[field_id] = {
"field_id": field_id,
"type": "radio_group",
"page": page_index + 1,
"radio_options": [],
}
radio_fields_by_id[field_id]["radio_options"].append({
"value": on_values[0],
"rect": rect,
})
fields_with_location = []
for field_info in field_info_by_id.values():
if "page" in field_info:
fields_with_location.append(field_info)
else:
print(f"Unable to determine location for field id: {field_info.get('field_id')}, ignoring")
def sort_key(f):
if "radio_options" in f:
rect = f["radio_options"][0]["rect"] or [0, 0, 0, 0]
else:
rect = f.get("rect") or [0, 0, 0, 0]
adjusted_position = [-rect[1], rect[0]]
return [f.get("page"), adjusted_position]
sorted_fields = fields_with_location + list(radio_fields_by_id.values())
sorted_fields.sort(key=sort_key)
return sorted_fields
def write_field_info(pdf_path: str, json_output_path: str):
reader = PdfReader(pdf_path)
field_info = get_field_info(reader)
with open(json_output_path, "w") as f:
json.dump(field_info, f, indent=2)
print(f"Wrote {len(field_info)} fields to {json_output_path}")
if __name__ == "__main__":
if len(sys.argv) != 3:
print("Usage: extract_form_field_info.py [input pdf] [output json]")
sys.exit(1)
write_field_info(sys.argv[1], sys.argv[2])스크립트/extract_form_structure.py
스크립트 다운로드/extract_form_structure.py
"""
Extract form structure from a non-fillable PDF.
This script analyzes the PDF to find:
- Text labels with their exact coordinates
- Horizontal lines (row boundaries)
- Checkboxes (small rectangles)
Output: A JSON file with the form structure that can be used to generate
accurate field coordinates for filling.
Usage: python extract_form_structure.py <input.pdf> <output.json>
"""
import json
import sys
import pdfplumber
def extract_form_structure(pdf_path):
structure = {
"pages": [],
"labels": [],
"lines": [],
"checkboxes": [],
"row_boundaries": []
}
with pdfplumber.open(pdf_path) as pdf:
for page_num, page in enumerate(pdf.pages, 1):
structure["pages"].append({
"page_number": page_num,
"width": float(page.width),
"height": float(page.height)
})
words = page.extract_words()
for word in words:
structure["labels"].append({
"page": page_num,
"text": word["text"],
"x0": round(float(word["x0"]), 1),
"top": round(float(word["top"]), 1),
"x1": round(float(word["x1"]), 1),
"bottom": round(float(word["bottom"]), 1)
})
for line in page.lines:
if abs(float(line["x1"]) - float(line["x0"])) > page.width * 0.5:
structure["lines"].append({
"page": page_num,
"y": round(float(line["top"]), 1),
"x0": round(float(line["x0"]), 1),
"x1": round(float(line["x1"]), 1)
})
for rect in page.rects:
width = float(rect["x1"]) - float(rect["x0"])
height = float(rect["bottom"]) - float(rect["top"])
if 5 <= width <= 15 and 5 <= height <= 15 and abs(width - height) < 2:
structure["checkboxes"].append({
"page": page_num,
"x0": round(float(rect["x0"]), 1),
"top": round(float(rect["top"]), 1),
"x1": round(float(rect["x1"]), 1),
"bottom": round(float(rect["bottom"]), 1),
"center_x": round((float(rect["x0"]) + float(rect["x1"])) / 2, 1),
"center_y": round((float(rect["top"]) + float(rect["bottom"])) / 2, 1)
})
lines_by_page = {}
for line in structure["lines"]:
page = line["page"]
if page not in lines_by_page:
lines_by_page[page] = []
lines_by_page[page].append(line["y"])
for page, y_coords in lines_by_page.items():
y_coords = sorted(set(y_coords))
for i in range(len(y_coords) - 1):
structure["row_boundaries"].append({
"page": page,
"row_top": y_coords[i],
"row_bottom": y_coords[i + 1],
"row_height": round(y_coords[i + 1] - y_coords[i], 1)
})
return structure
def main():
if len(sys.argv) != 3:
print("Usage: extract_form_structure.py <input.pdf> <output.json>")
sys.exit(1)
pdf_path = sys.argv[1]
output_path = sys.argv[2]
print(f"Extracting structure from {pdf_path}...")
structure = extract_form_structure(pdf_path)
with open(output_path, "w") as f:
json.dump(structure, f, indent=2)
print(f"Found:")
print(f" - {len(structure['pages'])} pages")
print(f" - {len(structure['labels'])} text labels")
print(f" - {len(structure['lines'])} horizontal lines")
print(f" - {len(structure['checkboxes'])} checkboxes")
print(f" - {len(structure['row_boundaries'])} row boundaries")
print(f"Saved to {output_path}")
if __name__ == "__main__":
main()스크립트/fill_fillable_fields.py
스크립트 다운로드/fill_fillable_fields.py
import json
import sys
from pypdf import PdfReader, PdfWriter
from extract_form_field_info import get_field_info
def fill_pdf_fields(input_pdf_path: str, fields_json_path: str, output_pdf_path: str):
with open(fields_json_path) as f:
fields = json.load(f)
fields_by_page = {}
for field in fields:
if "value" in field:
field_id = field["field_id"]
page = field["page"]
if page not in fields_by_page:
fields_by_page[page] = {}
fields_by_page[page][field_id] = field["value"]
reader = PdfReader(input_pdf_path)
has_error = False
field_info = get_field_info(reader)
fields_by_ids = {f["field_id"]: f for f in field_info}
for field in fields:
existing_field = fields_by_ids.get(field["field_id"])
if not existing_field:
has_error = True
print(f"ERROR: `{field['field_id']}` is not a valid field ID")
elif field["page"] != existing_field["page"]:
has_error = True
print(f"ERROR: Incorrect page number for `{field['field_id']}` (got {field['page']}, expected {existing_field['page']})")
else:
if "value" in field:
err = validation_error_for_field_value(existing_field, field["value"])
if err:
print(err)
has_error = True
if has_error:
sys.exit(1)
writer = PdfWriter(clone_from=reader)
for page, field_values in fields_by_page.items():
writer.update_page_form_field_values(writer.pages[page - 1], field_values, auto_regenerate=False)
writer.set_need_appearances_writer(True)
with open(output_pdf_path, "wb") as f:
writer.write(f)
def validation_error_for_field_value(field_info, field_value):
field_type = field_info["type"]
field_id = field_info["field_id"]
if field_type == "checkbox":
checked_val = field_info["checked_value"]
unchecked_val = field_info["unchecked_value"]
if field_value != checked_val and field_value != unchecked_val:
return f'ERROR: Invalid value "{field_value}" for checkbox field "{field_id}". The checked value is "{checked_val}" and the unchecked value is "{unchecked_val}"'
elif field_type == "radio_group":
option_values = [opt["value"] for opt in field_info["radio_options"]]
if field_value not in option_values:
return f'ERROR: Invalid value "{field_value}" for radio group field "{field_id}". Valid values are: {option_values}'
elif field_type == "choice":
choice_values = [opt["value"] for opt in field_info["choice_options"]]
if field_value not in choice_values:
return f'ERROR: Invalid value "{field_value}" for choice field "{field_id}". Valid values are: {choice_values}'
return None
def monkeypatch_pydpf_method():
from pypdf.generic import DictionaryObject
from pypdf.constants import FieldDictionaryAttributes
original_get_inherited = DictionaryObject.get_inherited
def patched_get_inherited(self, key: str, default = None):
result = original_get_inherited(self, key, default)
if key == FieldDictionaryAttributes.Opt:
if isinstance(result, list) and all(isinstance(v, list) and len(v) == 2 for v in result):
result = [r[0] for r in result]
return result
DictionaryObject.get_inherited = patched_get_inherited
if __name__ == "__main__":
if len(sys.argv) != 4:
print("Usage: fill_fillable_fields.py [input pdf] [field_values.json] [output pdf]")
sys.exit(1)
monkeypatch_pydpf_method()
input_pdf = sys.argv[1]
fields_json = sys.argv[2]
output_pdf = sys.argv[3]
fill_pdf_fields(input_pdf, fields_json, output_pdf)스크립트/fill_pdf_form_with_annotations.py
스크립트 다운로드/fill_pdf_form_with_annotations.py
import json
import sys
from pypdf import PdfReader, PdfWriter
from pypdf.annotations import FreeText
def transform_from_image_coords(bbox, image_width, image_height, pdf_width, pdf_height):
x_scale = pdf_width / image_width
y_scale = pdf_height / image_height
left = bbox[0] * x_scale
right = bbox[2] * x_scale
top = pdf_height - (bbox[1] * y_scale)
bottom = pdf_height - (bbox[3] * y_scale)
return left, bottom, right, top
def transform_from_pdf_coords(bbox, pdf_height):
left = bbox[0]
right = bbox[2]
pypdf_top = pdf_height - bbox[1]
pypdf_bottom = pdf_height - bbox[3]
return left, pypdf_bottom, right, pypdf_top
def fill_pdf_form(input_pdf_path, fields_json_path, output_pdf_path):
with open(fields_json_path, "r") as f:
fields_data = json.load(f)
reader = PdfReader(input_pdf_path)
writer = PdfWriter()
writer.append(reader)
pdf_dimensions = {}
for i, page in enumerate(reader.pages):
mediabox = page.mediabox
pdf_dimensions[i + 1] = [mediabox.width, mediabox.height]
annotations = []
for field in fields_data["form_fields"]:
page_num = field["page_number"]
page_info = next(p for p in fields_data["pages"] if p["page_number"] == page_num)
pdf_width, pdf_height = pdf_dimensions[page_num]
if "pdf_width" in page_info:
transformed_entry_box = transform_from_pdf_coords(
field["entry_bounding_box"],
float(pdf_height)
)
else:
image_width = page_info["image_width"]
image_height = page_info["image_height"]
transformed_entry_box = transform_from_image_coords(
field["entry_bounding_box"],
image_width, image_height,
float(pdf_width), float(pdf_height)
)
if "entry_text" not in field or "text" not in field["entry_text"]:
continue
entry_text = field["entry_text"]
text = entry_text["text"]
if not text:
continue
font_name = entry_text.get("font", "Arial")
font_size = str(entry_text.get("font_size", 14)) + "pt"
font_color = entry_text.get("font_color", "000000")
annotation = FreeText(
text=text,
rect=transformed_entry_box,
font=font_name,
font_size=font_size,
font_color=font_color,
border_color=None,
background_color=None,
)
annotations.append(annotation)
writer.add_annotation(page_number=page_num - 1, annotation=annotation)
with open(output_pdf_path, "wb") as output:
writer.write(output)
print(f"Successfully filled PDF form and saved to {output_pdf_path}")
print(f"Added {len(annotations)} text annotations")
if __name__ == "__main__":
if len(sys.argv) != 4:
print("Usage: fill_pdf_form_with_annotations.py [input pdf] [fields.json] [output pdf]")
sys.exit(1)
input_pdf = sys.argv[1]
fields_json = sys.argv[2]
output_pdf = sys.argv[3]
fill_pdf_form(input_pdf, fields_json, output_pdf)GitHub에서 보기
Docx
사용자가 Word 문서(.docx 파일)를 생성, 읽기, 편집 또는 조작하려고 할 때마다 이 기술을 사용하십시오. 트리거에는 'Word 문서', '워드 문서', '.docx'에 대한 언급 또는 목차, 제목, 페이지 번호 또는 레터헤드와 같은 형식의 전문 문서 생성 요청이 포함됩니다. 또한.docx 파일에서 콘텐츠를 추출 또는 재구성할 때, 문서에 이미지를 삽입하거나 바꿀 때, Word 파일에서 찾기 및 바꾸기를 수행할 때, 추적된 변경 내용이나 주석 작업을 할 때, 콘텐츠를 세련된 Word 문서로 변환할 때 사용합니다. 사용자가 '보고서', '메모', '편지', '템플릿' 또는 Word 또는.docx 파일과 유사한 결과물을 요청하는 경우 이 기술을 사용하세요. PDF, 스프레드시트, Google Docs 또는 문서 생성과 관련 없는 일반 코딩 작업에는 사용하지 마세요.
Pptx
.pptx 파일이 어떤 방식으로든(입력, 출력 또는 둘 다) 관련될 때마다 이 기술을 사용하세요. 여기에는 슬라이드 데크, 피치 데크 또는 프레젠테이션 만들기가 포함됩니다..pptx 파일에서 텍스트 읽기, 구문 분석 또는 추출(추출된 콘텐츠가 이메일이나 요약과 같은 다른 곳에서 사용되는 경우에도) 기존 프리젠테이션을 편집, 수정 또는 업데이트합니다. 슬라이드 파일을 결합하거나 분할합니다. 템플릿, 레이아웃, 발표자 노트 또는 댓글 작업. 나중에 콘텐츠로 무엇을 할 계획인지에 관계없이 사용자가 "데크", "슬라이드", "프레젠테이션"을 언급하거나.pptx 파일 이름을 참조할 때마다 트리거됩니다..pptx 파일을 열거나 생성하거나 터치해야 하는 경우 이 기술을 사용하세요.
클로데스킬스 문서