Building a Permanently Free PDF Editing Tool in Python

Author: 芝麻觀 2020-08-20 14:15:11

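Preface:

PDF (Portable Document Format) is a file format we run into all the time; papers, reports, and many other documents are distributed as PDFs. Its main strength is a stable layout: colors and formatting are preserved as faithfully as possible when a file is printed, shared, or transferred.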

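However, when it comes to editability, that same stability creates a different headache for users.

Once, to replace a single page in a PDF, I tried nearly every mainstream PDF tool on the market and in the end still had to pay for one to get the job done.

Thinking it over afterwards: if commercial software is this unreliable, why not build a tool myself? When a few dozen lines of code can solve the problem, why go to the trouble of downloading and installing software with so few scruples?

This article shows how to put together a PDF editing tool in Python with little effort. It can convert PDF to TXT, and split, merge, crop, and convert documents.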

Enter the stars of the show: PyPDF2 and pdfminer3k

PyPDF2

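Overview: a PDF toolkit written in pure Python. It can:

Extract document information (title, author, and so on)
Split documents page by page
Merge documents page by page
Crop pages
Merge multiple pages into a single page
Encrypt and decrypt PDF files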

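Installation

Install it directly with pip:

pip install PyPDF2

The examples below use the PdfFileReader and PdfFileWriter classes that PyPDF2 shipped when this article was written. PyPDF2 3.0 removed those names in favor of PdfReader and PdfWriter, so running the code unchanged requires an older release, for example pip install "PyPDF2<3".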

Working with the code

Basic PDF reading and writing

from PyPDF2 import PdfFileReader, PdfFileWriter

infn = 'infn.pdf'
outfn = 'outfn.pdf'
# Create a PdfFileReader object
pdf_input = PdfFileReader(open(infn, 'rb'))
# Get the document's basic information
information = pdf_input.getDocumentInfo()
print(information)
# Get the number of pages
page_count = pdf_input.getNumPages()
print(page_count)
# Get a PageObject; indexes are 0-based, so 0 is the first page
page = pdf_input.getPage(0)

# Create a PdfFileWriter object
pdf_output = PdfFileWriter()
# Add the PageObject to the PdfFileWriter
pdf_output.addPage(page)
# Write the result to a file
with open(outfn, 'wb') as f:
    pdf_output.write(f)
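
The capability list mentions splitting a document page by page, and that needs nothing beyond the read and write calls above. Here is a minimal sketch, assuming a reader that exposes the getNumPages and getPage methods used in this article and a writer factory such as PdfFileWriter. The names chunk_ranges and split_pages are hypothetical helpers of my own, not part of PyPDF2.

```python
def chunk_ranges(page_count, size=1):
    """Return the 0-based page ranges for pieces of at most `size` pages."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [range(start, min(start + size, page_count))
            for start in range(0, page_count, size)]


def split_pages(reader, make_writer, size=1):
    """Return one writer per piece; each writer still has to be written to a file."""
    writers = []
    for pages in chunk_ranges(reader.getNumPages(), size):
        writer = make_writer()  # assumed to be something like PdfFileWriter
        for i in pages:
            writer.addPage(reader.getPage(i))
        writers.append(writer)
    return writers


# A 5-page document split into pieces of 2 pages
print([list(r) for r in chunk_ranges(5, 2)])  # [[0, 1], [2, 3], [4]]
```

Keeping the range arithmetic in its own function makes it easy to check without opening a real PDF.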

Deleting PDF pages

from PyPDF2 import PdfFileWriter, PdfFileReader

# Create the writer that will hold the output PDF
output = PdfFileWriter()
# Open the source PDF
input1 = PdfFileReader(open("example.pdf", "rb"))

def delete_pdf(index):
    """Write every page except the 1-based page numbers listed in index."""
    pages = input1.getNumPages()
    for i in range(pages):
        # Skip the pages marked for deletion
        if i + 1 in index:
            continue
        output.addPage(input1.getPage(i))

    with open("PyPDF2-output.pdf", "wb") as output_stream:
        output.write(output_stream)

delete_pdf([2, 3, 4])
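
The deletion above is really a filter: 1-based page numbers go in, and the 0-based indexes that survive come out. A minimal sketch that separates that step so it can be checked on its own, assuming the same getNumPages, getPage, and addPage methods; pages_to_keep and copy_without are hypothetical names, and the range check is my addition.

```python
def pages_to_keep(page_count, delete_pages):
    """Return the 0-based indexes that survive deleting the 1-based delete_pages."""
    doomed = set(delete_pages)
    out_of_range = sorted(p for p in doomed if not 1 <= p <= page_count)
    if out_of_range:
        raise ValueError(f"page numbers out of range: {out_of_range}")
    return [i for i in range(page_count) if i + 1 not in doomed]


def copy_without(reader, writer, delete_pages):
    """Copy every surviving page from reader into writer."""
    for i in pages_to_keep(reader.getNumPages(), delete_pages):
        writer.addPage(reader.getPage(i))


# Deleting pages 2, 3, and 4 of a 5-page document keeps the first and last page
print(pages_to_keep(5, [2, 3, 4]))  # [0, 4]
```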

Merging PDFs

from PyPDF2 import PdfFileWriter, PdfFileReader

output = PdfFileWriter()
input1 = PdfFileReader(open("example.pdf", "rb"))
# The PDF that supplies the pages to insert
input2 = PdfFileReader(open("simple2.pdf", "rb"))

def merge_pdf(add_index, origin_index):
    """Insert input2 pages (0-based origin_index) before input1 pages (1-based add_index)."""
    pages = input1.getNumPages()
    k = 0
    for i in range(pages):
        if i + 1 in add_index:
            # Insert the k-th requested page of input2 ahead of page i + 1
            output.addPage(input2.getPage(origin_index[k]))
            k += 1
        output.addPage(input1.getPage(i))

    with open("PyPDF2-output.pdf", "wb") as output_stream:
        output.write(output_stream)

merge_pdf([2, 3, 4], [0, 0, 0])
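
To see what the merge produces before touching any file, the page order can be computed on its own. This is only a sketch: merge_plan is a hypothetical helper, the labels "input1" and "input2" simply mirror the variable names above, and pairing the two lists with a dict is my choice rather than the article's counter.

```python
def merge_plan(page_count, add_index, origin_index):
    """Return the output order as ("input1" | "input2", 0-based index) pairs."""
    if len(add_index) != len(origin_index):
        raise ValueError("add_index and origin_index must have the same length")
    # Map each 1-based insertion point to the input2 page inserted before it
    inserts = dict(zip(add_index, origin_index))
    plan = []
    for i in range(page_count):
        if i + 1 in inserts:
            plan.append(("input2", inserts[i + 1]))
        plan.append(("input1", i))
    return plan


# Insert page 0 of input2 before page 2 of a 3-page input1
print(merge_plan(3, [2], [0]))
# [('input1', 0), ('input2', 0), ('input1', 1), ('input1', 2)]
```

Pairing the lists by position means add_index does not have to be sorted, which the counter-based loop quietly assumes.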

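Rotating

# Rotate the second page (index 1) 90 degrees clockwise;
# add the page to a PdfFileWriter afterwards to save the result
input1.getPage(1).rotateClockwise(90)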
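
A PDF page stores its rotation as a multiple of 90 degrees, and as far as I know rotateClockwise rejects any other angle. The arithmetic can be sketched as follows; rotated is a hypothetical helper for illustration, not a PyPDF2 function.

```python
def rotated(current, degrees):
    """Return the rotation after turning `degrees` clockwise from `current`."""
    if degrees % 90 != 0:
        raise ValueError("rotation must be a multiple of 90 degrees")
    # Wrap around so the result stays within 0, 90, 180, 270
    return (current + degrees) % 360


print(rotated(0, 90))     # 90
print(rotated(270, 180))  # 90
print(rotated(0, -90))    # 270, i.e. 90 degrees counter-clockwise
```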

Adding a watermark

# Overlay the first page of watermark.pdf on the fourth page (index 3)
page = input1.getPage(3)
watermark = PdfFileReader(open("watermark.pdf", "rb"))
page.mergePage(watermark.getPage(0))
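
The snippet above stamps one page. To stamp a whole document, the same call can go inside a loop. A minimal sketch, assuming reader, writer, and page objects with the getNumPages, getPage, mergePage, and addPage methods used in this article; stamp_all is a hypothetical name.

```python
def stamp_all(reader, writer, watermark_page):
    """Overlay watermark_page on every page of reader and add the result to writer."""
    for i in range(reader.getNumPages()):
        page = reader.getPage(i)
        # mergePage modifies the page in place, drawing the watermark over it
        page.mergePage(watermark_page)
        writer.addPage(page)
    return writer


# Usage with the objects from the snippets above:
# stamp_all(input1, output, watermark.getPage(0))
```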

Encrypting

# Set the password on the writer before calling output.write()
password = "secret"
output.encrypt(password)

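Decrypting

# Decryption happens on a reader; "encrypted.pdf" is a placeholder file name
encrypted = PdfFileReader(open("encrypted.pdf", "rb"))
# Returns 0 for a wrong password, 1 for the user password, 2 for the owner password
print(encrypted.decrypt('secret'))
page_obj = encrypted.getPage(0)  # pages can only be read after a successful decrypt
print(page_obj.extractText())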

pdfminer3k

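Overview

pdfminer3k is a Python 3 port of pdfminer. PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner can report the exact location of text on a page, along with other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It also has an extensible PDF parser that can be used for purposes other than text analysis.

- Accurately obtains the position and layout of text;
- Converts PDF to formats such as HTML and XML;
- Extracts the table of contents;
- Extracts tagged content;
- Supports the common font types (Type1, TrueType, Type3, and CID);
- Supports Chinese, Japanese, and Korean text as well as vertical writing.

Installation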

pip install pdfminer3k

Working with files

import logging
from urllib.request import urlopen

from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfparser import PDFParser, PDFDocument

# Silence pdfminer's verbose logging
logging.Logger.propagate = False
logging.getLogger().setLevel(logging.ERROR)

fp = open('template/pdftest.pdf', 'rb')
# For an online file
# fp = urlopen('http://---/---.pdf')

# Create a parser bound to the file
parser = PDFParser(fp)

# Create the PDF document object, which stores the document structure
doc = PDFDocument()

# Connect the parser and the document object
parser.set_document(doc)
doc.set_parser(parser)

# Initialize the document; the argument is the password (empty if there is none)
doc.initialize("")

# Create the PDF resource manager
resource = PDFResourceManager()

# Layout analysis parameters
laparam = LAParams()

# Aggregator device that collects each page's layout
device = PDFPageAggregator(resource, laparams=laparam)

# Create the page interpreter
interpreter = PDFPageInterpreter(resource, device)

# Read the content page by page through the document object
for page in doc.get_pages():
    # Process the page with the interpreter
    interpreter.process_page(page)

    # Fetch the page layout from the aggregator
    layout = device.get_result()

    for text_obj in layout:
        # Only objects with a get_text method carry text
        if hasattr(text_obj, 'get_text'):
            print(text_obj.get_text())

from pdfminer.layout import LTTextBox, LTImage, LTFigure

# Process every page in the document and tell the layout objects apart by type
for page in doc.get_pages():
    interpreter.process_page(page)
    layout = device.get_result()
    for x in layout:
        # Text object
        if isinstance(x, LTTextBox):
            print(x.get_text().strip())
        # Image object
        if isinstance(x, LTImage):
            print('Found an image')
        # Figure object
        if isinstance(x, LTFigure):
            print('Found a figure object')
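
Both loops look only at the top level of each page layout, but a figure is itself a container and may hold text of its own. A recursive walk picks that up. This is a duck-typed sketch that assumes text-bearing objects expose get_text and containers are iterable, as in the loops above; iter_text is a hypothetical helper, not a pdfminer function.

```python
def iter_text(node):
    """Yield text from node, descending into containers that have no text of their own."""
    if hasattr(node, "get_text"):
        yield node.get_text()
        return
    try:
        children = iter(node)
    except TypeError:
        return  # a leaf without text, such as an image or a line
    for child in children:
        yield from iter_text(child)


# Usage inside the page loop above:
# for text in iter_text(layout):
#     print(text.strip())
```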

For more detailed usage, see the project page: https://github.com/canserhat77/pdfminer3k

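Summary

With the two Python libraries above you can edit a PDF at every level, from whole pages down to text and metadata. This article only introduces the basic usage of each feature. For detailed usage and the full function list, read the official documentation or the project source code on GitHub.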

Editor: 張燕妮. Source: 今日頭條