如何在Python中从PDF中提取图像？代码实现示例

2021年11月17日03:51:36 发表评论 1,093 次浏览

本文带你了解如何使用 PyMuPDF 和 Pillow 库在 Python 中从 PDF 文件中提取和保存图像。

Python如何从PDF中提取图像？在本教程中，我们将编写一个 Python 代码，使用PyMuPDF和Pillow库从 PDF 文件中提取图像并将它们保存在本地磁盘上。

Python从PDF中提取图像代码示例：使用 PyMuPDF，你可以访问 PDF、XPS、OpenXPS、epub 和许多其他扩展。它应该可以在所有平台上运行，包括 Windows、Mac OSX 和 Linux。

让我们和 Pillow 一起安装它：

pip3 install PyMuPDF Pillow

打开一个新的 Python 文件，让我们开始吧。首先，让我们导入库：

import fitz # PyMuPDF
import io
from PIL import Image

我将用这个 PDF 文件来测试这个，但是你可以随意携带 PDF 文件并将它放在你当前的工作目录中，让我们将它加载到库中：

# file path you want to extract images from
file = "1710.05006.pdf"
# open the file
pdf_file = fitz.open(file)

由于我们想从所有页面中提取图像，我们需要遍历所有可用页面并获取每个页面上的所有图像对象，以下Python从PDF中提取图像代码示例执行此操作：

# iterate over PDF pages
for page_index in range(len(pdf_file)):
    # get the page itself
    page = pdf_file[page_index]
    image_list = page.getImageList()
    # printing number of images found in this page
    if image_list:
        print(f"[+] Found a total of {len(image_list)} images in page {page_index}")
    else:
        print("[!] No images found on page", page_index)
    for image_index, img in enumerate(page.getImageList(), start=1):
        # get the XREF of the image
        xref = img[0]
        # extract the image bytes
        base_image = pdf_file.extractImage(xref)
        image_bytes = base_image["image"]
        # get the image extension
        image_ext = base_image["ext"]
        # load it to PIL
        image = Image.open(io.BytesIO(image_bytes))
        # save it to local disk
        image.save(open(f"image{page_index+1}_{image_index}.{image_ext}", "wb"))

相关： 如何在 Python 中将 PDF 转换为图像。

Python如何从PDF中提取图像？我们使用getImageList()方法将所有可用的图像对象作为该特定页面上的元组列表列出。要获取图像对象索引，我们只需获取返回的元组的第一个元素。

之后，我们使用extractImage()以字节为单位返回图像以及图像扩展名等附加信息的方法。

最后，我们将图像字节转换为 PIL 图像实例，并使用save()接受文件指针作为参数的方法将其保存到本地磁盘，我们只是使用相应的页面和图像索引命名图像。

运行脚本后，我得到以下输出：

[!] No images found on page 0
[+] Found a total of 3 images in page 1
[+] Found a total of 3 images in page 2
[!] No images found on page 3
[!] No images found on page 4

图像也保存在当前目录中：

结论

好的，我们已经成功地从该 PDF 文件中提取了图像，而没有损失图像质量。有关库如何工作的更多信息，我建议你查看文档。

(adsbygoogle = window.adsbygoogle || []).push({}); 结论

发表评论取消回复

登录 注册 找回密码

结论

登录注册找回密码