How to Download All Images from a Web Page in Python (with Code Examples)

November 16, 2021, 18:59:30

This article shows how to use the requests and BeautifulSoup libraries to extract and download the images from a single web page in Python.

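Have you ever wanted to download every image on a web page? In this tutorial, you will build a Python scraper that retrieves all the images from the page at a given URL and downloads them using the requests and BeautifulSoup libraries.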

First, we need a few dependencies, so let's install them:

pip3 install requests bs4 tqdm

Open a new Python file and import the necessary modules:

import requests
import os
from tqdm import tqdm
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin, urlparse

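First, let's create a URL validator to make sure that the URLs we collect are valid. Some websites put encoded data (for example, inline data: URIs) where a URL would normally go, so we need to skip those: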

def is_valid(url):
    """
    Checks whether `url` is a valid URL.
    """
    parsed = urlparse(url)
    return bool(parsed.netloc) and bool(parsed.scheme)

The urlparse() function parses a URL into six components. We only need to check that netloc (the domain name) and scheme (the protocol) are both present.
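As a quick check of what the validator accepts and rejects, the sketch below runs is_valid (repeated here so the snippet runs on its own) against a few sample strings. The sample URLs are made up for illustration and do not come from any real page.

```python
from urllib.parse import urlparse

def is_valid(url):
    """Checks whether `url` has both a scheme and a network location."""
    parsed = urlparse(url)
    return bool(parsed.netloc) and bool(parsed.scheme)

# Illustrative inputs only (assumed for this example)
samples = [
    "https://example.com/logo.png",         # absolute URL: scheme and netloc present
    "/static/logo.png",                     # relative path: no scheme, no netloc
    "data:image/png;base64,iVBORw0KGgo=",   # inline data URI: scheme but no netloc
]

results = {s: is_valid(s) for s in samples}
for s, ok in results.items():
    print(f"{ok!s:5}  {s}")
```

Only the first sample passes, which is why inline data URIs end up being skipped by the scraper.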

Next, I will write the core function that collects all the image URLs on a web page:

def get_all_images(url):
    """
    Returns all image URLs on a single `url`
    """
    soup = bs(requests.get(url).content, "html.parser")

The HTML content of the page is now in the soup object. To extract all the img tags from the HTML, we use the soup.find_all("img") method. Let's see it in action:

    urls = []
    for img in tqdm(soup.find_all("img"), "Extracting images"):
        img_url = img.attrs.get("src")
        if not img_url:
            # if img does not contain src attribute, just skip
            continue

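This retrieves all the img elements as a Python list.

I wrapped it in a tqdm object just to print a progress bar. The URL of an img tag is stored in its src attribute. However, some tags do not have a src attribute, and we skip those with the continue statement above.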

Now we need to make sure the URL is absolute:

        # make the URL absolute by joining domain with the URL that is just extracted
        img_url = urljoin(url, img_url)
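
To make the effect of urljoin concrete, here is a small standalone sketch showing how it resolves the kinds of src values commonly found in HTML. The page URL and image paths are hypothetical and chosen only for illustration.

```python
from urllib.parse import urljoin

page_url = "https://example.com/gallery/index.html"  # hypothetical page URL

# relative to the page's own directory
relative = urljoin(page_url, "img/cat.png")
# root-relative: replaces the whole path of the page URL
root_relative = urljoin(page_url, "/static/dog.png")
# protocol-relative: inherits the scheme (https) of the page URL
protocol_relative = urljoin(page_url, "//cdn.example.com/bird.png")
# already absolute: returned unchanged
absolute = urljoin(page_url, "https://other.example.org/fish.png")

for result in (relative, root_relative, protocol_relative, absolute):
    print(result)
```

In every case the result is a full URL that can be passed to requests directly.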

Some URLs contain HTTP GET key-value pairs that we don't want (they end with something like "/image.png?c=3.2.5"), so let's remove them:

        try:
            pos = img_url.index("?")
            img_url = img_url[:pos]
        except ValueError:
            pass

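We get the position of the '?' character and then remove everything after it. If there is no '?', index() raises a ValueError, which is why I wrapped it in a try/except block (of course, you can implement this in a better way, and if you do, please share it with us in the comments below).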
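One possible alternative, offered as a sketch rather than as the author's method: str.partition gives the same result as the try/except version without raising, and urllib.parse.urlsplit removes the query in a structure-aware way. The helper names strip_query_simple and strip_query are hypothetical. Note that strip_query also drops any #fragment, which differs slightly from the original code when a URL has a fragment but no query.

```python
from urllib.parse import urlsplit

def strip_query_simple(url):
    """Hypothetical helper: keep everything before the first '?' (same result as the try/except)."""
    return url.partition("?")[0]

def strip_query(url):
    """Hypothetical helper: drop the query string and the fragment using the URL's parsed parts."""
    return urlsplit(url)._replace(query="", fragment="").geturl()

with_query = "https://example.com/image.png?c=3.2.5"   # illustrative URL
without_query = "https://example.com/image.png"        # illustrative URL

print(strip_query_simple(with_query))
print(strip_query(with_query))
print(strip_query(without_query))
```

Either helper could replace the try/except block inside get_all_images with a single line.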

Now let's make sure that every URL is valid and return all the image URLs:

        # finally, if the url is valid
        if is_valid(img_url):
            urls.append(img_url)
    return urls

Now that we have a function that grabs all the image URLs, we need a function that downloads files from the web with Python. I borrowed the following function from a separate tutorial:

def download(url, pathname):
    """
    Downloads a file given an URL and puts it in the folder `pathname`
    """
    # create the target folder if it doesn't exist yet
    os.makedirs(pathname, exist_ok=True)
    # download the body of response by chunk, not immediately
    response = requests.get(url, stream=True)
    # get the total file size
    file_size = int(response.headers.get("Content-Length", 0))
    # get the file name
    filename = os.path.join(pathname, url.split("/")[-1])
    # progress bar, changing the unit to bytes instead of iteration (default by tqdm)
    progress = tqdm(response.iter_content(1024), f"Downloading {filename}", total=file_size, unit="B", unit_scale=True, unit_divisor=1024)
    with open(filename, "wb") as f:
        for data in progress.iterable:
            # write data read to the file
            f.write(data)
            # update the progress bar manually
            progress.update(len(data))

The function above takes the url of the file to download and the pathname of the folder in which to save that file.
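One detail worth noting: download() names the local file after the last "/"-separated piece of the URL, and that piece is empty when a URL ends with a slash, in which case the open() call would point at the folder itself and fail. The sketch below shows one way to derive a file name more defensively. The helper filename_from_url and its default value are hypothetical and not part of the original script.

```python
import os
from urllib.parse import urlsplit, unquote

def filename_from_url(url, default="image"):
    """Hypothetical helper: derive a local file name from the path part of `url`."""
    path = urlsplit(url).path                # ignores any ?query or #fragment
    name = os.path.basename(unquote(path))   # last path segment, percent-decoded
    return name or default                   # fall back when the URL ends with "/"

# Illustrative URLs only
for u in (
    "https://example.com/img/cat.png?c=3",
    "https://example.com/img/",
    "https://example.com/my%20photo.jpg",
):
    print(filename_from_url(u))
```

Inside download(), the result could be joined with pathname in place of url.split("/")[-1].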


Finally, here is the main function:

def main(url, path):
    # get all images
    imgs = get_all_images(url)
    for img in imgs:
        # for each image, download it
        download(img, path)

It gets all the image URLs from the page and downloads them one by one. Let's test it:

main("https://yandex.com/images/", "yandex-images")

This downloads all the images from that URL and stores them in the folder "yandex-images", which is created automatically.

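Note, however, that some websites load their data with JavaScript. In that case, you should use the requests_html library instead. I have written another script that makes a few adjustments to the original one and handles JavaScript rendering.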

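Alright, we're done! Here are some ideas you can implement to extend the code:

Extract all the links on a web page and download all the images on each linked page.
Download every PDF file on a given website.
Use multithreading to speed up the downloads (since this is a heavily IO-bound task).
Use proxies to prevent certain websites from blocking your IP address.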
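
As a starting point for the multithreading idea, here is a minimal sketch built on the standard library's ThreadPoolExecutor. The helper download_all, its download_fn parameter, and the max_workers default are assumptions made for this example. It takes the download function as an argument so that it can be tried without touching the network.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def download_all(urls, pathname, download_fn, max_workers=8):
    """
    Hypothetical helper: call `download_fn(url, pathname)` for every URL
    using a pool of worker threads. Returns (succeeded, failed) lists.
    """
    succeeded, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map each submitted future back to the URL it is downloading
        futures = {pool.submit(download_fn, url, pathname): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                future.result()          # re-raises any exception from the worker
                succeeded.append(url)
            except Exception as exc:     # keep going if a single download fails
                failed.append((url, exc))
    return succeeded, failed
```

With the functions from this tutorial, a call might look like download_all(get_all_images(url), path, download). Be aware that the per-file tqdm bars printed by download() will interleave when several threads run at once, so you may prefer a single overall progress bar in that case.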
