How to use Beautiful Soup

Date: 2015-12-25 09:36:29

Tags: python web-scraping beautifulsoup web-crawler

For example, I want to get the links to all of the images in the forum thread http://www.xossip.com/showthread.php?t=1384077.

When I inspect the images (the large images in the forum posts), they share a common format like <img src="http://pzy.be/i/5/17889.jpg" border="0" alt="">.

The program should list all of the URLs of the desired images, and if possible even download them.

I tried some code but got stuck.

import requests
from bs4 import BeautifulSoup

def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.xossip.com/showthread.php?t=1384077&page=' + str(page)
        sourcecode= requests.get(url)
        plaintext = sourcecode.text
        soup = BeautifulSoup(plaintext)
        for link in soup.findAll('img src'):
            print (link)
        page += 1
spider(1)

Edit: I want the images in the forum posts, but I want to avoid all of the small thumbnails, logos, icons, and so on. I observed that every image I need has this format: <img src="http://pzy.be/i/5/17889.jpg" border="0" alt="">. So I need all of the links to images in that format; the program should go through all of the pages of the thread, filter the images by src, border="0", and alt, and finally print all of the image URLs, such as pzy.be/i/5/452334.jpg.

1 Answer:

Answer 0: (score: 1)

Try using tag.get('src') instead of soup.findAll('img src'). findAll('img src') looks for a tag literally named "img src", so it matches nothing; find the <img> tags first and then read each tag's src attribute:

import requests
from bs4 import BeautifulSoup

def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.xossip.com/showthread.php?t=1384077&page=' + str(page)
        sourcecode = requests.get(url)
        plaintext = sourcecode.text
        soup = BeautifulSoup(plaintext, 'html.parser')  # specify a parser to avoid a warning

        for tag in soup.findAll('img'):  # find every <img> tag
            print(tag.get('src'))        # use `tag.get('src')` to read the src attribute

        page += 1
spider(1)

See the documentation for more details.
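
For reference, here is a minimal, self-contained illustration (using made-up HTML, not the forum's actual markup) of why tag.get('src') is the safer way to read the attribute: it returns None when the attribute is missing, whereas indexing with tag['src'] raises a KeyError.

from bs4 import BeautifulSoup

html = '<img src="http://pzy.be/i/5/17889.jpg" border="0" alt=""> <img border="0" alt="">'
soup = BeautifulSoup(html, 'html.parser')

for tag in soup.find_all('img'):
    print(tag.get('src'))   # prints the URL for the first tag and None for the second
    # tag['src'] would raise KeyError on the second tag, which has no src attribute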

If you need to download them, you can also use requests to fetch the content of each image and write it to a file. Here is a demo:

import requests
from bs4 import BeautifulSoup

def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.xossip.com/showthread.php?t=1384077&page=' + str(page)
        sourcecode = requests.get(url)
        plaintext = sourcecode.text
        soup = BeautifulSoup(plaintext, 'html.parser')

        for tag in soup.findAll('img'):
            link = tag.get('src')  # get the link
            if not link:
                continue

            # Keep only tags in the expected format: border="0" and an empty alt
            if tag.get('border') != '0' or tag.get('alt') != '':
                continue

            filename = link.strip('/').rsplit('/', 1)[-1]  # take the last path segment as the file name

            image = requests.get(link).content  # use requests to get the content of the image
            with open(filename, 'wb') as f:
                f.write(image)  # write the image to a file

        page += 1
spider(1)
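
As a variation on the demo above (not part of the original answer), the same filtering can be expressed with a CSS selector that only matches <img> tags whose border is "0" and whose alt is empty; restricting the links to the pzy.be host is an extra assumption based on the example URLs in the question.

import requests
from bs4 import BeautifulSoup

def list_image_links(max_pages):
    for page in range(1, max_pages + 1):
        url = 'http://www.xossip.com/showthread.php?t=1384077&page=' + str(page)
        soup = BeautifulSoup(requests.get(url).text, 'html.parser')

        # CSS selector: <img> tags with border="0" and an empty alt attribute
        for tag in soup.select('img[border="0"][alt=""]'):
            link = tag.get('src', '')
            if link.startswith('http://pzy.be/'):  # assumption: only the pzy.be images are wanted
                print(link)

list_image_links(1)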