For example, I want to get the links to all the images in the forum thread http://www.xossip.com/showthread.php?t=1384077.
When I inspect an image (the large images in the forum posts), they have something like this in common: <img src="http://pzy.be/i/5/17889.jpg" border="0" alt="">
The program should list all the URLs of the required images, and if possible even download them.
I tried some code but got stuck.
import requests
from bs4 import BeautifulSoup

def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.xossip.com/showthread.php?t=1384077&page=' + str(page)
        sourcecode = requests.get(url)
        plaintext = sourcecode.text
        soup = BeautifulSoup(plaintext)
        for link in soup.findAll('img src'):
            print(link)
        page += 1

spider(1)
Edit:
I want the images in the forum posts, but I want to avoid all the small thumbnails, logos, icons, etc. I observed that all the images I need have this format: <img src="http://pzy.be/i/5/17889.jpg" border="0" alt="">
So I need all the links to images in the above format. The program should go through all the pages of the thread, filter the images by src, border="0" and alt="", and finally print all the image URLs, like pzy.be/i/5/452334.jpg.
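The attribute filter described in the edit can be expressed directly with BeautifulSoup's find_all, which accepts an attrs dict. A minimal sketch on a hard-coded snippet (assuming bs4 is installed; the sample HTML is illustrative):

```python
from bs4 import BeautifulSoup

html = '''
<img src="http://pzy.be/i/5/17889.jpg" border="0" alt="">
<img src="http://example.com/logo.png" class="logo">
'''

soup = BeautifulSoup(html, 'html.parser')
# Keep only <img> tags that carry border="0" and an empty alt,
# which is the pattern the full-size forum images share.
links = [tag['src'] for tag in soup.find_all('img', attrs={'border': '0', 'alt': ''})]
print(links)  # ['http://pzy.be/i/5/17889.jpg']
```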
Answer (score: 1)
Try using tag.get('src') instead of soup.findAll('img src'):
import requests
from bs4 import BeautifulSoup

def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.xossip.com/showthread.php?t=1384077&page=' + str(page)
        sourcecode = requests.get(url)
        plaintext = sourcecode.text
        soup = BeautifulSoup(plaintext, 'html.parser')  # name the parser explicitly to avoid a warning
        for tag in soup.findAll('img'):
            print(tag.get('src'))  # use `tag.get('src')` in this case
        page += 1

spider(1)
See the documentation for more details.
If you need to download them, you can also use requests to fetch the content of each image and write it to a file. Here is a demo:
import requests
from bs4 import BeautifulSoup

def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.xossip.com/showthread.php?t=1384077&page=' + str(page)
        sourcecode = requests.get(url)
        plaintext = sourcecode.text
        soup = BeautifulSoup(plaintext, 'html.parser')
        for tag in soup.findAll('img'):
            link = tag.get('src')  # get the link
            # Skip tags that don't match the expected format
            # (<img src="..." border="0" alt="">)
            if not link or tag.get('border') != '0' or tag.get('alt') != '':
                continue
            filename = link.strip('/').rsplit('/', 1)[-1]  # derive the file name from the URL
            image = requests.get(link).content  # use requests to get the content of the image
            with open(filename, 'wb') as f:
                f.write(image)  # write the image into a file
        page += 1

spider(1)
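A note on the file-name step: splitting on '/' works for simple URLs like the ones above, but a slightly more robust sketch takes only the path component, so a query string such as ?v=2 does not leak into the saved file name (standard library only; the helper name is illustrative):

```python
import os
from urllib.parse import urlparse

def filename_from_url(link):
    # urlparse separates the path from any query string or fragment,
    # so basename sees only '/i/5/452334.jpg'.
    return os.path.basename(urlparse(link).path)

print(filename_from_url('http://pzy.be/i/5/452334.jpg?v=2'))  # 452334.jpg
```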