我试图抓取此论坛的图片(或图片链接)(http://www.xossip.com/showthread.php?t=1384077)。我尝试过美味的汤4,这是我试过的代码:
import requests
from bs4 import BeautifulSoup
def spider(max_pages):
page = 1
while page <= max_pages:
url = 'http://www.xossip.com/showthread.php?t=1384077&page=' + str(page)
sourcecode= requests.get(url)
plaintext = sourcecode.text
soup = BeautifulSoup(plaintext)
for link in soup.findAll('a',{'class': 'alt1'}):
src = link.get('src')
print(src)
page += 1
spider(1)
我应该如何纠正它,以便获得像pzy.be/example
?
答案 0 :(得分:0)
好的,所以我通过获取所有#post_message_*
div然后从每个div中获取图像来做到这一点。
import requests
from bs4 import BeautifulSoup
def spider(max_pages):
page = 1
while page <= max_pages:
url = 'http://www.xossip.com/showthread.php?t=1384077&page=' + str(page)
sourcecode= requests.get(url)
plaintext = sourcecode.text
soup = BeautifulSoup(plaintext)
divs = soup.findAll('div', id=lambda d: d and d.startswith('post_message_'))
for div in divs:
src = div.find('img')['src']
if src.startswith('http'): # b/c it could be a smilie or something like that
print(src)
page += 1
spider(1)
答案 1 :(得分:0)
最简单的方法是只请求每个页面并过滤img标记:
from bs4 import BeautifulSoup
from requests import get
import re
def get_wp():
start_url = "http://www.xossip.com/showthread.php?t=1384077&page={}"
for i in range(73):
r = get(start_url.format(i))
soup = BeautifulSoup(r.content)
for img in (i["src"] for i in soup.find_all("img", src=re.compile("http://pzy.be.*.jpg"))):
yield img