Please bear with me. I'm new to Python - but having a lot of fun with it. I'm trying to write a web scraper that crawls the results of a travel website. I've managed to extract all the relevant links from the main page. Now I want Python to follow each link and gather the information from each of those pages. But I'm stuck. Hope you can give me a hint.
Here's my code:
import requests
from bs4 import BeautifulSoup
import urllib, collections

Spider = 1

def trade_spider(max_pages):
    RegionIDArray = {737: "London"}
    for reg in RegionIDArray:
        page = -1
        r = requests.get("https://www.viatorcom.de/London/d" + str(reg) + "&page=" + str(page), verify=False)
        soup = BeautifulSoup(r.content, "lxml")
        g_data = soup.find_all("h2", {"class": "mtm mbn card-title"})

        for item in g_data:
            Deeplink = item.find_all("a")
            for t in set(t.get("href") for t in Deeplink):
                Deeplink_final = t
                print(Deeplink_final)  # The output shows all the links that I would like to follow and gather information from.

trade_spider(1)
Output:
/de/7132/London-attractions/Stonehenge/d737-a113
/de/7132/London-attractions/Tower-of-London/d737-a93
/de/7132/London-attractions/London-Eye/d737-a1400
/de/7132/London-attractions/Thames-River/d737-a1410
The output shows all the links I'd like to follow and gather information from.
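Since these hrefs are relative paths, each one needs to be joined with the base URL before it can be requested. A minimal sketch of that step, assuming Python 3 and the https://www.viatorcom.de base from the request above (urljoin comes from the standard library):

from urllib.parse import urljoin

base_url = "https://www.viatorcom.de"  # base host taken from the request above
relative_link = "/de/7132/London-attractions/Stonehenge/d737-a113"

# urljoin handles the leading slash correctly, unlike naive string concatenation
full_url = urljoin(base_url, relative_link)
print(full_url)  # https://www.viatorcom.de/de/7132/London-attractions/Stonehenge/d737-a113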
The next step in my code:
import requests
from bs4 import BeautifulSoup
import urllib, collections

Spider = 1

def trade_spider(max_pages):
    RegionIDArray = {737: "London"}
    for reg in RegionIDArray:
        page = -1
        r = requests.get("https://www.viatorcom.de/London/d" + str(reg) + "&page=" + str(page), verify=False)
        soup = BeautifulSoup(r.content, "lxml")
        g_data = soup.find_all("h2", {"class": "mtm mbn card-title"})

        for item in g_data:
            Deeplink = item.find_all("a")
            for t in set(t.get("href") for t in Deeplink):
                Deeplink_final = t

trade_spider(1)

def trade_spider2(max_pages):
    r = requests.get("https://www.viatorcom.de" + Deeplink_final, verify=False)
    soup = BeautifulSoup(r.content, "lxml")
    print(soup)

trade_spider2(9)
I want to feed the output of the initial crawl into my second request, but this doesn't work. Hope you can give me a hint.
Answer (score: 1)
This should help. Deeplink_final only exists as a local variable inside trade_spider, so the standalone trade_spider2 never sees it. Instead, pass each link into trade_spider2 as a parameter and call it from inside the loop:
import requests
from bs4 import BeautifulSoup
import urllib, collections

Spider = 1

def trade_spider2(Deeplink_final):
    # Follow a single deep link and fetch its page
    r = requests.get("https://www.viatorcom.de" + Deeplink_final, verify=False)
    soup = BeautifulSoup(r.content, "lxml")
    print(soup)

def trade_spider(max_pages):
    RegionIDArray = {737: "London"}
    for reg in RegionIDArray:
        page = -1
        r = requests.get("https://www.viatorcom.de/London/d" + str(reg) + "&page=" + str(page), verify=False)
        soup = BeautifulSoup(r.content, "lxml")
        g_data = soup.find_all("h2", {"class": "mtm mbn card-title"})

        for item in g_data:
            Deeplink = item.find_all("a")
            # Hand each collected href straight to trade_spider2
            for Deeplink_final in set(t.get("href") for t in Deeplink):
                trade_spider2(Deeplink_final)

trade_spider(1)
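Printing the whole soup dumps each page's full HTML. The usual next step is to extract specific elements from each deep page instead. A minimal sketch of that idea, assuming each attraction page carries its title in an h1 tag (the selector is a guess and may need adjusting to the site's real markup):

import requests
from bs4 import BeautifulSoup

def trade_spider2(Deeplink_final):
    r = requests.get("https://www.viatorcom.de" + Deeplink_final, verify=False)
    soup = BeautifulSoup(r.content, "lxml")

    # Hypothetical selector: the real page structure may differ
    title = soup.find("h1")
    if title is not None:
        print(title.get_text(strip=True))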