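I'm new to BeautifulSoup in Python, and I'm trying to extract certain information from a website: the deeplink, the header, and the price.
The scraper works fine, except that its output contains duplicates that I want to remove.
Example output: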
Header: Splendid Imlil: Mount Toubkal Day Trip from Marrakech | Price: 83 | Deeplink: http://www.isa.com/marrakech-l208/splendid-imlil-mount-toubkal-t41589/
Header: Morocco - The Imperial Cities 7-Day Tour | Price: 653 | Deeplink: http://www.isa.com/fuengirola-l1160/morocco-the-imperial-cities-7-day-tour-t15167/
Header: Ourika Valley Full-Day Private Tour | Price: 27 | Deeplink: http://www.isa.com/marrakech-l208/ourika-valley-full-day-private-tour-lunch-t19152/
Header: Sunday market Had Draa & Oasis of Ain el Hajar | Price: 39 | Deeplink: http://www.isa.com/essaouira-l877/sunday-market-had-draa-oasis-of-ain-el-hajar-t51987/
Header: Marrakech: 4-Day Long Weekend Tour | Price: 646 | Deeplink: http://www.isa.com/marrakech-l208/long-weekend-tour-in-marrakech-t54831/
Header: From Agadir: Marrakech Excursion Full-Day Trip | Price: 113 | Deeplink: http://www.isa.com/agadir-l1413/marrakech-express-bus-and-walking-tour-from-agadir-t28772/
Header: Sahara Desert 4-Day New Years Eve Tour from Marrakech | Price: 422 | Deeplink: http://www.isa.com/marrakech-l208/sahara-desert-4-day-new-years-eve-tour-from-marrakech-t24757/
Header: Essaouira: VIP Gnawa Music Experience Festival Tour | Price: 122 | Deeplink: http://www.isa.com/essaouira-l877/essaouira-vip-gnawa-music-experience-festival-tour-t50983/
Header: Marrakech: Full Day Private Tour | Price: 235 | Deeplink: http://www.isa.com/marrakech-l208/marrakech-full-day-private-tour-t56646/
Header: Marrakech Palmeraie 3-Hour Bike Tour | Price: 79 | Deeplink: http://www.isa.com/marrakech-l208/marrakech-palmeraie-bike-tour-t53282/
Header: Marrakech: 4-Day Desert Safari and Overnight Camp | Price: 484 | Deeplink: http://www.isa.com/marrakech-l208/desert-tour-from-marrakech-t54706/
Header: Private Transfer between Marrakech Airport to Palmeraie | Price: 23 | Deeplink: http://www.isa.com/marrakech-l208/private-transfer-between-marrakech-airport-to-palmeraie-t55781/
Header: Splendid Imlil: Mount Toubkal Day Trip from Marrakech | Price: 83 | Deeplink: http://www.isa.com/marrakech-l208/splendid-imlil-mount-toubkal-t41589/
Header: Morocco - The Imperial Cities 7-Day Tour | Price: 653 | Deeplink: http://www.isa.com/fuengirola-l1160/morocco-the-imperial-cities-7-day-tour-t15167/
Header: Ourika Valley Full-Day Private Tour | Price: 27 | Deeplink: http://www.isa.com/marrakech-l208/ourika-valley-full-day-private-tour-lunch-t19152/
Header: Sunday market Had Draa & Oasis of Ain el Hajar | Price: 39 | Deeplink: http://www.isa.com/essaouira-l877/sunday-market-had-draa-oasis-of-ain-el-hajar-t51987/
Header: Marrakech: 4-Day Long Weekend Tour | Price: 646 | Deeplink: http://www.isa.com/marrakech-l208/long-weekend-tour-in-marrakech-t54831/
Header: From Agadir: Marrakech Excursion Full-Day Trip | Price: 113 | Deeplink: http://www.isa.com/agadir-l1413/marrakech-express-bus-and-walking-tour-from-agadir-t28772/
Header: Sahara Desert 4-Day New Years Eve Tour from Marrakech | Price: 422 | Deeplink: http://www.isa.com/marrakech-l208/sahara-desert-4-day-new-years-eve-tour-from-marrakech-t24757/
Header: Essaouira: VIP Gnawa Music Experience Festival Tour | Price: 122 | Deeplink: http://www.isa.com/essaouira-l877/essaouira-vip-gnawa-music-experience-festival-tour-t50983/
Header: Marrakech: Full Day Private Tour | Price: 235 | Deeplink: http://www.isa.com/marrakech-l208/marrakech-full-day-private-tour-t56646/
Header: Marrakech Palmeraie 3-Hour Bike Tour | Price: 79 | Deeplink: http://www.isa.com/marrakech-l208/marrakech-palmeraie-bike-tour-t53282/
I'd like to eliminate the duplicates before scraping this content.
This is my logic so far:
hallo = soup.find_all("article", {"class": "activity-card activity-card-horizontal "})
for item in hallo:
    # Title of the activity card
    headers = item.find_all("h3", {"class": "activity-card-title"})
    for header in headers:
        header_final = header.text.strip()
        #print(header_final)
    # Price: drop thousands separators and the first three characters
    prices = item.find_all("span", {"class": "price"})
    for price in prices:
        price_final = price.text.strip().replace(",","")[3:]
        #print(price_final)
    # Link to the activity page
    deeplinks = item.find_all("a", {"class": "activity-card-link"})
    for t in set(t.get("href") for t in deeplinks):
        deeplink_final = t
        #print(deeplink_final)
    print("Header: " + header_final + " | " + "Price: " + str(price_final) + " | " + "Deeplink: " + deeplink_final)
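Can anyone give me feedback on how to remove the duplicates? Any feedback is appreciated. I tried to maintain a set of results, but apparently I made a mistake that I can't figure out.
Edit
I adjusted my code based on the feedback: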
for item in hallo:
    headers = item.find_all("h3", {"class": "activity-card-title"})
    for header in headers:
        item = header.text.strip()
        if item not in already_printed:
            print(item)
            already_printed.add(item)
    prices = item.find_all("span", {"class": "price"})
    for price in prices:
        item2 = price.text.strip().replace(",","")[3:]
        if item2 not in already_printed:
            print(item2)
            already_printed.add(item2)
It works for the header items, but for the prices I get the following error message:
File "C:/Users/hmattu/PycharmProjects/untitled1/Duplicates remove.py", line 52, in trade_spider
prices = item.find_all("span", {"class": "price"})
AttributeError: 'str' object has no attribute 'find_all'
What am I doing wrong? Thanks for any feedback.
Answer 0 (score: 0)
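Instead of printing each item on every iteration, store the items in a dictionary first, using the header or the url as the key. (You could also use a set().)
Once you have finished iterating over the hallo list, print the dictionary entries one by one.
This way, the dictionary/set keeps only one entry for each duplicate.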
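As a rough sketch of the dictionary approach described above: the helper name dedupe_by_url and the records list are made up for illustration, and the sample rows are copied from the output in the question in place of a live scrape. It assumes the deeplink identifies a listing uniquely.

```python
def dedupe_by_url(records):
    """Keep the first record seen for each deeplink, preserving order."""
    unique = {}
    for header, price, deeplink in records:
        # setdefault stores the record only if this URL has not been seen yet
        unique.setdefault(deeplink, (header, price))
    return unique


# Sample rows standing in for what the scraping loop would collect
# (hypothetical list; the third row repeats the first one)
records = [
    ("Ourika Valley Full-Day Private Tour", "27",
     "http://www.isa.com/marrakech-l208/ourika-valley-full-day-private-tour-lunch-t19152/"),
    ("Marrakech: Full Day Private Tour", "235",
     "http://www.isa.com/marrakech-l208/marrakech-full-day-private-tour-t56646/"),
    ("Ourika Valley Full-Day Private Tour", "27",
     "http://www.isa.com/marrakech-l208/ourika-valley-full-day-private-tour-lunch-t19152/"),
]

# Print once, after all records have been collected
for deeplink, (header, price) in dedupe_by_url(records).items():
    print("Header: " + header + " | Price: " + price + " | Deeplink: " + deeplink)
```

In the loop from the question, you would presumably append (header_final, price_final, deeplink_final) to records instead of printing inside the loop. Separately, the AttributeError in the edited code comes from item = header.text.strip(), which rebinds the loop variable item to a string, so the later item.find_all(...) is called on a str. Using a different name for the stripped text avoids that.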