Need to remove duplicates from web crawl output

Date: 2015-11-17 11:26:52

Tags: python parsing selenium web-crawler

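I'm new to BeautifulSoup in Python, and I'm trying to extract certain information from a website: the deep link, the header, and the price.

The scraper works fine, except that it returns duplicates that I want to remove from the output.

Example output: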

 Header: Splendid Imlil: Mount Toubkal Day Trip from Marrakech | Price:  83 |  Deeplink: http://www.isa.com/marrakech-l208/splendid-imlil-mount-toubkal-t41589/ 
 Header: Morocco - The Imperial Cities 7-Day Tour | Price:  653 | Deeplink: http://www.isa.com/fuengirola-l1160/morocco-the-imperial-cities-7-day-tour-t15167/ 
 Header: Ourika Valley Full-Day Private Tour | Price:  27 | Deeplink: http://www.isa.com/marrakech-l208/ourika-valley-full-day-private-tour-lunch-t19152/ 
 Header: Sunday market Had Draa & Oasis of Ain el Hajar | Price:  39 | Deeplink: http://www.isa.com/essaouira-l877/sunday-market-had-draa-oasis-of-ain-el-hajar-t51987/ 
 Header: Marrakech: 4-Day Long Weekend Tour | Price:  646 | Deeplink: http://www.isa.com/marrakech-l208/long-weekend-tour-in-marrakech-t54831/ 
 Header: From Agadir: Marrakech Excursion Full-Day Trip | Price:  113 | Deeplink: http://www.isa.com/agadir-l1413/marrakech-express-bus-and-walking-tour-from-agadir-t28772/ 
 Header: Sahara Desert 4-Day New Years Eve Tour from Marrakech | Price:  422 | Deeplink: http://www.isa.com/marrakech-l208/sahara-desert-4-day-new-years-eve-tour-from-marrakech-t24757/ 

 Header: Essaouira: VIP Gnawa Music Experience Festival Tour | Price:  122 | Deeplink: http://www.isa.com/essaouira-l877/essaouira-vip-gnawa-music-experience-festival-tour-t50983/ 
 Header: Marrakech: Full Day Private Tour | Price:  235 | Deeplink: http://www.isa.com/marrakech-l208/marrakech-full-day-private-tour-t56646/ 
 Header: Marrakech Palmeraie 3-Hour Bike Tour | Price:  79 | Deeplink: http://www.isa.com/marrakech-l208/marrakech-palmeraie-bike-tour-t53282/ 
 Header: Marrakech: 4-Day Desert Safari and Overnight Camp | Price:  484 | Deeplink: http://www.isa.com/marrakech-l208/desert-tour-from-marrakech-t54706/ 
Header: Private Transfer between Marrakech Airport to Palmeraie | Price:  23 | Deeplink: http://www.isa.com/marrakech-l208/private-transfer-between-marrakech-airport-to-palmeraie-t55781/ 

Header: Splendid Imlil: Mount Toubkal Day Trip from Marrakech | Price:  83 | Deeplink: http://www.isa.com/marrakech-l208/splendid-imlil-mount-toubkal-t41589/ 
Header: Morocco - The Imperial Cities 7-Day Tour | Price:  653 | Deeplink: http://www.isa.com/fuengirola-l1160/morocco-the-imperial-cities-7-day-tour-t15167/ 
Header: Ourika Valley Full-Day Private Tour | Price:  27 | Deeplink: http://www.isa.com/marrakech-l208/ourika-valley-full-day-private-tour-lunch-t19152/ 
Header: Sunday market Had Draa & Oasis of Ain el Hajar | Price:  39 | Deeplink: http://www.isa.com/essaouira-l877/sunday-market-had-draa-oasis-of-ain-el-hajar-t51987/ 
Header: Marrakech: 4-Day Long Weekend Tour | Price:  646 | Deeplink: http://www.isa.com/marrakech-l208/long-weekend-tour-in-marrakech-t54831/ 
Header: From Agadir: Marrakech Excursion Full-Day Trip | Price:  113 | Deeplink: http://www.isa.com/agadir-l1413/marrakech-express-bus-and-walking-tour-from-agadir-t28772/ 
Header: Sahara Desert 4-Day New Years Eve Tour from Marrakech | Price:  422 | Deeplink: http://www.isa.com/marrakech-l208/sahara-desert-4-day-new-years-eve-tour-from-marrakech-t24757/ 

Header: Essaouira: VIP Gnawa Music Experience Festival Tour | Price:  122 | Deeplink: http://www.isa.com/essaouira-l877/essaouira-vip-gnawa-music-experience-festival-tour-t50983/ 

Header: Marrakech: Full Day Private Tour | Price:  235 | Deeplink: http://www.isa.com/marrakech-l208/marrakech-full-day-private-tour-t56646/ 
Header: Marrakech Palmeraie 3-Hour Bike Tour | Price:  79 | Deeplink: http://www.isa.com/marrakech-l208/marrakech-palmeraie-bike-tour-t53282/ 

I would like to eliminate these duplicates as part of scraping this content.

This is my logic so far:

    # each <article> card holds one activity: title, price and deep link
    hallo = soup.find_all("article", {"class": "activity-card activity-card-horizontal "})

    for item in hallo:
        headers = item.find_all("h3", {"class": "activity-card-title"})
        for header in headers:
            header_final = header.text.strip()
            #print(header_final)
        prices = item.find_all("span", {"class": "price"})
        for price in prices:
            # strip commas, then drop the first three characters of the price text
            price_final = price.text.strip().replace(",","")[3:]
            #print(price_final)
        deeplinks = item.find_all("a", {"class": "activity-card-link"})
        # this set is rebuilt for every card, so it only de-duplicates links within one card
        for t in set(t.get("href") for t in deeplinks):
            deeplink_final = t
            #print(deeplink_final)

        print("Header: " + header_final + " | " + "Price: " + str(price_final) + " | " + "Deeplink: " + deeplink_final)

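Can anyone give me feedback on how to remove the duplicates? Any feedback is appreciated. I tried to maintain a set of results, but apparently I made a mistake that I cannot figure out.

Edit

Following the feedback, I adjusted my code: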

    for item in hallo:
        headers = item.find_all("h3", {"class": "activity-card-title"})
        for header in headers:
            item = header.text.strip()
            if item not in already_printed:
                print(item)
                already_printed.add(item)

        prices = item.find_all("span", {"class": "price"})
        for price in prices:
            item2 = price.text.strip().replace(",","")[3:]
            if item2 not in already_printed:
                print(item2)
                already_printed.add(item2)

It works for the header items, but for the prices I get the following error message:

    File "C:/Users/hmattu/PycharmProjects/untitled1/Duplicates remove.py", line 52, in trade_spider
        prices = item.find_all("span", {"class": "price"})
    AttributeError: 'str' object has no attribute 'find_all'

What am I doing wrong? Thanks for any feedback.

1 Answer:

Answer 0 (score: 0)

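Instead of printing each item on every iteration, store the items in a dictionary first, using the header or the URL as the key. (You could also use a set().)

Once you have finished iterating over the hallo list, print the dictionary entries one by one.

That way, the dictionary/set keeps only one entry for each duplicated item.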
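The dictionary approach described above could be sketched roughly as follows. This is a minimal illustration, not the asker's actual scraper: the helper name `dedupe_by_deeplink` and the `records` list are assumptions standing in for the `header_final`, `price_final` and `deeplink_final` values collected inside the `for item in hallo:` loop, and it uses the deep link as the key on the assumption that the link identifies an activity.

```python
def dedupe_by_deeplink(records):
    """Keep the first (header, price) pair seen for each deep link.

    `records` is an iterable of (header, price, deeplink) tuples; the name
    and shape are hypothetical, chosen only for this sketch.
    """
    unique = {}  # plain dicts preserve insertion order in Python 3.7+
    for header, price, deeplink in records:
        # setdefault only stores the value if this deep link is not a key yet
        unique.setdefault(deeplink, (header, price))
    return unique


# Hypothetical stand-in for what the scraping loop would collect;
# the first and third entries are the same activity.
records = [
    ("Ourika Valley Full-Day Private Tour", "27",
     "http://www.isa.com/marrakech-l208/ourika-valley-full-day-private-tour-lunch-t19152/"),
    ("Marrakech: Full Day Private Tour", "235",
     "http://www.isa.com/marrakech-l208/marrakech-full-day-private-tour-t56646/"),
    ("Ourika Valley Full-Day Private Tour", "27",
     "http://www.isa.com/marrakech-l208/ourika-valley-full-day-private-tour-lunch-t19152/"),
]

# Print only after the whole list has been processed, one line per unique link
for deeplink, (header, price) in dedupe_by_deeplink(records).items():
    print("Header: " + header + " | Price: " + price + " | Deeplink: " + deeplink)
```

In the scraper itself, the loop body would append `(header_final, price_final, deeplink_final)` to a list instead of printing. Separately, the `AttributeError` in the edited code comes from `item = header.text.strip()`, which rebinds the loop variable `item` to a string; using a different name there (for example `title`) keeps `item` pointing at the card so that `item.find_all(...)` still works.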