BeautifulSoup and if/else statements

Time: 2017-12-16 05:52:13

Tags: python beautifulsoup

I am learning how to use BeautifulSoup, and the loop I wrote prints each URL twice.

Any insight would be much appreciated!

count = 0

for link in soup.find_all('a'):
    #if contains cointelegraph/news/
    #if ('https://cointelegraph.com/news/' in link.get('href')):
    url = link.get('href')                          #local var store url
    if '/news/' in url:
        print(url)
        print(count)
        count += 1

    if count == 5:
        break

Example output:

https://cointelegraph.com/news/woman-in-denmark-imprisoned-for-hiring-hitman-using-bitcoin
0
https://cointelegraph.com/news/ethereum-price-hits-all-time-high-of-750-following-speed-boost
1
https://cointelegraph.com/news/ethereum-price-hits-all-time-high-of-750-following-speed-boost
2
https://cointelegraph.com/news/senior-vp-says-ebay-seriously-considering-bitcoin-integration
3
https://cointelegraph.com/news/senior-vp-says-ebay-seriously-considering-bitcoin-integration
4

For some reason, my code keeps printing the same URL twice...

1 answer:

Answer 0: (score: 0)

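Based on your code and the links you provided, the results of BeautifulSoup's find_all search appear to contain duplicates. You would need to inspect the HTML structure to see why duplicates are returned (the find_all search options in the documentation may let you filter some of them out). If you just want a quick fix that removes duplicates from the printed results, you can use a modified loop that tracks the entries it has already seen in a set. First, confirm that duplicates are present: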

In [78]: l = [link.get('href') for link in soup.find_all('a') if '/news/' in link.get('href')]

In [79]: any(l.count(x) > 1 for x in l)                                                                                                              
Out[79]: True

The output above shows that the list contains duplicates. To remove them, use something like this:

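seen = set()  # hrefs that have already been printed

for link in soup.find_all('a'):
    lhref = link.get('href')
    # Skip anchors with no href, then print each matching URL only once.
    if lhref and '/news/' in lhref and lhref not in seen:
        print(lhref)
        seen.add(lhref)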
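
The loop above removes the duplicates but no longer stops after five links, as the loop in the question did. A minimal sketch that combines both behaviours is shown below. The helper name `unique_news_links`, its parameters, and the `example.com` URLs are hypothetical and used only for illustration; the sample list stands in for `[link.get('href') for link in soup.find_all('a')]`, so the sketch runs without fetching a page.

```python
def unique_news_links(hrefs, marker='/news/', limit=5):
    """Return up to `limit` distinct hrefs containing `marker`, in first-seen order.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    seen = set()
    result = []
    for href in hrefs:
        # Stop as soon as enough distinct links have been collected.
        if len(result) >= limit:
            break
        # link.get('href') returns None for anchors without an href; skip those.
        if not href or marker not in href:
            continue
        # Skip URLs that were already collected.
        if href in seen:
            continue
        seen.add(href)
        result.append(href)
    return result


# Made-up stand-in for [link.get('href') for link in soup.find_all('a')]
sample_hrefs = [
    None,
    'https://example.com/news/a',
    'https://example.com/news/b',
    'https://example.com/news/b',
    'https://example.com/tags/bitcoin',
    'https://example.com/news/c',
]

for count, url in enumerate(unique_news_links(sample_hrefs)):
    print(count, url)
```

Counting the collected results, rather than every matching anchor, means the limit applies to distinct URLs, so a repeated link does not use up one of the five slots.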