Question

This code gives me duplicate URLs. How can I filter them out?

sg = []
for url in soup.find_all('a', attrs={'href': re.compile("^https://www.somewebsite")}):
    print(url['href'])
    sg.append(url['href'])
print(sg)

Answer 1

You can check whether the URL is already in the list before appending it:

sg = []
for url in soup.find_all('a', attrs={'href': re.compile("^https://www.somewebsite")}):
    href = url['href']
    print(href)
    if href not in sg:
        sg.append(href)
print(sg)
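
Note that `href not in sg` scans the whole list on every iteration, which can get slow with many links. A minimal sketch of one way around that, assuming the hrefs have already been pulled out into a plain list: pair the list with a set that tracks what has been seen. The `unique_in_order` helper and the sample `hrefs` data are hypothetical and not part of the original post.

```python
def unique_in_order(hrefs):
    """Return hrefs with duplicates dropped, keeping first-seen order."""
    seen = set()   # fast membership checks
    result = []    # keeps the order in which links were found
    for href in hrefs:
        if href not in seen:
            seen.add(href)
            result.append(href)
    return result


# Hypothetical sample data standing in for the scraped hrefs
hrefs = [
    "https://www.somewebsite/a",
    "https://www.somewebsite/b",
    "https://www.somewebsite/a",
]
sg = unique_in_order(hrefs)
print(sg)
```

The result is still a list, so the rest of the code can keep using `sg` as before.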

Answer 2

You can use a set instead of a list:

sg = set()
for url in soup.find_all('a', attrs={'href': re.compile("^https://www.somewebsite")}):
    print(url['href'])
    sg.add(url['href'])
print(sg)
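
One thing to keep in mind is that a set does not remember the order in which the links appeared. If order matters, a possible alternative (a sketch, not the only option) is `dict.fromkeys`, which relies on dicts preserving insertion order in Python 3.7+. The sample `hrefs` list below is hypothetical and stands in for the values collected from `soup.find_all`.

```python
# Hypothetical sample data standing in for the scraped hrefs
hrefs = [
    "https://www.somewebsite/b",
    "https://www.somewebsite/a",
    "https://www.somewebsite/b",
]

# dict keys are unique and keep insertion order, so this drops
# later duplicates while preserving first-seen order
sg = list(dict.fromkeys(hrefs))
print(sg)
```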

Answer 3

Use a set instead of a list and the problem is solved.

sg = set()
for url in soup.find_all('a', attrs={'href': re.compile("^https://www.somewebsite")}):
    print(url['href'])
    sg.add(url['href'])
print(sg)

Remove duplicate URLs from a Python list

3 answers: