Removing duplicate URLs in Python (without a list)

Asked: 2017-03-06 08:43:35

Tags: python python-3.x loops web-scraping duplicates

I need help removing duplicate URLs from my output. If possible, I would like to do it without putting everything into a list first. I suspect it can be done with a few conditional statements; I'm just not sure how to implement it. Using Python 3.6.

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
from urllib.parse import urljoin as join

my_url = 'https://www.census.gov/programs-surveys/popest.html'

uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "html.parser")

filename = "LinkScraping.csv"
f = open(filename, "w")
headers = "Web_Links\n"
f.write(headers)

links = page_soup.findAll('a')

for link in links:
    web_links = link.get("href")      # None when the anchor has no href attribute
    ab_url = join(my_url, web_links)  # resolve relative links against the page URL
    print(ab_url)
    if ab_url:
        f.write(str(ab_url) + "\n")

f.close()

1 Answer:

Answer 0 (score: 1):

You can't achieve this without some kind of data structure, unless you want to write to the file and re-read it over and over again (which would be far worse than using an in-memory data structure). A set is the natural choice here: it stores each URL only once, and membership tests are O(1) on average.

Use a set:

.
.
.

urls_set = set()  # every URL written so far

for link in links:
    web_links = link.get("href")
    ab_url = join(my_url, web_links)
    print(ab_url)
    if ab_url and ab_url not in urls_set:  # skip anything already written
        f.write(str(ab_url) + "\n")
        urls_set.add(ab_url)
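
For reference, the same idea can be condensed into a set comprehension followed by a single write loop. This is a minimal sketch under the question's setup, not part of the original answer: the href guard and the sorted() call are my additions (sorted() only makes the output order deterministic, since sets are unordered).

from urllib.request import urlopen as uReq
from urllib.parse import urljoin as join
from bs4 import BeautifulSoup as soup

my_url = 'https://www.census.gov/programs-surveys/popest.html'

uClient = uReq(my_url)
page_soup = soup(uClient.read(), "html.parser")
uClient.close()

# Build the deduplicated set of absolute URLs in one pass;
# the guard skips <a> tags that have no href attribute.
urls_set = {join(my_url, link.get("href"))
            for link in page_soup.findAll('a')
            if link.get("href")}

with open("LinkScraping.csv", "w") as f:
    f.write("Web_Links\n")
    for ab_url in sorted(urls_set):
        f.write(ab_url + "\n")

The trade-off is that this version holds all URLs in memory before writing and loses the original document order, whereas the answer's loop writes each new URL as soon as it is seen.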