Unique list of HTML links using Python

Date: 2018-02-28 15:00:18

Tags: python html python-3.x

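I have a script that extracts all links from a website. I thought converting to a list would ensure that I only return unique links, but the output still contains duplicates (e.g. 'www.commerce.gov/' and 'www.commerce.gov'): the code does not account for the trailing character. My code is below. Any help is appreciated. Thanks.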

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re
import csv

req = Request("https://www.census.gov/programs-surveys/popest.html")
html_page = urlopen(req)
soup = BeautifulSoup(html_page, "lxml")

prettyhtml = soup.prettify()
Html_file = open(r"U:\python_intro\popest_html.txt","w")
Html_file.write(prettyhtml)
Html_file.close()

links = []
for link in soup.findAll('a', attrs={'href': re.compile(r'^(?:http|ftp)s?://')}):
    links.append(link.get('href'))

links = set(links)

myfile = r"U:\python_stuff\links.csv"

with open(myfile, "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    for a in links:
        writer.writerow([a])

1 answer:

Answer 0 (score: 3)

  1. You mean "converting to a set", not a list.

  2. You can strip any trailing '/' before storing each link:

    links.append(link.get('href').rstrip('/'))
    
  3. Or, even better, build a set from the start:

    links = set()
    for link in soup.findAll('a', attrs={'href': re.compile(r'^(?:http|ftp)s?://')}):
        links.add(link.get('href').rstrip('/'))
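
The points above can be tried without fetching the page. What follows is only a minimal sketch: the helper name `unique_links`, the constant `SCHEME_RE`, and the sample URLs are made up for illustration and are not part of the original script. It reuses the question's scheme pattern and the `rstrip('/')` normalization on a plain list of href strings.

```python
import re

# Same scheme filter as in the question: http, https, ftp, ftps.
SCHEME_RE = re.compile(r'^(?:http|ftp)s?://')


def unique_links(hrefs):
    """Return the set of absolute links, treating 'x/' and 'x' as the same link.

    Hypothetical helper for illustration; `hrefs` is any iterable of
    href values (strings or None).
    """
    links = set()
    for href in hrefs:
        # Skip missing hrefs and anything that is not an absolute http/ftp URL.
        if href and SCHEME_RE.match(href):
            # rstrip('/') removes every trailing slash, not just one.
            links.add(href.rstrip('/'))
    return links


# Sample input (assumed values, not taken from the census.gov page).
sample = [
    "https://www.commerce.gov/",
    "https://www.commerce.gov",
    "/programs-surveys/popest.html",  # relative link, filtered out
    None,                             # an <a> tag without an href
]

result = unique_links(sample)
print(sorted(result))
```

With BeautifulSoup, the same helper could presumably be fed `(a.get('href') for a in soup.find_all('a'))`. Note that a set only removes exact string duplicates, so any other variation (letter case, query strings, `http` versus `https`) would still need its own normalization.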