How to prevent duplicate links in Python code when scraping a website for HTML links

Asked: 2017-12-02 15:35:56

Tags: python-3.x

I have the following Python code, but I am having trouble removing duplicate links from the results.

search_results_links = []
for i in range(len(search_results)):
    if search_results[i]['href'] == "":
        continue
    elif (search_results[i]['href'][0] == "/"):
        search_results_links.append("https://www.census.gov"+search_results[i]['href'])
    elif (search_results[i]['href'][0] == "#") :
        continue
    elif (search_results[i]['href'][0] == "j") :
        continue
    else:
        search_results_links.append(search_results[i]['href'])

# Remove duplicates.
search_results_links.sort()
search_results_links2 = []

for i in range(len(search_results_links)):
    if search_results_links[i][:-1] == search_results_links[i - 1]:
        continue
    else:
        search_results_links2.append(search_results_links[i])

How can I update this code so that it extracts only unique links?

1 answer:

Answer 0 (score: 0)

Instead of storing all the links in a list, use a set.
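As a quick illustration of why a set helps, here is a minimal sketch. The URLs are made up for the example and are not taken from the page being scraped. Adding the same string twice leaves only one copy in the set:

```python
# Hypothetical example links, used only to show how a set behaves.
links = set()
links.add("https://www.census.gov/data")
links.add("https://www.census.gov/data")    # duplicate, silently ignored
links.add("https://www.census.gov/topics")

print(len(links))  # 2
```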

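Assuming everything else in your code works correctly, collecting the links in a set means you never have to remove duplicates afterwards, because a set stores each value only once (an explicit membership check before adding is harmless but not required). Something like this: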

search_results_links = set()
for result in search_results:
    href = result['href']
    # Skip empty hrefs, in-page anchors ("#...") and hrefs starting with "j"
    # (e.g. "javascript:" links), as in the original code.
    if href == "" or href[0] in ("#", "j"):
        continue
    # Site-relative paths get the site's base URL prepended.
    if href[0] == "/":
        href = "https://www.census.gov" + href
    # set.add() ignores values that are already present.
    search_results_links.add(href)
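Note that a set does not remember the order in which links were found. If the order matters, one possible variation is to use a dict as an ordered set (dicts keep insertion order in Python 3.7+). The sketch below is only an illustration: the helper name `unique_links` and the sample hrefs are hypothetical, and it takes a plain list of href strings rather than the `search_results` objects from the question. It also differs slightly from the code above: it uses `urllib.parse.urljoin` to build absolute URLs (so relative paths without a leading "/" are resolved against the base as well), and it skips only hrefs that start with "javascript:" rather than everything starting with "j".

```python
from urllib.parse import urljoin

BASE_URL = "https://www.census.gov"  # base URL used in the question


def unique_links(hrefs, base_url=BASE_URL):
    """Return absolute links in first-seen order, without duplicates."""
    seen = {}  # dict keys act as an insertion-ordered set
    for href in hrefs:
        # Skip empty hrefs, in-page anchors and javascript: pseudo-links.
        if not href or href.startswith("#") or href.lower().startswith("javascript:"):
            continue
        # urljoin leaves absolute URLs untouched and resolves relative ones.
        seen[urljoin(base_url, href)] = None
    return list(seen)


# Hypothetical sample input.
sample = ["/data", "#top", "", "javascript:void(0)", "https://example.com/a", "/data"]
print(unique_links(sample))
```

With the question's data, the input could be built with something like `[r['href'] for r in search_results]`, assuming every result has an `href` attribute as the original code does.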