I wrote a simple script that works for scraping the website, but when I try to make it skip duplicates, it does not work. This is the logic I thought would stop it from scraping duplicates:
import requests
import time
from bs4 import BeautifulSoup
import sys

f = open("links.txt", "a")
list_ = []

while True:
    try:
        URL = f'WEBSITEURL.COM'
        page = requests.get(URL)
        time.sleep(1)
        soup = BeautifulSoup(page.text, 'html.parser')
        data = soup.findAll('div', attrs={'class': 'card-content'})
        for div in data:
            links = div.findAll('a')
            for a in links:
                if a not in list_:
                    f.write(a['href'])
                    f.write('\n')
                    print(a['href'])
                elif:
                    continue
    except Exception as e:
        print('something went wrong')
        #continue
Answer 0 (score: 1)
In Python, set is the best built-in data structure for keeping non-duplicate records. For your case, first update the set with all the links, then write the links to the file.
import requests
import time
from bs4 import BeautifulSoup
import sys

list_ = set()

while True:
    try:
        URL = f'WEBSITEURL.COM'
        page = requests.get(URL)
        time.sleep(1)
        soup = BeautifulSoup(page.text, 'html.parser')
        data = soup.findAll('div', attrs={'class': 'card-content'})
        for div in data:
            links = div.findAll('a')
            # collect the href strings; the set silently drops duplicates
            list_.update(a['href'] for a in links)
    except Exception as e:
        print('something went wrong')
        #continue
    # rewrite the file with every unique link collected so far
    with open("links.txt", "w") as f:
        f.write("\n".join(list_))
If you still get the something went wrong error, it is not related to the links; the problem is somewhere else in the code guarded by your try block.
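
To find out what actually fails, print the exception (or the full traceback) instead of a fixed message; it will show exactly which line inside the try block raised the error. A minimal sketch, where scrape_once is only a hypothetical stand-in for the request-and-parse code:

import traceback

def scrape_once():
    # hypothetical stand-in for the requests/BeautifulSoup code inside the try block
    raise ValueError("example failure")

try:
    scrape_once()
except Exception:
    # print the full traceback instead of a generic message
    traceback.print_exc()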