I have a simple BeautifulSoup script that periodically scrapes data from a page and saves it as a JSON file. However, every time it runs it walks through largely the same set of URLs and scrapes much of the same data, along with whatever has been newly published. How can I avoid the duplication?
I tried pickling the URLs that have already been scraped, but I can't work out the logic to stop the scrape from repeating work it has already done.
import json
import pickle
import time

import requests
from bs4 import BeautifulSoup

# urlrange, datalinks and data are defined earlier in the script
for i in urlrange:
    urlbase = 'https://www.example.com/press-releases/Pages/default.aspx?page='
    targeturl = urlbase + str(i)
    req = requests.get(targeturl)
    r = req.content
    soup = BeautifulSoup(r, 'lxml')
    for row in soup.find_all('table', class_='t-press'):
        for link in row.find_all('a'):
            link = link.get('href')
            link = 'https://www.example.com' + link
            if link not in datalinks:
                datalinks.append(link)
                #print('New link found!')
            else:
                continue

pickling_on = open("links_saved.pkl", "wb")
pickle.dump(datalinks, pickling_on)
pickling_on.close()

for j in datalinks:
    req = requests.get(j)
    r = req.content
    soup = BeautifulSoup(r, 'lxml')
    for textdata in soup.find_all('div', class_='content-slim'):
        textdata = textdata.prettify()
        data.append({j: textdata})

json_name = "Press_Data_{}.json".format(time.strftime("%d-%m-%y"))
with open(json_name, 'w') as outfile:
    json.dump(data, outfile)
I want to scrape the data without walking back through the URLs the script has already processed.
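The goal, roughly, is logic along these lines: load the pickled links from the last run, request only the pages that are not in that set, then save the updated set. Below is a minimal sketch of that idea, reusing the links_saved.pkl file from the code above; variable names such as seen_links are only illustrative, not part of the original script.

import os
import pickle

import requests
from bs4 import BeautifulSoup

# Links that earlier runs have already scraped, if the pickle file exists.
seen_links = set()
if os.path.exists("links_saved.pkl"):
    with open("links_saved.pkl", "rb") as f:
        seen_links = set(pickle.load(f))

datalinks = []   # in the real script this is the list collected by the first loop above

data = []
for link in datalinks:
    if link in seen_links:          # already scraped on a previous run, skip it
        continue
    soup = BeautifulSoup(requests.get(link).content, 'lxml')
    for textdata in soup.find_all('div', class_='content-slim'):
        data.append({link: textdata.prettify()})
    seen_links.add(link)            # remember it for the next run

# Persist the updated set so the next run skips everything scraped so far.
with open("links_saved.pkl", "wb") as f:
    pickle.dump(seen_links, f)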
Answer 0 (score: 1)
Try storing the links in a set.
datalinks = [ ]
unique_links = set(datalinks)
Building a set from the list removes any duplicate links, so only unique links get processed.
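For example, converting the collected list to a set drops the duplicates and makes the membership check cheap. A small sketch with made-up URLs:

datalinks = ['url1', 'url2', 'url2', 'url3', 'url3']
unique_links = set(datalinks)            # {'url1', 'url2', 'url3'}

if 'url2' in unique_links:               # set lookups are O(1), list lookups are O(n)
    print('already collected, skip it')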
Answer 1 (score: 0)
Try something like this:

listwithdups = ['url1', 'url2', 'url3', 'url2', 'url4', 'url4']

uniqueList = []                  # start with an empty list
for i in listwithdups:
    if i not in uniqueList:      # keep only the first occurrence
        uniqueList.append(i)

(Note that this cannot be collapsed into a single list comprehension like uniqueList = [i for i in listwithdups if i not in uniqueList], because the comprehension would refer to uniqueList before the list exists.)
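For a long URL list, the i not in uniqueList check rescans the list on every iteration; a set can take over the membership test while the list keeps the original order. A variant sketch, not part of the original answer:

listwithdups = ['url1', 'url2', 'url3', 'url2', 'url4', 'url4']

seen = set()
uniqueList = []
for i in listwithdups:
    if i not in seen:            # O(1) lookup instead of scanning the whole list
        seen.add(i)
        uniqueList.append(i)

print(uniqueList)                # ['url1', 'url2', 'url3', 'url4']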