I'm trying to write a program that pulls the HTML from a web page and compares it against previously scraped data I've saved. If something has changed, it should save the new HTML to a text file and email me. The problem is that it only sporadically writes to the text files, or doesn't write at all, and then it randomly emails me even when nothing has changed. I've been playing with this for two weeks and can't figure out what's going on. Help!
import requests
import smtplib
import bs4
import os

abbvs = ['MCL', 'PFL', 'OPPL', 'FCPL', 'AnyPL', 'NOLS', 'VanWaPL', 'SLCPL', 'ProPL', 'ArapPL']
openurls = open('/home/ian/PythonPrograms/job-scrape/urls', 'r')
urls = openurls.read().strip('\n').split(',')
olddocs = ['oldMCL', 'oldPFL', 'oldOPPL', 'oldFCPL', 'oldAnyPL', 'oldNOLS', 'oldVanWaPL', 'oldSLCPL', 'oldProPL', 'oldArapPL']
newdocs = ['newMCL', 'newPFL', 'newOPPL', 'newFCPL', 'newAnyPL', 'newNOLS', 'newVanWaPL', 'newSLCPL', 'newProPL', 'newArapPL']
bstags = ['#content', '.col-md-12', '#main', '#containedInVSplit', '.col-sm-7', '.statement-left-div', '#main', '#main', '#componentBox', '.list-group.job-listings']

for url in urls:
    res = requests.get(url)
    res.raise_for_status()
    for bstag in bstags:
        currentsoup = bs4.BeautifulSoup(res.text, "lxml")
        newsoup = currentsoup.select(bstag)
        for newdoc in newdocs:
            if os.path.isfile('/home/ian/Pythonprograms/job-scrape/libsitehtml/'+newdoc) == False:
                createnew = open('/home/ian/PythonPrograms/job-scrape/libsitehtml/'+newdoc, 'w')
            file = open('/home/ian/PythonPrograms/job-scrape/libsitehtml/'+newdoc, 'w')
            file.write(str(newsoup))
            file.close()
            new = open('/home/ian/PythonPrograms/job-scrape/libsitehtml/'+newdoc)
            new = new.read()
            for olddoc in olddocs:
                if os.path.isfile('/home/ian/Pythonprograms/job-scrape/libsitehtml/'+olddoc) == False:
                    createold = open('/home/ian/PythonPrograms/job-scrape/libsitehtml/'+olddoc, 'w')
                old = open('/home/ian/PythonPrograms/job-scrape/libsitehtml/'+olddoc)
                old = old.read()
                if str(old) != str(new):
                    file = open('/home/ian/PythonPrograms/job-scrape/libsitehtml/'+olddoc, 'w')
                    file.write(str(new))
                    file.close()
                    server = smtplib.SMTP('smtp.gmail.com', 587)
                    server.ehlo()
                    server.starttls()
                    server.login('dummyemail', 'password')
                    server.sendmail('noreply.job.updates.com', 'myemail', 'Subject: A library\'s jobs page has changed\n' '\n' + 'Here\'s the URL:' + str(url))
                    server.quit()
                elif str(old) == str(new):
                    pass
Answer (score: 1)
There are a few problems with your code. The main one is that each of your loops runs to completion, so you effectively only check the last site. You need to run the comparison once for each matching set of abbv, url, and bstag. For that there's a handy Python built-in called zip(), which is easy to understand.
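To illustrate with toy lists (the URLs below are placeholders, not the actual ones from the question), zip() walks several sequences in lockstep, yielding one tuple per position:

```python
# zip() pairs up corresponding elements of several sequences,
# which is exactly what matching each abbreviation with its own
# URL and CSS selector requires.
abbvs = ['MCL', 'PFL']
urls = ['http://example.com/mcl', 'http://example.com/pfl']
bstags = ['#content', '.col-md-12']

for abbv, url, bstag in zip(abbvs, urls, bstags):
    print(abbv, url, bstag)
```

Note that zip() stops at the shortest input, so if one of your three lists is missing an entry, the trailing sites are silently skipped.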
Also, you don't need to store the newly scraped data, since it can be compared against the old data directly (and only written out if it has changed). With those changes, your code could look something like:
import requests
import smtplib
import bs4
import os

abbvs = ['MCL', 'PFL', 'OPPL', 'FCPL', 'AnyPL', 'NOLS', 'VanWaPL', 'SLCPL', 'ProPL', 'ArapPL']
with open('/home/ian/PythonPrograms/job-scrape/urls', 'r') as openurls:
    urls = openurls.read().strip('\n').split(',')
bstags = ['#content', '.col-md-12', '#main', '#containedInVSplit', '.col-sm-7', '.statement-left-div', '#main', '#main', '#componentBox', '.list-group.job-listings']

for abbv, url, bstag in zip(abbvs, urls, bstags):
    res = requests.get(url)
    res.raise_for_status()
    olddoc = 'old' + abbv
    currentsoup = bs4.BeautifulSoup(res.text, "lxml")
    newsoup = str(currentsoup.select(bstag))
    # Capitalisation matters: your original had 'Pythonprograms' in the
    # isfile() checks but 'PythonPrograms' in the open() calls, so the
    # existence check could never match the file actually written.
    filepath = '/home/ian/PythonPrograms/job-scrape/libsitehtml/' + olddoc
    if os.path.isfile(filepath):
        with open(filepath) as old:
            oldsoup = old.read()
    else:
        oldsoup = ''
    if newsoup != oldsoup:
        with open(filepath, 'w') as new:
            new.write(newsoup)
        server = smtplib.SMTP('smtp.gmail.com', 587)
        server.ehlo()
        server.starttls()
        server.login('dummyemail', 'password')
        server.sendmail('noreply.job.updates.com', 'myemail',
                        'Subject: A library\'s jobs page has changed\n\n'
                        'Here\'s the URL: ' + str(url))
        server.quit()
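A side note on the e-mail itself: hand-concatenating the headers and body into one string for sendmail is fragile. The standard library's email.message.EmailMessage builds a correctly formatted message instead (the addresses below are placeholders):

```python
from email.message import EmailMessage

# Build a well-formed message rather than splicing 'Subject:' and the
# body together by hand; headers and content stay cleanly separated.
msg = EmailMessage()
msg['Subject'] = "A library's jobs page has changed"
msg['From'] = 'noreply@example.com'   # placeholder sender
msg['To'] = 'me@example.com'          # placeholder recipient
msg.set_content("Here's the URL: http://example.com/jobs")

# An authenticated smtplib.SMTP session can then send it with:
# server.send_message(msg)
```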
I haven't tested the above, so it may contain some mistakes, but it should be a start. You should also consider using a dict with the abbvs as keys and the urls as values, since they're so tightly connected.
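A minimal sketch of that dict idea (placeholder URLs and selectors; here each value bundles both the URL and the selector, a slight extension of the keys-and-values suggestion above), which keeps the parallel lists from drifting out of sync:

```python
# One entry per site, keyed by abbreviation. Adding or removing a
# site touches exactly one place instead of three parallel lists.
sites = {
    'MCL': {'url': 'http://example.com/mcl', 'tag': '#content'},
    'PFL': {'url': 'http://example.com/pfl', 'tag': '.col-md-12'},
}

for abbv, info in sites.items():
    print(abbv, info['url'], info['tag'])
```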