Trying to write and compare text files with scraped text, but it's not working properly

Date: 2017-07-12 21:50:18

Tags: python beautifulsoup bs4

I'm trying to write a program that pulls HTML from web pages and compares it to previously scraped data that I've saved. If something has changed, it saves the new HTML to a text file and emails it to me. The problem is that it only sometimes writes the text to the text file, or doesn't write it at all, and then randomly emails me even when nothing has changed. I've been fiddling with this for two weeks and can't figure out what's going on. Help!

import requests
import smtplib
import bs4
import os

abbvs = ['MCL', 'PFL', 'OPPL', 'FCPL', 'AnyPL', 'NOLS', 'VanWaPL', 'SLCPL', 'ProPL', 'ArapPL']
openurls = open('/home/ian/PythonPrograms/job-scrape/urls', 'r')
urls = openurls.read().strip('\n').split(',')
olddocs = ['oldMCL', 'oldPFL', 'oldOPPL', 'oldFCPL', 'oldAnyPL', 'oldNOLS', 'oldVanWaPL', 'oldSLCPL', 'oldProPL', 'oldArapPL']
newdocs = ['newMCL', 'newPFL', 'newOPPL', 'newFCPL', 'newAnyPL', 'newNOLS', 'newVanWaPL', 'newSLCPL', 'newProPL', 'newArapPL']
bstags = ['#content', '.col-md-12', '#main', '#containedInVSplit', '.col-sm-7', '.statement-left-div', '#main', '#main', '#componentBox', '.list-group.job-listings']

for url in urls: 
    res = requests.get(url)
    res.raise_for_status()
for bstag in bstags:
    currentsoup = bs4.BeautifulSoup(res.text, "lxml")
    newsoup = currentsoup.select(bstag)
for newdoc in newdocs:
    if os.path.isfile('/home/ian/Pythonprograms/job-scrape/libsitehtml/'+newdoc) == False:
        createnew = open('/home/ian/PythonPrograms/job-scrape/libsitehtml/'+newdoc, 'w')

    file = open('/home/ian/PythonPrograms/job-scrape/libsitehtml/'+newdoc, 'w')
    file.write(str(newsoup)) 
    file.close()

    new = open('/home/ian/PythonPrograms/job-scrape/libsitehtml/'+newdoc)
    new = new.read()
for olddoc in olddocs:
    if os.path.isfile('/home/ian/Pythonprograms/job-scrape/libsitehtml/'+olddoc) == False:
        createold = open('/home/ian/PythonPrograms/job-scrape/libsitehtml/'+olddoc, 'w')

    old = open('/home/ian/PythonPrograms/job-scrape/libsitehtml/'+olddoc)
    old = old.read()

if str(old) != str(new):
    file = open('/home/ian/PythonPrograms/job-scrape/libsitehtml/'+olddoc, 'w') 
    file.write(str(new))
    file.close()

    server = smtplib.SMTP('smtp.gmail.com', 587)
    server.ehlo()
    server.starttls()
    server.login('dummyemail', 'password')
    server.sendmail('noreply.job.updates.com', 'myemail', 'Subject: A library\'s jobs page has changed\n' '\n' + 'Here\'s the URL:' + str(url))
    server.quit()
elif str(old) == str(new):
    pass

1 Answer:

Answer 0 (score: 1)

There are a few problems with your code. The main one is that each loop runs to completion on its own, so you are effectively only checking the last site. You need to run the comparison for each set of abbv, url and bstag together. For that there is a handy Python function called zip(), which is easy to understand.
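
For example, zip() walks the lists in lockstep and yields one tuple per position. A minimal sketch using the question's variable names (the URLs here are placeholders, not the real ones):

abbvs = ['MCL', 'PFL']
urls = ['http://example.com/mcl-jobs', 'http://example.com/pfl-jobs']  # placeholder URLs
bstags = ['#content', '.col-md-12']

# zip() pairs the i-th items of each list and stops at the shortest list.
for abbv, url, bstag in zip(abbvs, urls, bstags):
    print(abbv, url, bstag)
# MCL http://example.com/mcl-jobs #content
# PFL http://example.com/pfl-jobs .col-md-12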

Also, you don't need to store the newly scraped data, since it can be compared against the old data directly (and only written out if it has changed). With those changes your code could look something like:

import requests
import smtplib
import bs4
import os

abbvs = ['MCL', 'PFL', 'OPPL', 'FCPL', 'AnyPL', 'NOLS', 'VanWaPL', 'SLCPL', 'ProPL', 'ArapPL']
openurls = open('/home/ian/PythonPrograms/job-scrape/urls', 'r')
urls = openurls.read().strip('\n').split(',')
bstags = ['#content', '.col-md-12', '#main', '#containedInVSplit', '.col-sm-7', '.statement-left-div', '#main', '#main', '#componentBox', '.list-group.job-listings']

for abbv, url, bstag in zip(abbvs, urls, bstags):
    res = requests.get(url)
    res.raise_for_status()
    olddoc = 'old'+abbv
    currentsoup = bs4.BeautifulSoup(res.text, "lxml")
    newsoup = str(currentsoup.select(bstag))

    filepath = '/home/ian/PythonPrograms/job-scrape/libsitehtml/'+olddoc
    if os.path.isfile(filepath):
        with open(filepath) as old:
            oldsoup = old.read()
    else:
        oldsoup = ''

    if newsoup != oldsoup:
        with open(filepath, 'w') as new:
            new.write(newsoup)
        server = smtplib.SMTP('smtp.gmail.com', 587)
        server.ehlo()
        server.starttls()
        server.login('dummyemail', 'password')
        server.sendmail('noreply.job.updates.com', 'myemail', 'Subject: A library\'s jobs page has changed\n' '\n' + 'Here\'s the URL:' + str(url))
        server.quit()

I haven't tested the above, so it may contain a few errors, but it should be a starting point. You should also consider using a dict with the abbvs as keys and the urls as values, since they belong tightly together.
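
A rough sketch of that idea (the URLs are hypothetical, just to show the shape of the data):

# Each abbreviation stays attached to its URL, so the values can never
# drift out of order the way parallel lists can; the selectors could be
# kept in a similar dict keyed by the same abbreviations.
sites = {
    'MCL': 'http://example.com/mcl-jobs',   # placeholder URL
    'PFL': 'http://example.com/pfl-jobs',   # placeholder URL
}

for abbv, url in sites.items():
    olddoc = 'old' + abbv
    ...  # same scrape-and-compare logic as above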