Web scraping: back-in-stock notification

Date: 2018-01-04 12:49:07

Tags: python html web beautifulsoup screen-scraping

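I want to set up a Python script that tells me whether a product is in stock. At the moment it scrapes the URL below and parses the relevant part of the page, but I can't figure out how to take the output, which I've called stock, store it in another variable called stock_history, and then run another line that asks whether stock equals stock_history or not.

I'm also getting an "EOL while scanning string literal" error when I try to store the HTML in stock_history. Is there a better way to do this?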

import requests
from datetime import datetime 
from bs4 import BeautifulSoup
import csv
now = datetime.now()
#enter website address
url = requests.get('https://shop.bitmain.com/antminer_s9_asic_bitcoin_miner.htm')

soup = BeautifulSoup(url.content,'html')

stock = (soup.find("div", "buy-now-bar-con"))

stock_history = '<div class="buy-now-bar-con">
<a class="current" href="antminer_s9_asic_bitcoin_miner.htm?flag=overview">Overview</a>
<a href="antminer_s9_asic_bitcoin_miner.htm?flag=specifications">Specification</a>
<a href="antminer_s9_asic_bitcoin_miner.htm?flag=gallery">Gallery</a>
<a class="btn-buy-now" href="javascript:;" style="background:#a7a4a4; cursor:not-allowed;" target="_self" title="sold out!">Coming soon</a>
</div>'


print(stock)

if stock == stock_history 
    print("still not in stock")

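1 Answer:

Answer 0 (score: 1)

First, EOL stands for "end of line". You usually get this error when Python doesn't like how you defined a string, for example when a single-quoted string runs across several lines, or when the string contains some odd characters. To avoid it, you can triple-quote the string in your original code, like this: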

stock_history = '''<div class="buy-now-bar-con">
<a class="current" href="antminer_s9_asic_bitcoin_miner.htm?flag=overview">Overview</a>
<a href="antminer_s9_asic_bitcoin_miner.htm?flag=specifications">Specification</a>
<a href="antminer_s9_asic_bitcoin_miner.htm?flag=gallery">Gallery</a>
<a class="btn-buy-now" href="javascript:;" style="background:#a7a4a4; cursor:not-allowed;" target="_self" title="sold out!">Coming soon</a>
</div>'''
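
To see the difference in isolation, here is a minimal sketch that compiles a single-quoted and a triple-quoted version of a shortened, made-up snippet. The snippet and the helper name `compiles` are illustrative only, and the exact wording of the error depends on the Python version.

```python
# Two tiny pieces of source text; each contains a real line break inside the literal.
broken_source = "html = '<div>\n</div>'"        # single quotes: literal must end on its own line
working_source = "html = '''<div>\n</div>'''"   # triple quotes: literal may span lines

def compiles(source):
    """Return True if the source text is valid Python, False on a SyntaxError."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False

print(compiles(broken_source))
print(compiles(working_source))
```

The triple-quoted form keeps the line break as part of the string, so comparing it against scraped HTML only works if the whitespace matches exactly.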

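That's ugly, so I got rid of the big string, because it isn't needed. The only information you need from the stock variable is whether the product is sold out. To get it, you can convert the bs4.element.Tag to a str and use a regular expression to check for the "sold out!" substring. Whether you are scraping, processing text data, or doing any kind of XML or HTML parsing, regular expressions come in handy, so I encourage you to read up on them.

More information: https://www.regular-expressions.info/

You can easily test Python regex captures here: https://pythex.org/

Here is the modified code, which does what you were trying to do.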
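
As a small standalone illustration of the check described above, the sketch below runs the pattern against plain strings instead of a live page. The helper name `is_sold_out` and the in-stock sample markup are assumptions made up for this example; in the real script the input would be `str(stock)`.

```python
import re

# Compiled once; matches the literal text that appears in the button's title attribute.
SOLD_OUT = re.compile(r"sold out!")

def is_sold_out(fragment):
    """Return True if the HTML fragment (a str) contains 'sold out!'."""
    return SOLD_OUT.search(fragment) is not None

# Sample fragments: the first mirrors the page in the question, the second is hypothetical.
sample_sold_out = '<a class="btn-buy-now" title="sold out!">Coming soon</a>'
sample_in_stock = '<a class="btn-buy-now" title="buy now">Buy now</a>'

print(is_sold_out(sample_sold_out))
print(is_sold_out(sample_in_stock))
print(is_sold_out(str(None)))  # what the check sees when soup.find() returns None
```

Note that a missing div, which `str()` turns into the text "None", also counts as "not sold out" here, so a real script may want to handle that case separately.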

import re
import csv
import requests
from datetime import datetime 
from bs4 import BeautifulSoup

def stock_check(url):
    """Checks url.content for the 'sold out!' substring; url is a requests response."""
    soup = BeautifulSoup(url.content, "lxml") #Uses the lxml parser (requires the lxml package).
    stock = soup.find("div", "buy-now-bar-con") #Check the html tags for sold out/coming soon info.
    stock_status = re.findall(r"(sold out!)", str(stock)) #Returns list of captured substrings, empty if none.
    return stock_status[0] if stock_status else None # "sold out!" if present, otherwise None.

now = datetime.now()

url = requests.get('https://shop.bitmain.com/antminer_s9_asic_bitcoin_miner.htm')

if stock_check(url) == "sold out!":
    print(str(now) + ": Still not in stock...")
else:
    print(str(now) + ": Now in stock!")

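Give it a try, and let me know if you have any questions!

Edit: The OP asked how to check the page periodically and include an e-mail notification. A few things had to change from the original solution, such as setting a User-Agent in the requests headers. I also switched the BeautifulSoup object from lxml to html.parser so that the javascript in url.content is handled correctly.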

import re
import time
import smtplib
import requests
from datetime import datetime 
from bs4 import BeautifulSoup

def stock_check(url):
    """Checks url for 'sold out!' substring in buy-now-bar-con"""
    soup = BeautifulSoup(url.content, "html.parser") #Uses the built-in html.parser instead of lxml.
    stock = soup.find("div", "buy-now-bar-con") #Check the html tags for sold out/coming soon info.
    stock_status = re.findall(r"sold out!", str(stock)) #Returns list of matched substrings, empty if none.
    return stock_status # ["sold out!"] if present, otherwise [].

def send_email(address, password, message):
    """Send an e-mail to yourself!"""
    server = smtplib.SMTP("smtp.gmail.com", 587) #e-mail server
    server.ehlo()
    server.starttls()
    server.login(address,password) #login
    message = str(message) #message to email yourself
    server.sendmail(address,address,message) #send the email through dedicated server
    return

def stock_check_listener(page, headers, address, password, run_hours):
    """Periodically checks stock information."""
    start = datetime.now() # start time
    while True:
        url = requests.get(url=page, headers=headers) #Fetch a fresh copy of the page on every pass.
        now = datetime.now() #Timestamp this check (needed in both branches).
        if "sold out!" in stock_check(url): #check page
            print(str(now) + ": Not in stock.")
        else:
            message = str(now) + ": NOW IN STOCK!"
            print(message)
            send_email(address, password, message)
            break #Stop listening once the notification is sent.

        duration = (now - start)
        seconds = duration.total_seconds()
        hours = int(seconds/3600)
        if hours >= run_hours: #check run time
            print("Finished.")
            break

        time.sleep(30*60) #Wait N minutes to check again (here N = 30).
    return

if __name__=="__main__":

    #Set url and userAgent header for javascript issues.
    page = "https://shop.bitmain.com/antminer_s9_asic_bitcoin_miner.htm"
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36',
    'Content-Type': 'text/html'}

    #Run listener to stream stock checks.
    address = "user@gmail.com" #your email
    password = "user.password" #your email password
    stock_check_listener(page=page,
                         headers=headers,
                         address=address,
                         password=password,
                         run_hours=1)

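Now the program starts a while loop that periodically requests the web page. You can set a timeout (in hours) by changing the run_hours variable. You can also set the sleep/wait time (in minutes) by changing N inside stock_check_listener. I'm using gmail in this case; if you get an error while sending the e-mail, you need to follow this link: https://myaccount.google.com/lesssecureapps, and allow less secure apps (your python program) to access your gmail account. Note that Google has since retired this setting for most accounts, so an app password may be needed instead.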
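
If you would rather pass the wait time as an argument than edit the loop, the sketch below separates the timing logic from the page check. The function name `poll_until` and its parameters (`check`, `wait_minutes`, `sleep`, `clock`) are assumptions made up for this example and are not part of the code above; `check` stands for any callable that returns True once the product is available.

```python
import time
from datetime import datetime, timedelta

def poll_until(check, wait_minutes=30, run_hours=1, sleep=time.sleep, clock=datetime.now):
    """Call check() until it returns True or run_hours have elapsed.

    Returns True if check() succeeded, False if the time limit was reached.
    sleep and clock are injectable so the loop can be tried without waiting.
    """
    deadline = clock() + timedelta(hours=run_hours)
    while True:
        if check():                    # e.g. fetch the page and test for "sold out!"
            return True
        if clock() >= deadline:        # give up once the time limit has passed
            return False
        sleep(wait_minutes * 60)       # wait before the next attempt
```

A call such as `poll_until(my_check, wait_minutes=10, run_hours=2)` would then try every ten minutes for about two hours, where `my_check` is a hypothetical function that wraps the request and the "sold out!" test.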