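I am scraping a website. However, I want to write code that keeps scraping the site and prints the data whenever it changes. If the data has not changed, it should print nothing. Basically, this means I would not have to keep clicking Run to see whether the data has changed.
I tried using a while loop, but I don't know how to include the data I receive from the site.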
import urllib.request
from bs4 import BeautifulSoup

theurl = 'xyz'
thepage = urllib.request.urlopen(theurl)
soup = BeautifulSoup(thepage, 'html.parser')
data = soup.find('div', class_='sticky').text
print(data)
Answer 0 (score: 0)
Something like this might do the job:
import time
import urllib.request
from bs4 import BeautifulSoup

theurl = 'http://example.com'

def fetch_data(url):
    # download the page and return the text of the watched element
    thepage = urllib.request.urlopen(url)
    soup = BeautifulSoup(thepage, 'html.parser')
    return soup.find('div', class_='sticky').text

# first iteration
lastdata = fetch_data(theurl)
print(lastdata)
while True:
    time.sleep(30)  # sleep 30 seconds before looping
    data = fetch_data(theurl)
    if data != lastdata:
        # print only when the watched text has changed
        print(data)
        lastdata = data
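Comparing the extracted text rather than the whole parsed page avoids reporting a change when unrelated parts of the page differ between requests. As a rough sketch of one way to store that comparison compactly, the hypothetical helpers `fingerprint` and `changed` below (neither is part of the original answer) hash the whitespace-normalized text, so only a short digest needs to be kept between polls:

```python
import hashlib


def fingerprint(text):
    """Return a SHA-256 hex digest of text with runs of whitespace collapsed."""
    normalized = ' '.join(text.split())
    return hashlib.sha256(normalized.encode('utf-8')).hexdigest()


def changed(previous_digest, text):
    """Return True if text no longer matches the previously stored digest."""
    return fingerprint(text) != previous_digest
```

In the loop above, this would mean keeping `fingerprint(data)` instead of `data` and calling `changed(...)` on each new fetch. Whether whitespace should be ignored depends on the page being watched, so treat the normalization step as an assumption.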
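Answer 1 (score: 0)
This script can help you get started. It fetches the page every second and checks for changes. If the data has changed, it yields the old and new values: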
from bs4 import BeautifulSoup
import requests
from time import sleep

url = 'https://www.random.org/integers/?num=1&min=1&max=2&col=5&base=10&format=html&rnd=new'

def get_data(url):
    return BeautifulSoup(requests.get(url).text, 'lxml')

def watch(url, seconds=1):
    soup = get_data(url)
    old_data = soup.select_one('pre.data').text.strip()
    while True:
        sleep(seconds)
        soup = get_data(url)
        data = soup.select_one('pre.data').text.strip()
        if data != old_data:
            yield old_data, data
            old_data = data

for old_val, new_val in watch(url):
    print('Data changed! Old value was {}, new value is {}'.format(old_val, new_val))
Prints (for example):
Data changed! Old value was 1, new value is 2
Data changed! Old value was 2, new value is 1
Data changed! Old value was 1, new value is 2
Data changed! Old value was 2, new value is 1
Data changed! Old value was 1, new value is 2
Data changed! Old value was 2, new value is 1
...and so on.
You will need to change the URL and select the correct HTML element for your needs.
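To try the change-detection logic without touching the network, the fetching step can be passed in as a function. The following is only a sketch under that assumption: the `fetch`, `max_polls` and `sleep` parameters are hypothetical additions, not part of the script above. It also skips a poll when the fetch raises `OSError`, which is the base class of both `urllib.error.URLError` and `requests.RequestException`.

```python
import time


def watch(fetch, seconds=1.0, max_polls=None, sleep=time.sleep):
    """Yield (old, new) each time fetch() returns a different value.

    fetch: zero-argument callable returning the current data (hypothetical).
    max_polls: stop after this many polls; None means poll forever.
    sleep: injectable so the loop can be run without real delays.
    """
    old = fetch()
    polls = 0
    while max_polls is None or polls < max_polls:
        sleep(seconds)
        polls += 1
        try:
            new = fetch()
        except OSError:
            # Transient network failure: keep the old value and try again later.
            continue
        if new != old:
            yield old, new
            old = new


if __name__ == '__main__':
    # Simulated page contents instead of real requests.
    values = iter(['1', '1', '2', '2', '1'])
    changes = list(watch(lambda: next(values), seconds=0,
                         max_polls=4, sleep=lambda s: None))
    print(changes)  # [('1', '2'), ('2', '1')]
```

With a real page, `fetch` could be something like `lambda: get_data(url).select_one('pre.data').text.strip()`, reusing the function and selector from the script above.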