Question

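I'm scraping a website. However, I want to write code that keeps scraping the site and prints the data whenever it changes. If the data hasn't changed, nothing should be printed. Basically, this means I wouldn't have to keep clicking Run to see whether the data has changed.

I tried using a while loop, but I don't know how to incorporate the data I retrieve from the site.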

import urllib.request

from bs4 import BeautifulSoup

theurl = 'xyz'
thepage = urllib.request.urlopen(theurl)

soup = BeautifulSoup(thepage, 'html.parser')

data = soup.find('div', {'class': 'sticky'}).text

print(data)

Answer 1

Something like this might do the job:

import time
import urllib.request

from bs4 import BeautifulSoup

theurl = 'http://example.com'  # placeholder; use the page you are scraping

# First iteration: fetch the page and print the initial value.
thepage = urllib.request.urlopen(theurl)
soup = BeautifulSoup(thepage, 'html.parser')
lastdata = soup.find('div', {'class': 'sticky'}).text
print(lastdata)

while True:
    time.sleep(30)  # sleep 30 seconds before polling again
    thepage = urllib.request.urlopen(theurl)
    soup = BeautifulSoup(thepage, 'html.parser')
    data = soup.find('div', {'class': 'sticky'}).text
    # Compare the extracted text rather than the whole soup, and
    # remember the new value so each change is printed only once.
    if data != lastdata:
        print(data)
        lastdata = data
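
The change-detection part of such a loop can also be kept separate from the downloading, which makes it easy to try out without hitting a real site. The following is only a minimal sketch under that assumption: iter_changes, fetch, and max_polls are hypothetical names introduced here for illustration, and fetch stands in for whatever function downloads the page and returns the text you care about.

```python
import time
from typing import Callable, Iterator, Optional, Tuple


def iter_changes(
    fetch: Callable[[], str],
    interval: float = 30.0,
    max_polls: Optional[int] = None,
) -> Iterator[Tuple[str, str]]:
    """Yield (old, new) each time fetch() returns a different value.

    fetch is a hypothetical callable that returns the scraped text.
    max_polls bounds the loop for demos; None means poll forever.
    """
    last = fetch()  # baseline value to compare against
    polls = 0
    while max_polls is None or polls < max_polls:
        time.sleep(interval)
        current = fetch()
        polls += 1
        if current != last:
            yield last, current
            last = current  # remember the new value for the next comparison


# Offline demonstration with canned values instead of a real page.
_fake_values = iter(["a", "a", "b", "b", "c"])
changes = list(iter_changes(lambda: next(_fake_values), interval=0, max_polls=4))
print(changes)
```

In real use, fetch would wrap the urlopen and BeautifulSoup calls, and max_polls would be left as None so the loop keeps running.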

Answer 2

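This script can help you get started. It scrapes the page every second and checks for changes. When the value changes, it yields the old and new values: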

from bs4 import BeautifulSoup
import requests
from time import sleep

url = 'https://www.random.org/integers/?num=1&min=1&max=2&col=5&base=10&format=html&rnd=new'

def get_data(url):
    return BeautifulSoup(requests.get(url).text, 'lxml')

def watch(url, seconds=1):
    soup = get_data(url)
    old_data = soup.select_one('pre.data').text.strip()
    while True:
        sleep(seconds)
        soup = get_data(url)
        data = soup.select_one('pre.data').text.strip()
        if data != old_data:
            yield old_data, data
        old_data = data

for old_val, new_val in watch(url):
    print('Data changed! Old value was {}, new value is {}'.format(old_val, new_val))

Prints (for example):

Data changed! Old value was 1, new value is 2
Data changed! Old value was 2, new value is 1
Data changed! Old value was 1, new value is 2
Data changed! Old value was 2, new value is 1
Data changed! Old value was 1, new value is 2
Data changed! Old value was 2, new value is 1

...and so on.

You will need to change the URL and select the correct HTML element for your page.
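
When adapting the selector, note that select_one returns None if nothing matches, so calling .text on the result would raise an AttributeError. Below is a small sketch of an extraction helper that handles that case. extract_text and the sample HTML are hypothetical, made up for illustration, and the built-in 'html.parser' backend is used here instead of 'lxml' only to avoid an extra dependency.

```python
from typing import Optional

from bs4 import BeautifulSoup


def extract_text(html: str, selector: str) -> Optional[str]:
    """Return the stripped text of the first element matching selector, or None."""
    soup = BeautifulSoup(html, "html.parser")
    element = soup.select_one(selector)  # None when nothing matches
    if element is None:
        return None
    return element.get_text(strip=True)


# Hypothetical sample page used only for this demonstration.
sample = '<html><body><div class="sticky"> 42 items </div></body></html>'
found = extract_text(sample, "div.sticky")
missing = extract_text(sample, "pre.data")
print(found, missing)
```

A watcher could treat a None result as "element not found" and skip that poll rather than crash.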

How do I create a while loop that continuously detects whether scraped data has changed?
