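I am trying to scrape the Current Usage value from this local web server. The number updates every second and is produced by a random number generator.
Current Time: 07:25:16 UTC
Current Date: 2018-11-28 UTC
Current Usage: 13 kW
This is what I have tried so far with BeautifulSoup: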
import requests
from bs4 import BeautifulSoup
import time


def get_count():
    url = "http://10.0.0.206/apps/cy8ckit_062_demo/main.html"
    # Request with a fake header, otherwise you will get a 403 HTTP error
    r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})


while True:
    print(get_count())
    time.sleep(8)
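However, when I run this script, I get "None" printed every 8 seconds.
Here is what inspecting the page on the web server shows:
Current Time: 07:39:42 UTC
Current Date: 2018-11-28 UTC
Current Usage: 8 kW
I have been trying to follow this question: How to scrape real time streaming data with Python?
This is the output I got after trying @chitown88's code: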
Traceback (most recent call last):
  File "C:/seniord/csusite/readweb.py", line 14, in <module>
    soup = BeautifulSoup(r.text, 'html.parser')
NameError: name 'r' is not defined
After trying @chitown88's revised code, I get the following output (it does not show the dynamic value, but I assume BeautifulSoup can take care of that):
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<link href="../../styles/buttons.css" rel="stylesheet" type="text/css"/>
<title>CE222494 PSoC 6 WICED WiFi Demo</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<script src="../../scripts/general_ajax_script.js" type="text/javascript"></script>
<script type="text/javascript">
/* <![CDATA[ */
function reloadData()
{
do_ajax('/temp_report.html', ajax_handler);
timeoutID = setTimeout('reloadData()', 500);
}
function ajax_handler( result, data )
{
switch( result )
{
case AJAX_PARTIAL_PROGRESS:
break;
case AJAX_STARTING:
break;
case AJAX_FINISHED:
document.getElementById("currentData").innerHTML = data;
break;
case AJAX_NO_BROWSER_SUPPORT:
document.getElementById("currentData").innerHTML = "Failed - your browser does not support this script";
break;
case AJAX_FAILED:
document.getElementById("currentData").innerHTML = "There was a problem retrieving data";
break;
}
}
/* ]]> */
</script>
</head>
<body onload="reloadData()">
<div id="currentData">Retrieving current usage data...
</div>
</body>
</html>
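Answer 0 (score: 0)
Your code is incomplete. Specifically: 1) you never actually do anything with BeautifulSoup, and 2) your function does not return anything, which is why it prints "None".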
import bs4
from requests_html import HTMLSession
import time


def get_count():
    url = 'http://10.0.0.206/apps/cy8ckit_062_demo/main.html'
    session = HTMLSession()
    r = session.get(url)
    # Run the page's JavaScript so the ajax data gets filled in
    r.html.render(sleep=5, timeout=8)
    # Parse the rendered markup (r.text would still be the raw response)
    soup = bs4.BeautifulSoup(r.html.html, 'html.parser')
    data = soup.find_all('div', {'id': 'currentData'})[0]
    temp_data = data.find_all('p')
    current_time = temp_data[0].text
    current_date = temp_data[1].text
    current_usage = temp_data[2].text
    print('%s\n%s\n%s' % (current_time, current_date, current_usage))


while True:
    get_count()
    time.sleep(8)
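The "None" output mentioned at the top of this answer can be reproduced without any network access. A minimal sketch, using two hypothetical functions (`without_return` and `with_return` are illustrative names, not part of the original script): a Python function that finishes without a `return` statement hands back `None`, so printing its result prints `None`.

```python
def without_return():
    # Builds a value but never returns it, like get_count() in the question.
    usage = "13 kW"


def with_return():
    # Same value, but handed back to the caller.
    usage = "13 kW"
    return usage


print(without_return())  # None
print(with_return())     # 13 kW
```

Either printing inside the function (as the code above does) or returning the value and printing it in the loop avoids the stray "None".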
Answer 1 (score: 0)
main.html is the wrong URL; it only displays the data that it loads via ajax from temp_report.html.
import requests
from bs4 import BeautifulSoup
import time


def get_count():
    url = "http://10.0.0.206/temp_report.html"
    # or
    # url = "http://10.0.0.206/apps/cy8ckit_062_demo/temp_report.html"
    # Request with a fake header, otherwise you will get a 403 HTTP error
    r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    page_source = r.text
    # print(page_source)
    soup = BeautifulSoup(page_source, 'html.parser')
    # html_body = soup.find('body')  # <body>this_text</body>
    # print(html_body.text)  # this_text
    # paragraphs = soup.find_all('p')  # <body> <p>p1</p> <p>p2</p> </body>
    # for p in paragraphs:
    #     print(p.text)  # p1, p2
    # Return the parsed fragment so the loop prints it instead of None
    return soup


while True:
    print(get_count())
    time.sleep(8)
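Once the fragment from temp_report.html has been fetched, the usage figure still has to be pulled out of the text. A minimal sketch, assuming the fragment contains a line such as "Current Usage: 13 kW" as shown in the question; the actual markup returned by the device is not shown in the post, so the sample string and the helper name `parse_usage` are assumptions.

```python
import re


def parse_usage(fragment):
    """Return the usage in kW as an int, or None if no figure is found."""
    # Search the raw fragment; any tags around the line do not matter here.
    match = re.search(r"Current Usage:\s*(\d+)\s*kW", fragment, re.IGNORECASE)
    return int(match.group(1)) if match else None


# Sample fragment; the <p> wrapping is an assumption about the device output.
sample = "<p>Current Time: 07:25:16 UTC</p><p>Current Usage: 13 kW</p>"
print(parse_usage(sample))  # 13
```

In the polling loop, `parse_usage(r.text)` could replace printing the whole page, and a `None` result would signal that the fragment did not look as expected.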