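I'm new to web scraping. My code tries to get the current time from a website. I located the element and tried to read its text() with XPath, but my code always returns "[]". Am I missing something?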
# -*- coding: utf-8 -*-
import requests
from lxml import etree

# Send a browser-like User-Agent (the value must not repeat the header name)
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}
# Download the raw page bytes and parse them as HTML
tree = requests.get('https://www.time.gov/', headers=headers).content
doc_tree = etree.HTML(tree)
# Select the text of the element that shows the time in the browser
links = doc_tree.xpath('//div[@id="lzTextSizeCache"]/div[@class="lzswftext"]/text()')
print(links)
The HTML at that location is:
<div class="lzswftext" style="padding: 0px; overflow: visible; width: auto; height: auto; font-weight: bold; font-style: normal; font-family: Arial, Verdana; font-size: 50px; white-space: pre; display: none;">09:37:26 a.m. </div>
Answer 0 (score: 0)
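Try a headless browser that executes JavaScript, such as Puppeteer or Selenium. You may also need to add some delays in your code so that the page has time to render fully.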
Answer 1 (score: 0)
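You don't get the time because the response to that request doesn't contain it.
That's because the page makes a further request to fetch the time. In this case the request is "https://www.time.gov/actualtime.cgi?disablecache=1546870424051&__lzbc__=wr1d55", which returns the following HTML: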
<timestamp time="1546870996756222" delay="1545324126332171"/>
Some JavaScript code on the page converts that timestamp into a date, and you can emulate that in Python:
In [28]: import requests
In [29]: from datetime import datetime
In [30]: res = requests.get('https://www.time.gov/actualtime.cgi?disablecache=1546870424051&__lzbc__=wr1d55')
2019-01-07 09:34:15 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): www.time.gov:443
2019-01-07 09:34:16 [urllib3.connectionpool] DEBUG: https://www.time.gov:443 "GET /actualtime.cgi?disablecache=1546870424051&__lzbc__=wr1d55 HTTP/1.1" 200 None
In [31]: from bs4 import BeautifulSoup
...:
In [32]: soup = BeautifulSoup(res.text, 'html.parser')
In [34]: soup.timestamp['time']
Out[34]: '1546871656757021'
In [35]: ts = soup.timestamp['time']
In [38]: ts = int(soup.timestamp['time'])
In [39]: ts /= 1000000 # because timestamp is in microseconds
In [40]: print(datetime.utcfromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S'))
...:
2019-01-07 14:34:16
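The steps in the session above can be collected into a small sketch that uses only the standard library. It assumes the endpoint still answers with a single `<timestamp time="..."/>` element whose `time` attribute is microseconds since the Unix epoch, as in the sample response quoted earlier. The function name `parse_timestamp` is made up for illustration, and the network request is left out so the snippet runs offline.

```python
from datetime import datetime, timedelta, timezone
import xml.etree.ElementTree as ET

# Unix epoch as a timezone-aware datetime
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_timestamp(xml_text):
    """Return the UTC datetime encoded in a <timestamp time="..."/> element.

    Assumes the `time` attribute holds microseconds since the Unix epoch.
    """
    element = ET.fromstring(xml_text)
    microseconds = int(element.attrib["time"])
    # Integer microseconds avoid the rounding a float division could introduce
    return EPOCH + timedelta(microseconds=microseconds)


# Sample response body copied from the answer above
sample = '<timestamp time="1546870996756222" delay="1545324126332171"/>'
utc_time = parse_timestamp(sample)
print(utc_time.strftime('%Y-%m-%d %H:%M:%S'))  # 2019-01-07 14:23:16
```

In real use, `sample` would be replaced by `res.text` from the request shown in the session.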
To get the time in your own time zone, see: Convert UTC datetime string to local datetime with Python.
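As a rough sketch of that conversion with only the standard library, a timezone-aware UTC `datetime` can be shifted with `astimezone()`. The fixed UTC-5 offset below is an assumption chosen purely for illustration and ignores daylight saving time. Calling `astimezone()` with no argument uses the machine's local zone instead.

```python
from datetime import datetime, timedelta, timezone

# The UTC time printed in the session above, made timezone-aware
utc_time = datetime(2019, 1, 7, 14, 34, 16, tzinfo=timezone.utc)

# Hypothetical fixed offset for illustration only; it is not DST-aware
fixed_zone = timezone(timedelta(hours=-5))
print(utc_time.astimezone(fixed_zone).strftime('%Y-%m-%d %H:%M:%S'))  # 2019-01-07 09:34:16

# With no argument, astimezone() converts to the system's local time zone
local_time = utc_time.astimezone()
print(local_time.strftime('%Y-%m-%d %H:%M:%S %Z'))
```

The second result depends on the machine running the code, so no expected output is shown for it.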
This may be an overly complicated solution. You could also use something like Selenium or Scrapy + Splash to get the same result a browser does.