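Trying to get certain elements

Time: 2019-04-22 06:19:20

Tags: python html xpath

I am new to the lxml module in Python. I am trying to parse data from the following website: https://weather.com/weather/tenday/l/USCA1037:1:US

I am trying to get the following text: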

<span classname="narrative" class="narrative">
  Cloudy. Low 49F. Winds WNW at 10 to 20 mph.
</span>

However, I have the XPath mixed up.

To be exact, the location of this line is

//*[@id="twc-scrollabe"]/table/tbody/tr[4]/td[2]/span

I tried the following

import requests
import lxml.html
from lxml import etree

html = requests.get("https://weather.com/weather/tenday/l/USCA1037:1:US")

# html.content is the response body as bytes; fromstring() parses it
# and returns the root <html> element.
element_object = lxml.html.fromstring(html.content)

table = element_object.xpath('//div[@class="twc-table-scroller"]')[0]
day_of_week = table.xpath('.//span[@class="date-time"]/text()')  # list of text from the "date-time" spans
dates = table.xpath('.//span[@class="day-detail clearfix"]/text()')

td = table.xpath('.//tbody/tr/td/span[contains(@class, "narrative")]')
print(td)

# print(td) displays an empty list.

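I would like my program to also parse "Cloudy. Low 49F. Winds WNW at 10 to 20 mph."

Please help...

2 Answers:

Answer 0: (score: 0)

Some <td> elements carry the description in a title= attribute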

import requests
import lxml.html

html = requests.get("https://weather.com/weather/tenday/l/USCA1037:1:US")

element_object = lxml.html.fromstring(html.content)
table = element_object.xpath('//div[@class="twc-table-scroller"]')[0]

td = table.xpath('.//tr/td[@class="twc-sticky-col"]/@title')
print(td)

Result

['Mostly cloudy skies early, then partly cloudy after midnight. Low 48F. Winds SSW at 5 to 10 mph.', 
 'Mainly sunny. High 66F. Winds WNW at 5 to 10 mph.', 
 'Sunny. High 71F. Winds NW at 5 to 10 mph.', 
 'A mainly sunny sky. High 69F. Winds W at 5 to 10 mph.', 
 'Some clouds in the morning will give way to mainly sunny skies for the afternoon. High 67F. Winds WSW at 5 to 10 mph.', 
 'Considerable clouds early. Some decrease in clouds later in the day. High 67F. Winds WSW at 5 to 10 mph.', 
 'Partly cloudy. High near 65F. Winds WSW at 5 to 10 mph.', 
 'Cloudy skies early, then partly cloudy in the afternoon. High 61F. Winds WSW at 10 to 20 mph.', 
 'Sunny skies. High 62F. Winds WNW at 10 to 20 mph.', 
 'Mainly sunny. High 61F. Winds WNW at 10 to 20 mph.', 
 'Sunny along with a few clouds. High 64F. Winds WNW at 10 to 15 mph.', 
 'Mostly sunny skies. High around 65F. Winds WNW at 10 to 15 mph.', 
 'Mostly sunny skies. High 66F. Winds WNW at 10 to 20 mph.', 
 'Mainly sunny. High around 65F. Winds WNW at 10 to 20 mph.', 
 'A mainly sunny sky. High around 65F. Winds WNW at 10 to 20 mph.']

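The HTML sent by the server contains no <tbody>. The web browser adds one when it builds the DOM, which is why DevTools shows it, so do not use tbody in the XPath.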
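To see the effect in isolation, here is a minimal sketch that parses a small hand-written table instead of the live page. The markup in RAW_HTML is an assumption made up for illustration (it only borrows the class names mentioned in this thread), not a copy of what weather.com returns.

```python
import lxml.html

# Hypothetical markup standing in for the raw HTML the server sends:
# the rows sit directly inside <table>, with no <tbody>.
RAW_HTML = """
<div class="twc-table-scroller">
  <table>
    <tr><td class="description" title="Sunny. High 71F."><span>Sunny</span></td></tr>
    <tr><td class="description" title="Partly cloudy. High near 65F."><span>Partly Cloudy</span></td></tr>
  </table>
</div>
"""

root = lxml.html.fromstring(RAW_HTML)

# lxml keeps the tree as written, so a path that goes through tbody finds nothing.
with_tbody = root.xpath('.//tbody/tr/td/span/text()')

# Dropping tbody from the path matches the rows.
without_tbody = root.xpath('.//tr/td/span/text()')

print(with_tbody)
print(without_tbody)
```

If a page might or might not include a real <tbody>, writing the path as .//tr/td/... works in both cases, because // matches at any depth.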

Some of the text sits directly in <span></span> and some in nested <span><span></span></span>, so the XPath uses td//span to match spans at any depth

import requests
import lxml.html

html = requests.get("https://weather.com/weather/tenday/l/USCA1037:1:US")

element_object = lxml.html.fromstring(html.content)
table = element_object.xpath('//div[@class="twc-table-scroller"]')[0]

td = table.xpath('.//tr/td//span/text()')
print(td)

Result

['Tonight', 'APR 21', 'Partly Cloudy', '--', '48', '10', '%', 'SSW 7 mph ', '85', '%', 
 'Mon', 'APR 22', 'Sunny', '66', '51', '10', '%', 'WNW 9 mph ', '67', '%', 
 'Tue', 'APR 23', 'Sunny', '71', '53', '0', '%', 'NW 8 mph ', '59', '%', 
 'Wed', 'APR 24', 'Sunny', '69', '52', '10', '%', 'W 9 mph ', '71', '%', 
 'Thu', 'APR 25', 'Partly Cloudy', '67', '51', '10', '%', 'WSW 9 mph ', '71', '%', 
 'Fri', 'APR 26', 'Partly Cloudy', '67', '51', '10', '%', 'WSW 9 mph ', '69', '%', 
 'Sat', 'APR 27', 'Partly Cloudy', '65', '50', '10', '%', 'WSW 9 mph ', '71', '%',   
 'Sun', 'APR 28', 'AM Clouds/PM Sun', '61', '49', '20', '%', 'WSW 13 mph ', '75', '%', 
 'Mon', 'APR 29', 'Sunny', '62', '48', '10', '%', 'WNW 14 mph ', '63', '%', 
 'Tue', 'APR 30', 'Sunny', '61', '49', '0', '%', 'WNW 14 mph ', '61', '%', 
 'Wed', 'MAY 1', 'Mostly Sunny', '64', '50', '0', '%', 'WNW 12 mph ', '60', '%', 
 'Thu', 'MAY 2', 'Mostly Sunny', '65', '50', '0', '%', 'WNW 12 mph ', '61', '%', 
 'Fri', 'MAY 3', 'Mostly Sunny', '66', '51', '0', '%', 'WNW 13 mph ', '61', '%', 
 'Sat', 'MAY 4', 'Sunny', '65', '51', '0', '%', 'WNW 14 mph ', '62', '%', 
 'Sun', 'MAY 5', 'Sunny', '65', '51', '0', '%', 'WNW 14 mph ', '63', '%']
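The list above is flat, so the values for one day have to be picked out by position. A possible alternative, sketched here on made-up markup (SAMPLE is an assumption, not the real page), is to loop over each <tr> and run a relative XPath inside it, which yields one list per row.

```python
import lxml.html

# Hypothetical two-row table: no <tbody>, some text directly in <span>,
# some in nested <span><span>.
SAMPLE = """
<div class="twc-table-scroller">
  <table>
    <tr>
      <td><span class="date-time">Mon</span><span class="day-detail">APR 22</span></td>
      <td><span>Sunny</span></td>
      <td><span><span>10</span><span>%</span></span></td>
    </tr>
    <tr>
      <td><span class="date-time">Tue</span><span class="day-detail">APR 23</span></td>
      <td><span>Sunny</span></td>
      <td><span><span>0</span><span>%</span></span></td>
    </tr>
  </table>
</div>
"""

table = lxml.html.fromstring(SAMPLE)

rows = []
for tr in table.xpath('.//tr'):
    # The leading dot keeps the search inside the current row.
    texts = [t.strip() for t in tr.xpath('./td//span/text()') if t.strip()]
    rows.append(texts)

print(rows)
```

Each inner list then holds the cells of a single day, which avoids slicing the flat list in fixed-size chunks.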

Answer 1: (score: 0)

If you want to grab text such as "Sunny. High 66F. Winds WNW at 5 to 10 mph.", you can get it from the title attribute of the <td>.

This should work.

# tbody is left out of the path because the raw HTML has no <tbody>
td = table.xpath('.//tr/td[@class="description"]/@title')
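As a rough sketch of how the title text could be paired with its day, the snippet below reads both from each row of a small made-up table. SAMPLE and the forecast dictionary are hypothetical, and the class names "date-time" and "description" are taken from this thread rather than checked against the current page.

```python
import lxml.html

# Hypothetical markup; the real page may be structured differently.
SAMPLE = """
<div class="twc-table-scroller">
  <table>
    <tr>
      <td><span class="date-time">Mon</span></td>
      <td class="description" title="Mainly sunny. High 66F. Winds WNW at 5 to 10 mph."><span>Sunny</span></td>
    </tr>
    <tr>
      <td><span class="date-time">Tue</span></td>
      <td class="description" title="Sunny. High 71F. Winds NW at 5 to 10 mph."><span>Sunny</span></td>
    </tr>
  </table>
</div>
"""

table = lxml.html.fromstring(SAMPLE)

forecast = {}
for tr in table.xpath('.//tr'):
    day = tr.xpath('.//span[@class="date-time"]/text()')
    title = tr.xpath('./td[@class="description"]/@title')
    # Skip rows (for example a header row) that lack either piece.
    if day and title:
        forecast[day[0]] = title[0]

print(forecast)
```

Note that a ten-day table can repeat day names such as "Mon", so on real data a list of (day, title) tuples may be a safer container than a dictionary.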