I want to extract various elements from the tables and paragraph text on this website:
https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30655
This is the code I am using:
import lxml
from lxml import html
from lxml import etree
import urllib2
source = urllib2.urlopen('https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30656&SSO=1').read()
x = etree.HTML(source)
growth = x.xpath('//*[@id="home_feature_container"]/div/div[2]/div/table[2]/tbody/tr[3]/td[2]/p')
growth
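What is the best way to extract the elements I want from the website without having to change the XPath in my code every time? They publish new data on the same website every month, but the XPath seems to change slightly from time to time.
Answer 0: (score: 1)
If the position of the items you want changes regularly, try retrieving them by name. For example, this is how to extract the elements from the "New Orders" row of the table.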
import requests #better than urllib
from lxml import html, etree
url = 'https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30655&SSO=1'
page = requests.get(url)
tree = html.fromstring(page.content)
neworders = tree.xpath('//strong[text()="New Orders"]/../../following-sibling::td/p/text()')
print(neworders)
Or, if you want the whole HTML table:
data = tree.xpath('//th[text()="MANUFACTURING AT A GLANCE"]/../..')
for elements in data:
    print(etree.tostring(elements, pretty_print=True))
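The label-first lookup from the first snippet can also be written as a small function that normalizes whitespace before comparing, so a stray space or line break around "New Orders" does not break the match. This is only a sketch: it uses the standard library's ElementTree on a hand-written, well-formed fragment, and the names SAMPLE, cell_text and values_for as well as the numbers in the fragment are made up for illustration. The real page is not guaranteed to be well-formed XML, so there you would keep parsing with lxml.html as above.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment that imitates the table layout; the values are placeholders.
SAMPLE = """
<table>
  <tr><th>Index</th><th>Current</th><th>Previous</th></tr>
  <tr><td><p><strong> New Orders </strong></p></td><td><p>60.0</p></td><td><p>61.0</p></td></tr>
  <tr><td><p><strong>Production</strong></p></td><td><p>58.0</p></td><td><p>59.0</p></td></tr>
</table>
"""

def cell_text(cell):
    """Return all text inside a cell with runs of whitespace collapsed."""
    return " ".join("".join(cell.itertext()).split())

def values_for(table, label):
    """Return the cells that follow the row label, or None if the label is absent."""
    for row in table.iter("tr"):
        cells = [cell_text(td) for td in row.findall("td")]
        if cells and cells[0] == label:
            return cells[1:]
    return None

table = ET.fromstring(SAMPLE.strip())
print(values_for(table, "New Orders"))
```

Because the row is found by its label and not by its index, adding or reordering rows in the table should not require a code change.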
Another example, using BeautifulSoup:
from bs4 import BeautifulSoup
import requests
url = "https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30655&SSO=1"
content = requests.get(url).content
soup = BeautifulSoup(content, "lxml")
table = soup.find_all('table')[1]
table_body = table.find('tbody')
data= []
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])
print(data)
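Once the rows are in a list like data, one possible follow-up is to key them by their first cell, so each value is looked up by name instead of by row position. The helper rows_to_dict and the sample rows below are hypothetical and only mirror the shape of data; the numbers are placeholders, not figures from the report.

```python
def rows_to_dict(rows):
    """Map each row's first cell (its label) to the remaining cells."""
    table = {}
    for row in rows:
        if len(row) < 2:
            continue  # skip empty rows and rows that only contain a label
        label, values = row[0], row[1:]
        table[label] = values
    return table

# Made-up rows in the same shape as `data` from the snippet above.
sample = [[], ["New Orders", "60.0", "61.0"], ["Production", "58.0"], ["Note"]]
lookup = rows_to_dict(sample)
print(lookup["New Orders"])
```

If a label disappears in a later month, lookup["New Orders"] raises a KeyError, which makes the change visible instead of silently returning the wrong row.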
Answer 1: (score: 0)
from bs4 import BeautifulSoup
import urllib2
r = urllib2.urlopen('https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30655')
soup = BeautifulSoup(r)
soup.find('div', {'id': 'home_feature_container'}, 'h4')
This code does what the question asks for. If you use soup.find().contents, it creates a list of every item contained in the element.
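As for accounting for changes on the page, it really depends. If the changes are significant, you will have to change the soup.find() call. Otherwise, you can write code generic enough that it always applies. (For instance, if the div named home_feature_container always holds the featured content, you would never have to change it.)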
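To illustrate what "generic enough" could look like, here is a rough sketch that locates an element by its id alone, wherever it sits in the page, and collects the text inside it. It uses only the standard library's html.parser so that it runs without extra packages. The class TextById, the VOID_TAGS set and the SAMPLE markup are assumptions made for this example and are not taken from the real page.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

class TextById(HTMLParser):
    """Collect the text inside the first element that has the given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # 0 means we are outside the target element
        self.done = False   # set once the target element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif not self.done and dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())

# Hypothetical markup: the target div is nested differently than before.
SAMPLE = """
<div id="wrapper">
  <div class="moved-around">
    <div id="home_feature_container">
      <h4>Report title</h4>
      <p>First paragraph.<br>Second line.</p>
    </div>
  </div>
  <p>Footer text</p>
</div>
"""

parser = TextById("home_feature_container")
parser.feed(SAMPLE)
print(parser.chunks)
```

The lookup depends only on the id, so wrapping the div in extra containers or moving it elsewhere in the page would not affect it. It would still break if the id itself were renamed.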