I have been trying to extract the contents of a table on a website.
from bs4 import BeautifulSoup as bs
from selenium import webdriver

descriptions = []
sources = []
values = []
site = 'https://www.eia.gov/todayinenergy/prices.php'  # address of the site
driver = webdriver.Chrome(executable_path=r"chromedriver.exe")
driver.execute_script("document.body.style.zoom='100%'")
driver.get(site)
soup_1 = bs(driver.page_source, 'lxml')  # parse the rendered page with Beautiful Soup
tables = soup_1.find_all('tbody')  # table bodies of interest
print(len(tables))  # count the table bodies
for table in tables:
    rows = table.find_all('tr')
    print(len(rows))
    for row in rows:
        description = row.find('td', class_='s1')  # Tag object, or None if the row has no such cell
        descriptions.append(description)
        source = row.find('td', class_='s2')
        sources.append(source)
        value = row.find('td', class_='d1')  # find the cell that holds the data
        values.append(value)  # compile it all together
driver.close()
I have been trying to get clean text out of the table, but the extracted data looks like this:
<td class="s1" rowspan="3">Crude Oil<br/> ($/barrel)</td>
whereas all I want is 'Crude Oil ($/barrel)'.
When I try
description = row.find('td', class_='s1').text.renderContents()
descriptions.append(description)
I get this error:
AttributeError: 'NoneType' object has no attribute 'renderContents'
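Answer 0 (score: 0)
You can just use requests. While looping over the table rows, you can filter for the values you want by string-matching on the text of cells with particular class attributes. I assigned the two tables of interest to separate variables, each of which is a list of the rows in that table. Every table on the page has its own class identifier based on the table number, e.g. t1, t2, ...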
from bs4 import BeautifulSoup as bs
import requests

r = requests.get('https://www.eia.gov/todayinenergy/prices.php')
soup = bs(r.content, 'lxml')
table1 = soup.select('.t1 tr')
table2 = soup.select('.t2 tr')

for item in table1:
    if 'Crude Oil ($/barrel) - Nymex Apr' in item.text:
        rowInfo = [td.text for td in item.select('td')]
        print(rowInfo)
    elif 'Ethanol ($/gallon) - CBOT Apr' in item.text:
        rowInfo = [td.text for td in item.select('td')]
        print(rowInfo)

for item in table2:
    if len(item.select('td')) == 4:
        header = item.select_one('td.s1').text
    if item.select_one('td.s2'):
        if item.select_one('td.s2').text in ['WTI','Brent','Louisiana Light','Los Angeles'] and header in ['Crude Oil ($/barrel)','Gasoline (RBOB) ($/gallon)']:
            rowInfo = [td.text for td in item.select('td')]
            print(rowInfo)
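As for the AttributeError in the question: the s1 cell shown there carries rowspan="3", so the rows that follow it most likely have no s1 cell of their own, and row.find returns None for them. The sketch below is one way to guard against that and to get plain text from a cell that contains a <br/>. It runs on a small inline sample rather than the live page; the helper name cell_text, the sample rows, and the numbers in them are made up for illustration and are not taken from the site.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the cell shown in the question. The second row
# has no "s1" cell (as happens under a rowspan), and the numbers are
# placeholders, not real prices.
SAMPLE_HTML = """
<table><tbody>
<tr><td class="s1" rowspan="3">Crude Oil<br/> ($/barrel)</td><td class="s2">WTI</td><td class="d1">1.00</td></tr>
<tr><td class="s2">Brent</td><td class="d1">2.00</td></tr>
</tbody></table>
"""


def cell_text(row, css_class):
    """Return the whitespace-normalised text of the first matching <td>, or None."""
    cell = row.find('td', class_=css_class)
    if cell is None:  # row.find returns None when the row has no such cell
        return None
    # Join the text fragments around <br/> with a space, then collapse any
    # leftover runs of whitespace.
    return ' '.join(cell.get_text(' ', strip=True).split())


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
descriptions = [cell_text(row, 's1') for row in soup.find_all('tr')]
print(descriptions)
```

On the sample above this prints a list with the clean description for the first row and None for the second, so rows without a description can be skipped or filled in with the last header seen, which is what the header variable in the answer's second loop does.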