Python newbie here. I have a question about using Beautiful Soup to build a table from a scraped page. This is the code I am using:
import requests
page=requests.get("https://www.opensecrets.org/lobby/lobbyist.php?id=Y0000008510L&year=2018")
from bs4 import BeautifulSoup
soup=BeautifulSoup(page.content, 'lxml')
table=soup.find('table',{'id':'lobbyist_summary'})
for row in table:
    cells=row.find_all('a')
    rn=cells[0].get_text()
The error is:
AttributeError: 'NavigableString' object has no attribute 'find_all'
print(table) looks like this:
[<a href="firmsum.php?id=D000037635&year=2018">Ballard Partners</a>, <a href="clientsum.php?id=F203227&year=2018">Advanced Roofing Inc</a>, <a href="clientsum.php?id=F214670&year=2018">Africell Holding</a>, <a href="clientsum.php?id=D000023883&year=2018">Amazon.com</a>, ...]
Ultimately, I want a table that holds each element of interest in its own column, so that it looks like:
[[firmsum,D000037635,2018,Ballard Partners],[clientsum,F203227,2018,Advanced Roofing Inc],[clientsum,F214670,2018,Africell Holding],[clientsum,D000023883,2018,Amazon.com], ...]
Answer 0 (score: 0)
Create four empty lists, one per column:
col1List = list()
col2List = list()
col3List = list()
col4List = list()
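First, let's get the values for column 4:
# Take the second <tr> of the table and collect every link inside it
trs = table.find_all('tr')[1]
tds = trs.find_all('a')
# The text of each link is the firm or client name
for i in range(len(tds)):
    col4List.append(tds[i].get_text())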
This gives:
['Ballard Partners', 'Advanced Roofing Inc', 'Africell Holding',....]
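As for why the loop in the question raised the AttributeError: iterating directly over a Tag yields all of its children, and that includes the whitespace between tags, which Beautiful Soup represents as NavigableString objects that have no find_all. The sketch below illustrates this on a made-up, cut-down piece of markup (the HTML string is an assumption, not the real page), parsed with the built-in 'html.parser' so it runs without lxml.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical, cut-down markup standing in for the real page.
html = """
<table id="lobbyist_summary">
<tr><th>Firm / Client</th></tr>
<tr><td><a href="firmsum.php?id=D000037635&amp;year=2018">Ballard Partners</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"id": "lobbyist_summary"})

# Iterating over a Tag yields its direct children, text nodes included.
children = list(table)
first_child_is_text = isinstance(children[0], NavigableString)
print(type(children[0]).__name__, repr(children[0]))

# Asking for the rows explicitly returns Tag objects only,
# so calling find_all on each of them is safe.
rows = table.find_all("tr")
names = [a.get_text() for row in rows for a in row.find_all("a")]
print(names)
```

Using table.find_all('tr'), as in the code above, avoids the text nodes entirely.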
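Now, let's extract the values for the first 3 columns from each href:
hrefVal = trs.find_all('a')
for i in hrefVal:
    hVal = i.get('href')
    # 'firmsum.php?id=D000037635&year=2018' -> ['firmsum', 'D000037635&year=2018']
    col11 = hVal.split('.php?id=', 1)
    col1 = col11[0]
    col1List.append(col1)
    # 'D000037635&year=2018' -> ['D000037635', 'year=2018']
    col22 = col11[1].split('&', 1)
    col2 = col22[0]
    col2List.append(col2)
    # 'year=2018' -> ['year', '2018']
    col33 = col22[1].split('=', 1)
    col3 = col33[1]
    col3List.append(col3)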
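If you would rather not depend on the exact position of '.php?id=' and '&' in the string, the same three values can probably be pulled out with the standard library's urllib.parse. This is only a sketch of an alternative to the split calls above; split_href is a hypothetical helper name, and it assumes every href carries both an id and a year parameter.

```python
from urllib.parse import urlsplit, parse_qs

def split_href(href):
    """Return [page, id, year] for an href like 'firmsum.php?id=...&year=...'.

    Hypothetical helper: assumes both 'id' and 'year' are present in the query.
    """
    parts = urlsplit(href)
    page = parts.path.rsplit(".", 1)[0]   # 'firmsum.php' -> 'firmsum'
    query = parse_qs(parts.query)         # {'id': [...], 'year': [...]}
    return [page, query["id"][0], query["year"][0]]

print(split_href("firmsum.php?id=D000037635&year=2018"))
```

A missing parameter would raise a KeyError here, which at least makes a malformed link obvious.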
Now, let's put all the lists into a DataFrame so the result looks tidy:
import pandas as pd
df = pd.DataFrame()
df['Col1'] = col1List
df['Col2'] = col2List
df['Col3'] = col3List
df['Col4'] = col4List
If I print the first few rows, it looks like what you want:
Col1       Col2        Col3  Col4
firmsum    D000037635  2018  Ballard Partners
clientsum  F203227     2018  Advanced Roofing Inc
clientsum  F214670     2018  Africell Holding
clientsum  D000023883  2018  Amazon.com
clientsum  D000000192  2018  American Health Care Assn
clientsum  D000021839  2018  American Road & Transport Builders Assn
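If you want the plain list of lists from the question rather than a DataFrame, zipping the four column lists together should be enough. The sketch below uses just the first two rows shown above as sample data; in the real script the lists would already be filled by the loops.

```python
# Sample values copied from the first two rows of the output above.
col1List = ["firmsum", "clientsum"]
col2List = ["D000037635", "F203227"]
col3List = ["2018", "2018"]
col4List = ["Ballard Partners", "Advanced Roofing Inc"]

# zip pairs up the i-th element of every column; list() turns each tuple into a row.
rows = [list(row) for row in zip(col1List, col2List, col3List, col4List)]
print(rows)
```

If the DataFrame is already built, df.values.tolist() should give the same nested list.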