Creating a table in Python with BeautifulSoup

Time: 2018-09-06 22:12:21

Tags: python parsing beautifulsoup

I'm new to Python, and I have a question about building a table from scraped data with Beautiful Soup. This is the code I'm using:

import requests
page=requests.get("https://www.opensecrets.org/lobby/lobbyist.php?id=Y0000008510L&year=2018")
from bs4 import BeautifulSoup
soup=BeautifulSoup(page.content, 'lxml')
table=soup.find('table',{'id':'lobbyist_summary'})
for row in table:
    cells=row.find_all('a')
    rn=cells[0].get_text()

The error is:

AttributeError: 'NavigableString' object has no attribute 'find_all'

print(table) looks like this:

[<a href="firmsum.php?id=D000037635&amp;year=2018">Ballard Partners</a>, <a href="clientsum.php?id=F203227&amp;year=2018">Advanced Roofing Inc</a>, <a href="clientsum.php?id=F214670&amp;year=2018">Africell Holding</a>, <a href="clientsum.php?id=D000023883&amp;year=2018">Amazon.com</a>, ...]

(Eventually) I want to end up with a table that holds each element of interest in its own column, so that it looks like:

[[firmsum,D000037635,2018,Ballard Partners],[clientsum,F203227,2018,Advanced Roofing Inc],[clientsum,F214670,2018,Africell Holding],[clientsum,D000023883,2018,Amazon.com], ...]

1 answer:

Answer 0 (score: 0)

Allocate 4 empty lists, one per column:

col1List = list()
col2List = list()
col3List = list()
col4List = list()

First, let's get the values for column 4 (the link text):

trs = table.find_all('tr')[1]
tds = trs.find_all('a')

for i in range(len(tds)):
    col4List.append(tds[i].get_text())

This gives:

['Ballard Partners', 'Advanced Roofing Inc', 'Africell Holding',....]
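
As background for the AttributeError in the question: iterating directly over a Tag walks its children, and those include the whitespace between tags as NavigableString objects, which have no find_all method. Calling find_all on the tag, as above, only ever returns Tag objects. A minimal sketch, assuming a small made-up HTML snippet (not the real page) and the built-in html.parser so that it runs offline:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical stand-in for the real page; only the structure matters here.
html = """
<table id="lobbyist_summary">
<tr><td><a href="firmsum.php?id=D000037635&amp;year=2018">Ballard Partners</a></td></tr>
<tr><td><a href="clientsum.php?id=F203227&amp;year=2018">Advanced Roofing Inc</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"id": "lobbyist_summary"})

# Direct iteration yields every child node, including newline text nodes.
child_types = {type(child).__name__ for child in table}

# find_all returns only Tag objects, so .get_text() and .get() are safe.
links = table.find_all("a")
names = [a.get_text() for a in links]
```

Here child_types contains both 'Tag' and 'NavigableString', which is why the loop in the question fails on the first text node it meets.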

Now let's extract the values for the first 3 columns from each href:

hrefVal = trs.find_all('a')

for i in hrefVal:
    hVal = i.get('href')
    col11 = hVal.split('.php?id=', 1)
    col1 = col11[0]
    col1List.append(col1)
    col22 = col11[1].split('&', 1)
    col2 = col22[0]
    col2List.append(col2)
    col33 = col22[1].split('=', 1)
    col3 = col33[1]
    col3List.append(col3)
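
The split chain above depends on every href having exactly the shape name.php?id=...&year=.... As a possible alternative, the standard library's urllib.parse can pull out the same three pieces by name. The helper name parse_href is made up for this sketch, and it assumes every link carries both an id and a year parameter:

```python
from urllib.parse import urlsplit, parse_qs

def parse_href(href):
    """Split e.g. 'firmsum.php?id=D000037635&year=2018' into (page, id, year).

    Hypothetical helper; assumes 'id' and 'year' are always present.
    """
    parts = urlsplit(href)
    page = parts.path.rsplit(".", 1)[0]   # 'firmsum.php' -> 'firmsum'
    query = parse_qs(parts.query)         # {'id': [...], 'year': [...]}
    return page, query["id"][0], query["year"][0]

row = parse_href("firmsum.php?id=D000037635&year=2018")
```

A missing parameter raises KeyError here instead of silently producing a misaligned column, which may or may not be what you want.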

Now let's put all the lists into a DataFrame so the result looks tidy:

import pandas as pd

df = pd.DataFrame()
df['Col1'] = col1List
df['Col2'] = col2List
df['Col3'] = col3List
df['Col4'] = col4List

If I print the first few rows, it looks like what you wanted:

Col1        Col2        Col3    Col4
firmsum     D000037635  2018    Ballard Partners
clientsum   F203227     2018    Advanced Roofing Inc
clientsum   F214670     2018    Africell Holding
clientsum   D000023883  2018    Amazon.com
clientsum   D000000192  2018    American Health Care Assn
clientsum   D000021839  2018    American Road & Transport Builders Assn
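
If the nested-list shape from the question is preferred over a DataFrame, the four column lists can be combined row by row with zip. A small sketch, using the first few values shown above as sample data in place of the scraped lists:

```python
# Sample values copied from the output above; in practice these are the
# lists filled in by the loops earlier in this answer.
col1List = ["firmsum", "clientsum", "clientsum"]
col2List = ["D000037635", "F203227", "F214670"]
col3List = ["2018", "2018", "2018"]
col4List = ["Ballard Partners", "Advanced Roofing Inc", "Africell Holding"]

# zip pairs up the i-th entry of each list; list() turns each tuple into a row.
rows = [list(row) for row in zip(col1List, col2List, col3List, col4List)]
```

Note that zip stops at the shortest list, so the four lists should have the same length.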