Scraping table data with BeautifulSoup in Python

Time: 2020-05-20 13:03:35

Tags: python web-scraping beautifulsoup

Why am I not getting all the rows when extracting table data with BeautifulSoup in Python?

Link to the website: http://www.fao.org/3/x0490e/x0490e04.htm

table1_rows = table1.find_all('tr')

for tr in table1_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)

Output of above code

print(row)
row = [item.strip() for item in row if str(item)]
row

But I'm getting this output

After making some changes:

for tr in table1_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    row = [item.strip() for item in row if str(item)]
    print(row)

I still don't get the output. Can anyone help me? Why don't I get the output when I print the row variable outside the loop?

Output

2 answers:

Answer 0 (score: 0)

This line:

row = [item.strip() for item in row if str(item)]

should sit inside the for tr in table1_rows loop:

for tr in table1_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    row = [item.strip() for item in row if str(item)]
    print(row)
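
One detail about the comprehension above: `if str(item)` is evaluated before `strip()`, so it only drops cells that are already empty, and a cell containing nothing but whitespace ends up as an empty string in the result. A minimal sketch of the difference, using a hand-written list (`raw_cells`, a made-up stand-in for `[i.text for i in td]`) rather than the real page:

```python
# Hypothetical cell text standing in for [i.text for i in td]
raw_cells = ["  1 mm day-1 ", "", "\r\n   ", " 10 "]

# As in the answer: filter first, then strip
filter_then_strip = [item.strip() for item in raw_cells if str(item)]

# Possible alternative: strip first, then drop whatever is empty afterwards
strip_then_filter = [s for s in (item.strip() for item in raw_cells) if s]

print(filter_then_strip)   # ['1 mm day-1', '', '10']
print(strip_then_filter)   # ['1 mm day-1', '10']
```

Dropping empty cells shifts the remaining values to the left, so if the column position matters it may be safer to keep the empty strings.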

Edit: to collect all the rows:

all_rows=[]
for tr in table1_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    row = [item.strip() for item in row if str(item)]
    all_rows.append(row)

for row in all_rows:
    print(row)
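
To see why printing `row` after the loop shows only one row, here is a small sketch that needs no network access. The nested list `fake_rows` is made up and stands in for the `td` text of each `tr`; the point is only that `row` is rebound on every iteration, while `all_rows` accumulates.

```python
# Made-up data standing in for the text of the <td> cells in each <tr>
fake_rows = [
    [" a1 ", " b1 "],
    [" a2 ", " b2 "],
    [" a3 ", " b3 "],
]

all_rows = []
for cells in fake_rows:
    row = [item.strip() for item in cells if str(item)]
    all_rows.append(row)  # keep every row, not just the current one

# After the loop, `row` still refers to the last row only
print(row)        # ['a3', 'b3']
print(all_rows)   # [['a1', 'b1'], ['a2', 'b2'], ['a3', 'b3']]
```

The same applies across Jupyter cells: a later cell that references `row` sees whatever value the loop assigned last.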

Edit 2: If the end goal is to get the table data into a DataFrame, that is a simple job (this replaces the for-loop approach):

df=pd.read_html(url)[0]

You obviously need to import pandas first:

import pandas as pd

Output:

print(df)

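Answer 1 (score: 0)

It looks like you are already past the end of the loop in the next Jupyter cell. The table is also formatted a bit oddly, so I went ahead and did this to get the data and column headers as a nested list of dictionaries: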

import requests
import pandas as pd
import pprint
from bs4 import BeautifulSoup


url = 'http://www.fao.org/3/x0490e/x0490e04.htm'
response = requests.get(url)
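# name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(response.content, 'html.parser')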

table = soup.find('table')

def clean(text):
    return text.replace('\r', '').replace('\n', '').replace('  ', '').strip()

# get the column headers
headers = [clean(col.text)
           for col in table.find_all('tr')[1].find_all('td')]
# set the first column to 'name' because it is blank
headers.insert(0, 'name') 

# get the data rows and zip them to the column headers
data = [{col[0]: clean(col[1].text)
         for col in zip(headers, row.find_all('td'))}
        for row in table.find_all('tr')[2::]]

data_list = [headers] + [list(row.values()) for row in data]

# print to list of lists
pprint.pprint(data_list)
# pretty print to json
import json
print(json.dumps(data, indent=4))
# print to dataframe
df = pd.DataFrame(data)
print(df)

Output:

[['name', 'mm day-1', 'm3 ha-1 day-1', 'l s-1 ha-1', 'MJ m-2 day-1'],
 ['1 mm day-1', '1', '10', '0.116', '2.45'],
 ['1 m3 ha-1 day-1', '0.1', '1', '0.012', '0.245'],
 ['1 l s-1 ha-1', '8.640', '86.40', '1', '21.17'],
 ['1 MJ m-2 day-1', '0.408', '4.082', '0.047', '1']]
[
    {
        "name": "1 mm day-1",
        "mm day-1": "1",
        "m3 ha-1 day-1": "10",
        "l s-1 ha-1": "0.116",
        "MJ m-2 day-1": "2.45"
    },
    {
        "name": "1 m3 ha-1 day-1",
        "mm day-1": "0.1",
        "m3 ha-1 day-1": "1",
        "l s-1 ha-1": "0.012",
        "MJ m-2 day-1": "0.245"
    },
    {
        "name": "1 l s-1 ha-1",
        "mm day-1": "8.640",
        "m3 ha-1 day-1": "86.40",
        "l s-1 ha-1": "1",
        "MJ m-2 day-1": "21.17"
    },
    {
        "name": "1 MJ m-2 day-1",
        "mm day-1": "0.408",
        "m3 ha-1 day-1": "4.082",
        "l s-1 ha-1": "0.047",
        "MJ m-2 day-1": "1"
    }
]
              name mm day-1 m3 ha-1 day-1 l s-1 ha-1 MJ m-2 day-1
0       1 mm day-1        1            10      0.116         2.45
1  1 m3 ha-1 day-1      0.1             1      0.012        0.245
2     1 l s-1 ha-1    8.640         86.40          1        21.17
3   1 MJ m-2 day-1    0.408         4.082      0.047            1
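
A note on `clean`: `replace('  ', '')` deletes every pair of spaces outright, so two words separated only by a line break or by a double space can end up glued together. The output above suggests this is not a problem for the FAO table, but if it matters for other tables, one possible alternative is to collapse every run of whitespace to a single space. The helper name `clean_ws` and the sample string are made up for illustration:

```python
def clean(text):
    # The helper from the answer, unchanged
    return text.replace('\r', '').replace('\n', '').replace('  ', '').strip()

def clean_ws(text):
    # Hypothetical alternative: split on any whitespace, rejoin with single spaces
    return ' '.join(text.split())

sample = '  1 mm\r\nday-1  '
print(repr(clean(sample)))     # '1 mmday-1'
print(repr(clean_ws(sample)))  # '1 mm day-1'
```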

My df output

     MJ m-2 day-1 l s-1 ha-1 m3 ha-1 day-1 mm day-1             name
    0         2.45      0.116            10        1       1 mm day-1
    1        0.245      0.012             1      0.1  1 m3 ha-1 day-1
    2        21.17          1         86.40    8.640     1 l s-1 ha-1
    3            1      0.047         4.082    0.408   1 MJ m-2 day-1
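
The columns in this last output come out in alphabetical order rather than page order, which is what one would expect if the dictionaries did not preserve insertion order (for example on Python versions before 3.7, where that ordering is not guaranteed; this is an assumption about the environment, not something stated in the thread). A sketch of building the rows in an explicit order instead of relying on `row.values()`, using a shortened, hand-copied version of `headers` and `data`:

```python
# Shortened copies of the answer's `headers` and `data`, typed in by hand
headers = ['name', 'mm day-1', 'm3 ha-1 day-1']
data = [
    # Keys deliberately out of order to mimic an unordered dict
    {'mm day-1': '1', 'name': '1 mm day-1', 'm3 ha-1 day-1': '10'},
    {'m3 ha-1 day-1': '1', 'mm day-1': '0.1', 'name': '1 m3 ha-1 day-1'},
]

# Look each value up by header so the column order never depends on the dict
data_list = [headers] + [[row[h] for h in headers] for row in data]

for line in data_list:
    print(line)
```

With pandas, `pd.DataFrame(data, columns=headers)` should pin the column order in the same way.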