Why am I not getting all the rows when extracting table data with BeautifulSoup in Python?
Link to the website: http://www.fao.org/3/x0490e/x0490e04.htm
table1_rows = table1.find_all('tr')
for tr in table1_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)
print(row)
row = [item.strip() for item in row if str(item)]
row
After making some changes:
for tr in table1_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
row = [item.strip() for item in row if str(item)]
print(row)
I still get no output. Can anyone help me? Why is nothing printed when I print the row variable outside the loop?
Answer 0 (score: 0)
This line:
row = [item.strip() for item in row if str(item)]
needs to be inside the for tr in table1_rows
loop:
for tr in table1_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    row = [item.strip() for item in row if str(item)]
    print(row)
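The placement matters because row is rebound on every pass through the loop, so code outside the loop only ever sees the value from the last <tr>. A minimal plain-Python sketch of that behavior follows; the `fake_rows` list is hypothetical and merely stands in for the cell text of each parsed table row.

```python
# Hypothetical stand-in for the cell text of each <tr>; the last row has no cells.
fake_rows = [[" a ", " b "], [" c ", " d "], []]

for cells in fake_rows:
    row = [text.strip() for text in cells]  # rebound on every iteration

# Outside the loop, only the value from the final iteration survives.
last_row = row
print(last_row)
```

If the final <tr> of a table happens to contain no <td> cells, the leftover value is an empty list, which can easily look like no output at all.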
Edit: To collect all the rows:
all_rows = []
for tr in table1_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    row = [item.strip() for item in row if str(item)]
    all_rows.append(row)
for row in all_rows:
    print(row)
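For anyone who wants to try the collecting pattern without fetching the page, here is a self-contained sketch. The inline HTML and the `collect_rows` helper are made up for illustration, and skipping rows that have no <td> cells is an added assumption, not part of the code above.

```python
from bs4 import BeautifulSoup

# Hypothetical inline table; the real FAO page is not fetched here.
html = """
<table>
  <tr><th>unit</th><th>value</th></tr>
  <tr><td> mm day-1 </td><td> 1 </td></tr>
  <tr><td> MJ m-2 day-1 </td><td> 2.45 </td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")


def collect_rows(table):
    """Return one list of stripped cell strings per <tr> that has <td> cells."""
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # skip header-only or empty rows
            rows.append(cells)
    return rows


all_rows = collect_rows(soup.find("table"))
print(all_rows)
```

Because the list is built inside the loop and only read afterwards, every data row is still available once the loop has finished.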
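Edit 2: If the end goal is to get the table data into a DataFrame, that is a simple job (this replaces the for-loop approach; url is the page address linked in the question):
df = pd.read_html(url)[0]
You obviously need to import pandas first:
import pandas as pd
Output: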
print(df)
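Answer 1 (score: 0)
It looks like you are already past the end of the loop in the next Jupyter cell. The table is also formatted a bit oddly, so I went ahead and wrote this to get the data and the column headers as a list of dictionaries: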
import requests
import pandas as pd
import pprint
from bs4 import BeautifulSoup
url = 'http://www.fao.org/3/x0490e/x0490e04.htm'
response = requests.get(url)
soup = BeautifulSoup(response.content)
table = soup.find('table')
def clean(text):
    # drop line breaks and runs of double spaces; single spaces inside labels are kept
    return text.replace('\r', '').replace('\n', '').replace('  ', '').strip()
# get the column headers
headers = [clean(col.text)
           for col in table.find_all('tr')[1].find_all('td')]
# set the first column to 'name' because it is blank
headers.insert(0, 'name')
# get the data rows and zip them to the column headers
data = [{col[0]: clean(col[1].text)
         for col in zip(headers, row.find_all('td'))}
        for row in table.find_all('tr')[2:]]
data_list = [headers] + [list(row.values()) for row in data]
# print to list of lists
pprint.pprint(data_list)
# pretty print to json
import json
print(json.dumps(data, indent=4))
# print to dataframe
df = pd.DataFrame(data)
print(df)
Output:
[['name', 'mm day-1', 'm3 ha-1 day-1', 'l s-1 ha-1', 'MJ m-2 day-1'],
 ['1 mm day-1', '1', '10', '0.116', '2.45'],
 ['1 m3 ha-1 day-1', '0.1', '1', '0.012', '0.245'],
 ['1 l s-1 ha-1', '8.640', '86.40', '1', '21.17'],
 ['1 MJ m-2 day-1', '0.408', '4.082', '0.047', '1']]
[
    {
        "name": "1 mm day-1",
        "mm day-1": "1",
        "m3 ha-1 day-1": "10",
        "l s-1 ha-1": "0.116",
        "MJ m-2 day-1": "2.45"
    },
    {
        "name": "1 m3 ha-1 day-1",
        "mm day-1": "0.1",
        "m3 ha-1 day-1": "1",
        "l s-1 ha-1": "0.012",
        "MJ m-2 day-1": "0.245"
    },
    {
        "name": "1 l s-1 ha-1",
        "mm day-1": "8.640",
        "m3 ha-1 day-1": "86.40",
        "l s-1 ha-1": "1",
        "MJ m-2 day-1": "21.17"
    },
    {
        "name": "1 MJ m-2 day-1",
        "mm day-1": "0.408",
        "m3 ha-1 day-1": "4.082",
        "l s-1 ha-1": "0.047",
        "MJ m-2 day-1": "1"
    }
]
              name  mm day-1  m3 ha-1 day-1  l s-1 ha-1  MJ m-2 day-1
0       1 mm day-1         1             10       0.116          2.45
1  1 m3 ha-1 day-1       0.1              1       0.012         0.245
2     1 l s-1 ha-1     8.640          86.40           1         21.17
3   1 MJ m-2 day-1     0.408          4.082       0.047             1
My df output:
   MJ m-2 day-1  l s-1 ha-1  m3 ha-1 day-1  mm day-1             name
0          2.45       0.116             10         1       1 mm day-1
1         0.245       0.012              1       0.1  1 m3 ha-1 day-1
2         21.17           1          86.40     8.640     1 l s-1 ha-1
3             1       0.047          4.082     0.408   1 MJ m-2 day-1
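The two DataFrame printouts above hold the same values and differ only in column order. How a DataFrame orders the keys of a list of dictionaries can depend on the pandas and Python versions in use, so if the order matters it may be safer to pin it explicitly. A small sketch under that assumption, with hypothetical `headers` and `records` values shaped like the headers and data lists above:

```python
import pandas as pd

# Hypothetical records in the same shape as the list of dicts built above.
headers = ["name", "mm day-1", "MJ m-2 day-1"]
records = [
    {"name": "1 mm day-1", "mm day-1": "1", "MJ m-2 day-1": "2.45"},
    {"name": "1 MJ m-2 day-1", "mm day-1": "0.408", "MJ m-2 day-1": "1"},
]

# Passing columns= fixes the column order instead of relying on dict ordering.
df = pd.DataFrame(records, columns=headers)
print(list(df.columns))
```

With the order pinned this way, the name column stays first regardless of how the dictionaries were constructed.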