I've written a script in Python to parse some data from a webpage and write it to a CSV file via pandas. So far what I've written can parse all the available tables on that page, but when writing to the CSV file it ends up with only the last table from the page in the file. Of course, the data is being overwritten because of the loop. How can I fix this flaw so that my scraper writes the data from all the different tables rather than just the last one? Thanks in advance.
import csv
import requests
from bs4 import BeautifulSoup
import pandas as pd

res = requests.get('http://www.espn.com/nba/schedule/_/date/20171001').text
soup = BeautifulSoup(res, "lxml")
for table in soup.find_all("table"):
    df = pd.read_html(str(table))[0]
    df.to_csv("table_item.csv")  # overwrites the file on every iteration
    print(df)
By the way, I'd like to write the data to the CSV file using pandas only. Thanks again.
Answer 0 (score: 1)
You can use read_html directly on the URL. It returns a list of DataFrames, so they need to be joined with concat:
dfs = pd.read_html('http://www.espn.com/nba/schedule/_/date/20171001')
df = pd.concat(dfs, ignore_index=True)
#if necessary rename columns
d = {'Unnamed: 1':'a', 'Unnamed: 7':'b'}
df = df.rename(columns=d)
print(df.head())
matchup a time (ET) nat tv away tv home tv \
0 Atlanta ATL Miami MIA NaN NaN NaN NaN
1 LA LAC Toronto TOR NaN NaN NaN NaN
2 Guangzhou Guangzhou Washington WSH NaN NaN NaN NaN
3 Charlotte CHA Boston BOS NaN NaN NaN NaN
4 Orlando ORL Memphis MEM NaN NaN NaN NaN
tickets b
0 2,401 tickets available from $6 NaN
1 284 tickets available from $29 NaN
2 2,792 tickets available from $2 NaN
3 2,908 tickets available from $6 NaN
4 1,508 tickets available from $3 NaN
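The rename step above can be shown on a tiny stand-in frame (the `Unnamed: …` column names are illustrative of what read_html produces for unlabeled header cells):

```python
import pandas as pd

# Stand-in frame with the auto-generated column names read_html emits
df = pd.DataFrame([[1, 2]], columns=["Unnamed: 1", "Unnamed: 7"])
df = df.rename(columns={"Unnamed: 1": "a", "Unnamed: 7": "b"})
print(list(df.columns))  # ['a', 'b']
```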
Finally, use to_csv to write to the file:
df.to_csv("table_item.csv", index=False)
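As a minimal, self-contained sketch of the concat step (using two small made-up DataFrames in place of the scraped tables), pd.concat with ignore_index=True stacks the frames and renumbers the rows:

```python
import pandas as pd

# Two stand-in DataFrames, in place of the tables parsed from the page
df1 = pd.DataFrame({"matchup": ["Atlanta", "LA"], "time (ET)": ["7:30 PM", "10:00 PM"]})
df2 = pd.DataFrame({"matchup": ["Orlando"], "time (ET)": ["8:00 PM"]})

# ignore_index=True discards each frame's original row labels and renumbers 0..n-1
df = pd.concat([df1, df2], ignore_index=True)
print(list(df.index))  # [0, 1, 2]
```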
EDIT:

For learning purposes, you can append each DataFrame to a list and then concat them:
import requests
from bs4 import BeautifulSoup
import pandas as pd

res = requests.get('http://www.espn.com/nba/schedule/_/date/20171001').text
soup = BeautifulSoup(res, "lxml")
dfs = []
for table in soup.find_all("table"):
    df = pd.read_html(str(table))[0]
    dfs.append(df)

df = pd.concat(dfs, ignore_index=True)
print(df)
df.to_csv("table_item.csv")
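An alternative to collecting the frames in a list is to append to the CSV inside the loop with to_csv(mode='a'), writing the header row only once. A hedged sketch with stand-in DataFrames (the frames and filename here are illustrative, not the scraped data):

```python
import os
import pandas as pd

# Stand-ins for the per-table DataFrames produced inside the scraping loop
tables = [
    pd.DataFrame({"matchup": ["Atlanta"], "time (ET)": ["7:30 PM"]}),
    pd.DataFrame({"matchup": ["Orlando"], "time (ET)": ["8:00 PM"]}),
]

out = "table_item.csv"
if os.path.exists(out):
    os.remove(out)  # start fresh so repeated runs don't keep appending

for i, df in enumerate(tables):
    # mode='a' appends to the file; write the header only for the first table
    df.to_csv(out, mode="a", header=(i == 0), index=False)
```

This avoids holding all tables in memory at once, at the cost of losing the combined DataFrame for further processing.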