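I'm importing tables from multiple URLs and want to build a single DataFrame from them, then store it as a CSV file. I'm struggling to remove the repeated description rows from the tables, and I can't manipulate the DataFrame dfmaster after creating it.
Perhaps pd.read_html imports the data as a list rather than a DataFrame?
I tried looping over the incoming table with: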
for item in df:
    if item not in dfmaster:
        dfmaster.append(item)
print(dfmaster)
But this still seems to list the offending duplicate rows.
I also tried df.drop[0] before appending to dfmaster, and drop.duplicates.
producturls = ['https://www.interactivebrokers.com/en/index.php?f=2222&exch=ecbot&showcategories=FUTGRP',
               'https://www.interactivebrokers.com/en/index.php?f=2222&exch=cfe&showcategories=FUTGRP',
               'https://www.interactivebrokers.com/en/index.php?f=2222&exch=dtb&showcategories=FUTGRP&p=&cc=&limit=100&page=2'
               ]
dfmaster = []
for url in producturls:
    table = pd.read_html(url, index_col=None, header=None,)
    df = table[2]
    for item in df:
        if item not in dfmaster:
            dfmaster.append(item)
    print(dfmaster)
dfmaster.to_csv('IB_tickers.csv')
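The output should stitch all the table data from the sites into a single DataFrame, without the repeated description header rows, and then save it as a readable CSV file.
Thanks very much for taking a look.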
Answer 0 (score: 0)
This should work for you:
import pandas as pd
from tabulate import tabulate

producturls = ['https://www.interactivebrokers.com/en/index.php?f=2222&exch=ecbot&showcategories=FUTGRP',
               'https://www.interactivebrokers.com/en/index.php?f=2222&exch=cfe&showcategories=FUTGRP',
               'https://www.interactivebrokers.com/en/index.php?f=2222&exch=dtb&showcategories=FUTGRP&p=&cc=&limit=100&page=2'
               ]

df_list = []
for url in producturls:
    # pd.read_html returns a list of DataFrames, one per table found on the page
    table = pd.read_html(url, index_col=None, header=None)
    df = table[2]  # take the third table found on each page
    df_list.append(df)

# Stitch the per-page tables together, then drop exact duplicate rows
dfmaster = pd.concat(df_list, sort=False)
dfmaster = dfmaster.drop_duplicates().reset_index(drop=True)
print(tabulate(dfmaster.head(), headers='keys'))
dfmaster.to_csv('IB_tickers.csv')
Result:
    IB Symbol    Product Description                             Symbol    Currency
                 (click link for more details)
--  -----------  ----------------------------------------------  --------  ----------
 0  AC           Ethanol -CME                                    EH        USD
 1  AIGCI        Bloomberg Commodity Index                       AW        USD
 2  B1U          30-Year Deliverable Interest Rate Swap Futures  B1U       USD
 3  DJUSRE       Dow Jones US Real Estate Index                  RX        USD
 4  F1U          5-Year Deliverable Interest Rate Swap Futures   F1U       USD
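
On the question of why the original loop misbehaved: pd.read_html returns a list of DataFrames, and iterating over a single DataFrame yields its column labels rather than its rows, so the list in the question ended up holding header names (and a plain list has no to_csv method). Here is a minimal offline sketch of both points. The two small frames and the stitch_tables helper are hypothetical stand-ins invented for illustration, not data scraped from the site.

```python
import pandas as pd

# Hypothetical stand-ins for two tables returned by pd.read_html;
# the rows are made up for illustration, not scraped data.
first = pd.DataFrame({"IB Symbol": ["AC", "AIGCI"], "Currency": ["USD", "USD"]})
second = pd.DataFrame({"IB Symbol": ["AIGCI", "B1U"], "Currency": ["USD", "USD"]})

# Iterating over a DataFrame yields its column labels, not its rows,
# so a loop like the one in the question only ever sees header names.
column_labels = [item for item in first]


def stitch_tables(frames):
    """Concatenate DataFrames, drop exact duplicate rows, renumber the index."""
    combined = pd.concat(frames, ignore_index=True, sort=False)
    return combined.drop_duplicates().reset_index(drop=True)


stitched = stitch_tables([first, second])
print(column_labels)
print(stitched)
```

If the numeric index column is not wanted in the file, passing index=False to to_csv leaves it out.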