I am reading about the pandas read_html function because I am extracting some tables from the web. When I run this:
import pandas as pd
url_mcc = 'link.com.html'
dfs = pd.read_html(url_mcc)
dfs
I get the following list:
[ Presentation \
0 0.4 mg/mL, 1 mL single-dose vial, package of 2...
1 1 mg/mL, 1 mL single-dose vial, package of 25 ...
Availability and Estimated Shortage Duration \
0 Available for NDC 00517-0401-25.
1 Available
Related Information \
0 American Regent is currently releasing the 0.4...
1 American Regent is currently releasing the 1mg...
Shortage Reason (per FDASIA)
0 Demand increase for the drug
1 Other ,
Presentation \
0 0.1 mg/mL; 10 mL Luer-Jet Prefilled Syringe (N...
Availability and Estimated Shortage Duration Related Information \
0 Product available NaN
Shortage Reason (per FDASIA)
0 Demand increase for the drug ,
Presentation \
0 0.1 mg/mL; 10 mL Ansyr syringe (NDC 0409-1630-10)
1 0.05 mg/mL; 5 mL Ansyr syringe (NDC 0409-9630-05)
2 0.1 mg/mL; 5 mL Lifeshield syringe (NDC 0409-4...
3 0.1 mg/mL; 10 mL Lifeshield syringe (NDC 0409-...
Availability and Estimated Shortage Duration \
0 Next delivery: Late October. Estimated recover...
1 Next delivery: TBD Estimated recovery: TBD
2 Available
3 Available
Related Information \
0 Please check with your wholesaler for availabl...
1 Please check with your wholesaler for availabl...
2 Shortage per Manufacturer: Available
3 Shortage per Manufacturer: Available
Shortage Reason (per FDASIA)
0 Other
1 Other
2 Other
3 Other ,
Presentation \
0 0.4 mg/mL, 20 mL vial (NDC 0641-6006-10)
Availability and Estimated Shortage Duration \
0 West-Ward has available inventory.
Related Information \
0 Additional lots are scheduled to be manufactur...
Shortage Reason (per FDASIA)
0 Demand increase for the drug ]
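As you can see, the list (or table?) contains the same columns repeated: Presentation, Availability and Estimated Shortage Duration, Related Information, and Shortage Reason (per FDASIA), because the website has several different tables with the same columns. So my question is: how can I flatten or normalize all of these tables or lists into a single table, more or less like this: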
[ Presentation \
0 0.4 mg/mL, 1 mL single-dose vial, package of 2...
1 1 mg/mL, 1 mL single-dose vial, package of 25 ...
2 1 mg/mL; 10 mL Luer-Jet Prefilled Syringe (N...
3 0.1 mg/mL; 10 mL Ansyr syringe (NDC 0409-1630-10)
4 0.05 mg/mL; 5 mL Ansyr syringe (NDC 0409-9630-05)
5 0.1 mg/mL; 5 mL Lifeshield syringe (NDC 0409-4...
6 0.1 mg/mL; 10 mL Lifeshield syringe (NDC 0409-...
Availability and Estimated Shortage Duration \
0 Available for NDC 00517-0401-25.
1 Available
2 Product available NaN
0 Next delivery: Late October. Estimated recover...
1 Next delivery: TBD Estimated recovery: TBD
2 Available
3 Available
0 0.4 mg/mL, 20 mL vial (NDC 0641-6006-10)
Availability and Estimated Shortage Duration \
0 West-Ward has available inventory.
Shortage Reason (per FDASIA)
0 Demand increase for the drug
Related Information \
0 American Regent is currently releasing the 0.4...
1 American Regent is currently releasing the 1mg...
0 Please check with your wholesaler for availabl...
1 Please check with your wholesaler for availabl...
2 Shortage per Manufacturer: Available
3 Shortage per Manufacturer: Available
0 Additional lots are scheduled to be manufactur...
Shortage Reason (per FDASIA)
0 Demand increase for the drug
1 Other ,
Shortage Reason (per FDASIA)
0 Demand increase for the drug ,
0 Other
1 Other
2 Other
3 Other ,
Answer 0 (score: 2)
If dfs is a list of DataFrames, I think you need concat:
df = pd.concat(dfs)
You can also pass the parameter ignore_index=True to avoid duplicate index values:
df = pd.concat(dfs, ignore_index=True)
Sample:
df1 = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9]})
#print (df1)
df2 = pd.DataFrame({'A':[3,4,6],
'B':[2,3,4],
'C':[3,6,0]})
#print (df2)
df3 = pd.DataFrame({'A':[4,7,9],
'B':[3,4,5],
'C':[5,1,9]})
#print (df3)
dfs = [df1,df2,df3]
print (dfs)
[ A B C
0 1 4 7
1 2 5 8
2 3 6 9, A B C
0 3 2 3
1 4 3 6
2 6 4 0, A B C
0 4 3 5
1 7 4 1
2 9 5 9]
df = pd.concat(dfs)
print (df)
A B C
0 1 4 7
1 2 5 8
2 3 6 9
0 3 2 3
1 4 3 6
2 6 4 0
0 4 3 5
1 7 4 1
2 9 5 9
df1 = pd.concat(dfs, ignore_index=True)
print (df1)
A B C
0 1 4 7
1 2 5 8
2 3 6 9
3 3 2 3
4 4 3 6
5 6 4 0
6 4 3 5
7 7 4 1
8 9 5 9
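Once the tables are stacked, it is no longer obvious which source table each row came from. If that matters, one possible approach is to tag each DataFrame before concatenating. The following is only a minimal sketch: the two small frames are made-up stand-ins for the list returned by pd.read_html, and the column name source_table is a hypothetical name chosen for illustration.

```python
import pandas as pd

# Made-up stand-ins for the list of tables returned by pd.read_html
dfs = [
    pd.DataFrame({"Presentation": ["vial A", "vial B"],
                  "Shortage Reason": ["Demand increase", "Other"]}),
    pd.DataFrame({"Presentation": ["syringe C"],
                  "Shortage Reason": ["Other"]}),
]

# Add a column holding the position of each table in the list
# ("source_table" is a hypothetical name), then stack everything
tagged = [df.assign(source_table=i) for i, df in enumerate(dfs)]
combined = pd.concat(tagged, ignore_index=True)

print(combined)
```

Each row keeps a fresh 0..n-1 index from ignore_index=True, while source_table records which original table it belonged to.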