How do I read and flatten/normalize a list of tables from pandas read_html?

Time: 2016-11-01 18:15:21

Tags: python python-3.x pandas

I am reading about the pandas read_html function because I am extracting some tables from the web. When I do this:

import pandas as pd
url_mcc = 'link.com.html'
dfs = pd.read_html(url_mcc)
dfs

I get the following list:

[                                        Presentation  \
 0  0.4 mg/mL, 1 mL single-dose vial, package of 2...   
 1  1 mg/mL, 1 mL single-dose vial, package of 25 ...   

   Availability and Estimated Shortage Duration  \
 0             Available for NDC 00517-0401-25.   
 1                                    Available   

                                  Related Information  \
 0  American Regent is currently releasing the 0.4...   
 1  American Regent is currently releasing the 1mg...   

    Shortage Reason (per FDASIA)  
 0  Demand increase for the drug  
 1                         Other  ,
                                         Presentation  \
 0  0.1 mg/mL; 10 mL Luer-Jet Prefilled Syringe (N...   

   Availability and Estimated Shortage Duration  Related Information  \
 0                            Product available                  NaN   

    Shortage Reason (per FDASIA)  
 0  Demand increase for the drug  ,
                                         Presentation  \
 0  0.1 mg/mL; 10 mL Ansyr syringe (NDC 0409-1630-10)   
 1  0.05 mg/mL; 5 mL Ansyr syringe (NDC 0409-9630-05)   
 2  0.1 mg/mL; 5 mL Lifeshield syringe (NDC 0409-4...   
 3  0.1 mg/mL; 10 mL Lifeshield syringe (NDC 0409-...   

         Availability and Estimated Shortage Duration  \
 0  Next delivery: Late October. Estimated recover...   
 1         Next delivery: TBD Estimated recovery: TBD   
 2                                          Available   
 3                                          Available   

                                  Related Information  \
 0  Please check with your wholesaler for availabl...   
 1  Please check with your wholesaler for availabl...   
 2               Shortage per Manufacturer: Available   
 3               Shortage per Manufacturer: Available   

   Shortage Reason (per FDASIA)  
 0                        Other  
 1                        Other  
 2                        Other  
 3                        Other  ,
                                Presentation  \
 0  0.4 mg/mL, 20 mL vial (NDC 0641-6006-10)   

   Availability and Estimated Shortage Duration  \
 0           West-Ward has available inventory.   

                                  Related Information  \
 0  Additional lots are scheduled to be manufactur...   

    Shortage Reason (per FDASIA)  
 0  Demand increase for the drug  ]

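As you can see, the list (or tables?) contains repeated columns: Presentation, Availability and Estimated Shortage Duration, Related Information, and Shortage Reason (per FDASIA), because the website has multiple tables with the same columns. So my question is: how can I flatten or normalize all of these tables or lists into a single table, more or less like this: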

[                                        Presentation  \
 0  0.4 mg/mL, 1 mL single-dose vial, package of 2...   
 1  1 mg/mL, 1 mL single-dose vial, package of 25 ...   
 2  1 mg/mL; 10 mL Luer-Jet Prefilled Syringe (N... 
 3  0.1 mg/mL; 10 mL Ansyr syringe (NDC 0409-1630-10)   
 4  0.05 mg/mL; 5 mL Ansyr syringe (NDC 0409-9630-05)   
 5  0.1 mg/mL; 5 mL Lifeshield syringe (NDC 0409-4...   
 6  0.1 mg/mL; 10 mL Lifeshield syringe (NDC 0409-...   



   Availability and Estimated Shortage Duration  \
 0             Available for NDC 00517-0401-25.   
 1                                    Available  
 2                            Product available                  NaN   
 0  Next delivery: Late October. Estimated recover...   
 1         Next delivery: TBD Estimated recovery: TBD   
 2                                          Available   
 3                                          Available  
 0  0.4 mg/mL, 20 mL vial (NDC 0641-6006-10)   

   Availability and Estimated Shortage Duration  \
 0           West-Ward has available inventory.   


    Shortage Reason (per FDASIA)  
 0  Demand increase for the drug  


                                  Related Information  \
 0  American Regent is currently releasing the 0.4...   
 1  American Regent is currently releasing the 1mg...   
 0  Please check with your wholesaler for availabl...   
 1  Please check with your wholesaler for availabl...   
 2               Shortage per Manufacturer: Available   
 3               Shortage per Manufacturer: Available   
 0  Additional lots are scheduled to be manufactur...   


    Shortage Reason (per FDASIA)  
 0  Demand increase for the drug  
 1                         Other  ,



    Shortage Reason (per FDASIA)  
 0  Demand increase for the drug  ,
 0                        Other  
 1                        Other  
 2                        Other  
 3                        Other  ,

1 answer:

Answer 0 (score: 2)

If dfs is a list of DataFrames, I think you need concat:

df = pd.concat(dfs)

You can also pass the parameter ignore_index=True to avoid duplicate index values:

df = pd.concat(dfs, ignore_index=True) 

Sample:

df1 = pd.DataFrame({'A':[1,2,3],
                   'B':[4,5,6],
                   'C':[7,8,9]})

#print (df1)

df2 = pd.DataFrame({'A':[3,4,6],
                   'B':[2,3,4],
                   'C':[3,6,0]})

#print (df2)

df3 = pd.DataFrame({'A':[4,7,9],
                   'B':[3,4,5],
                   'C':[5,1,9]})

#print (df3)

dfs = [df1,df2,df3]
print (dfs)
[   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9,    A  B  C
0  3  2  3
1  4  3  6
2  6  4  0,    A  B  C
0  4  3  5
1  7  4  1
2  9  5  9]
df = pd.concat(dfs)
print (df)
   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9
0  3  2  3
1  4  3  6
2  6  4  0
0  4  3  5
1  7  4  1
2  9  5  9

df1 = pd.concat(dfs, ignore_index=True) 
print (df1)
   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9
3  3  2  3
4  4  3  6
5  6  4  0
6  4  3  5
7  7  4  1
8  9  5  9
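
If it matters which source table each row came from, one possible variation is to pass keys= to concat and then flatten the resulting index. This is only a sketch: the two small DataFrames below are made-up stand-ins for the list returned by pd.read_html, and the level names "table" and "row" are chosen purely for illustration.

```python
import pandas as pd

# Made-up stand-ins for the list that pd.read_html would return;
# the values are placeholders, not data from the real page.
dfs = [
    pd.DataFrame({"Presentation": ["vial A", "vial B"],
                  "Related Information": ["note 1", "note 2"]}),
    pd.DataFrame({"Presentation": ["syringe C"],
                  "Related Information": [None]}),
]

# keys= labels every block of rows with the position of its source table,
# producing a two-level index (hypothetical names: "table" and "row").
combined = pd.concat(dfs, keys=list(range(len(dfs))), names=["table", "row"])

# reset_index turns both index levels into ordinary columns,
# which leaves a single flat table.
flat = combined.reset_index()
print(flat)
```

Compared with ignore_index=True, this keeps the original row numbers and adds a column identifying the table, at the cost of two extra columns.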