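How do I split one column that contains nulls and multiple values into several columns?

Time: 2019-10-18 10:02:03

Tags: pandas dataframe machine-learning python-3.7

I converted this file from PDF to CSV in order to train a model. Three columns of the PDF file (ProductID, Commodity and Country) were merged into a single column in the CSV.

I tried to separate these columns with regular expressions, but I am not sure how to make that work.

This is the data I am working with: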

                   country/commodity Unit        Quantity      Value
1     0011101 BREEDING BULLS (OXEN)   NO            NaN          75
2                             DUBAI  NaN            NaN          75
3  0011102 BREEDING BULLS (BUFFALO)   NO            248        1921
4                         SRI LUNKA  NaN            248        1921
5          0011103 BUFFALO,BREEDING   NO            NaN          90
6                         SRI LUNKA  NaN            NaN          90
7             0011104 COWS BREEDING   NO           1249   258921665
8                             AJMAN  NaN            NaN         NaN
9                            CYPRUS  NaN            NaN         NaN 

I need this data in the following format:

0    ProductID      Commodity           Country     Unit  Quantity    Value 
1     0011101    BREEDING BULLS (OXEN)   DUBAI      NaN    NaN          75
3     0011102   BREEDING BULLS (BUFFALO) SRI LUNKA  NaN    248         1921
4     0011103   BUFFALO,BREEDING         SRI LUNKA  NaN    NaN          90            
7     0011104   COWS BREEDING            AJMAN      NaN    NaN         NaN        
8     0011104   COWS BREEDING            CYPRUS     NaN    NaN         NaN                        
9     0011104   COWS BREEDING            CHINA      NaN    590         3290

1 Answer:

Answer 0: (score: 0)

First, we separate your column country/commodity into ProductID, Commodity and Country by extracting the information with the following methods:

  • str.split
  • str.extract
  • Series.where
  • Series.mask
  • str.contains

Then we group by ProductID to bring together the information for each product. For this we use named aggregation, which is new as of pandas 0.25.0:

# Extract information from country/commodity
# ProductID: leading digits of each product row, forward-filled onto its country rows
df['ProductID'] = df['country/commodity'].str.split(' ', n=1).str[0].str.extract(r'(\d+)', expand=False).ffill()
# Commodity: text after the product code, kept only on product rows (where Unit is set)
df['Commodity'] = df['country/commodity'].str.split(r'\d+').str[-1].str.strip().where(df['Unit'].notna())
# Country: rows without digits; product rows become empty strings
df['Country'] = df['country/commodity'].mask(df['country/commodity'].str.contains(r'\d+')).fillna('')

# Groupby ProductID to get information together
df_new = df.groupby(['ProductID']).agg(
    Commodity=('Commodity', 'first'),
    Country=('Country', ', '.join),
    Unit=('Unit', 'first'),
    Quantity=('Quantity', 'first'),
    Value=('Value', 'first')
).reset_index()

# Remove the leading ', ' left by the empty string of each product row
df_new['Country'] = df_new['Country'].str.lstrip(', ')

Output

  ProductID                  Commodity        Country Unit  Quantity  \
0   0011101      BREEDING BULLS (OXEN)          DUBAI   NO       NaN   
1   0011102   BREEDING BULLS (BUFFALO)      SRI LUNKA   NO     248.0   
2   0011103           BUFFALO,BREEDING      SRI LUNKA   NO       NaN   
3   0011104              COWS BREEDING  AJMAN, CYPRUS   NO    1249.0   

         Value  
0         75.0  
1       1921.0  
2         90.0  
3  258921665.0
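
The output above joins all countries of a product into one string. If one row per country is preferred, as in the format requested in the question, the following is a minimal sketch of an alternative. It assumes that every product row starts with a numeric code followed by the commodity name and that country rows never start with digits. The names parts, is_header and long_df are made up for this example, and the sample rows are copied from the question.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame({
    "country/commodity": [
        "0011101 BREEDING BULLS (OXEN)", "DUBAI",
        "0011102 BREEDING BULLS (BUFFALO)", "SRI LUNKA",
        "0011103 BUFFALO,BREEDING", "SRI LUNKA",
        "0011104 COWS BREEDING", "AJMAN", "CYPRUS",
    ],
    "Unit": ["NO", np.nan, "NO", np.nan, "NO", np.nan, "NO", np.nan, np.nan],
    "Quantity": [np.nan, np.nan, 248, 248, np.nan, np.nan, 1249, np.nan, np.nan],
    "Value": [75, 75, 1921, 1921, 90, 90, 258921665, np.nan, np.nan],
})

# Product rows match "<digits> <commodity>"; country rows yield NaN in both groups
parts = df["country/commodity"].str.extract(
    r"^(?P<ProductID>\d+)\s+(?P<Commodity>.+)$"
)
is_header = parts["ProductID"].notna()

long_df = (
    parts.ffill()                              # carry ProductID/Commodity down to country rows
    .assign(Country=df["country/commodity"])   # on country rows this is the country name
    .join(df[["Unit", "Quantity", "Value"]])   # keep each row's own Unit/Quantity/Value
    .loc[~is_header]                           # drop the product rows themselves
    .reset_index(drop=True)
)
print(long_df)
```

Here Unit, Quantity and Value are taken from the country rows as they are, so they stay NaN wherever the CSV had no value on that row. ProductID remains a string, which preserves the leading zeros.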