Split a pandas DataFrame column based on key-value pairs

Time: 2021-04-15 12:56:13

Tags: python-3.x pandas dataframe

I have a DataFrame like this:

                               Id Column  Val1   Val2
0      Cust=abc,Region Info=xyz,Data=123   0.0    NaN
1      Cust=abd,Region Info=xyz,Data=124   1.0  750.0
2                  Cust=acc hit,Data=125   3.0  400.0
3      Cust=abc,Region Info=xyz,Data=126   NaN  200.0
4  Cust=abg nss,Region Info=xaz,Data=127  -1.0  420.0
5               Cust=evc,Region Info=atz   2.0    NaN

I want to transform the DataFrame into this:

                               Id Column  Val1   Val2     Cust Region Info   Data
0      Cust=abc,Region Info=xyz,Data=123   0.0    NaN      abc         xyz  123.0
1      Cust=abd,Region Info=xyz,Data=124   1.0  750.0      abd         xyz  124.0
2                  Cust=acc hit,Data=125   3.0  400.0  acc hit         NaN  125.0
3      Cust=abc,Region Info=xyz,Data=126   NaN  200.0      abc         xyz  126.0
4  Cust=abg nss,Region Info=xaz,Data=127  -1.0  420.0  abg nss         xaz  127.0
5               Cust=evc,Region Info=atz   2.0    NaN      evc         atz    NaN

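I got a partial answer from another question.

But how do I handle spaces in the keys and values?

Edit: There may be more key-value pairs than those shown in the example, so I need to handle an arbitrary number 'n' of columns.

4 answers:

Answer 0 (score: 3)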

Series.str.findall

We can use str.findall with regex capture groups to extract the key-value pairs from the Id Column column.

df.join(pd.DataFrame(map(dict, df['Id Column'].str.findall(r'([^=,]+)=([^,]+)'))))

                               Id Column  Val1   Val2     Cust Region Info Data
0      Cust=abc,Region Info=xyz,Data=123   0.0    NaN      abc         xyz  123
1      Cust=abd,Region Info=xyz,Data=124   1.0  750.0      abd         xyz  124
2                  Cust=acc hit,Data=125   3.0  400.0  acc hit         NaN  125
3      Cust=abc,Region Info=xyz,Data=126   NaN  200.0      abc         xyz  126
4  Cust=abg nss,Region Info=xaz,Data=127  -1.0  420.0  abg nss         xaz  127
5               Cust=evc,Region Info=atz   2.0    NaN      evc         atz  NaN

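Regex details

  • ([^=,]+) : first capturing group
    • [^=,]+ : matches any character not in the list [=,] one or more times
  • = : matches the = character literally
  • ([^,]+) : second capturing group
    • [^,]+ : matches any character not in the list [,] one or more times

See the online regex demo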
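For readers who want to run the approach above end to end, here is a minimal self-contained sketch. The sample frame is rebuilt from the question. The final pd.to_numeric step is not part of the answer; it is added on the assumption that Data should be numeric, as in the desired output shown in the question.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "Id Column": [
        "Cust=abc,Region Info=xyz,Data=123",
        "Cust=abd,Region Info=xyz,Data=124",
        "Cust=acc hit,Data=125",
        "Cust=abc,Region Info=xyz,Data=126",
        "Cust=abg nss,Region Info=xaz,Data=127",
        "Cust=evc,Region Info=atz",
    ],
    "Val1": [0.0, 1.0, 3.0, np.nan, -1.0, 2.0],
    "Val2": [np.nan, 750.0, 400.0, 200.0, 420.0, np.nan],
})

# findall returns a list of (key, value) tuples per row; dict() turns each
# list into a mapping, and the DataFrame constructor aligns keys into columns.
pairs = df["Id Column"].str.findall(r"([^=,]+)=([^,]+)")
parsed = pd.DataFrame([dict(p) for p in pairs], index=df.index)

# Assumption: "Data" should be numeric, as in the desired output.
parsed["Data"] = pd.to_numeric(parsed["Data"])

result = df.join(parsed)
print(result)
```

Because the keys are discovered per row, any additional key that appears in the data becomes its own column without changes to the pattern.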

Answer 1 (score: 2)

With only the samples you have shown, please try the following.

import pandas as pd
df[["Cust", "Region Info", "Data"]] = df["Id Column"].str.extract(r'^Cust=([^,]+)(?:,Region Info=([^,]*))?(?:,Data=(.*))?$', expand=True)
df

Here is the online demo for the regex used.

The output is as follows:

                               Id Column  Val1   Val2     Cust Region Info Data
0      Cust=abc,Region Info=xyz,Data=123   0.0    NaN      abc         xyz  123
1      Cust=abd,Region Info=xyz,Data=124   1.0  750.0      abd         xyz  124
2                  Cust=acc hit,Data=125   3.0  400.0  acc hit         NaN  125
3      Cust=abc,Region Info=xyz,Data=126   NaN  200.0      abc         xyz  126
4  Cust=abg nss,Region Info=xaz,Data=127  -1.0  420.0  abg nss         xaz  127
5               Cust=evc,Region Info=atz   2.0    NaN      evc         atz  NaN

Explanation: a detailed breakdown of the regex above.

^Cust=              ## Match "Cust=" at the start of the value.
([^,]+)             ## 1st capturing group: everything up to the next comma.
(?:,Region Info=    ## Start of a non-capturing group matching ",Region Info=".
  ([^,]*)           ## 2nd capturing group: everything up to the next comma.
)?                  ## Close the non-capturing group and make it optional.
(?:,Data=           ## Start of a non-capturing group matching ",Data=".
  (.*)              ## 3rd capturing group: everything up to the end of the value.
)?$                 ## Close the optional non-capturing group; $ anchors the end of the value.
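The annotated layout above can also be written as a real verbose-mode pattern, so the comments live next to the regex in code. The following is only a sketch on a shortened sample: the column names are assigned by hand to match the question's desired output, and the pattern still covers just the three known keys.

```python
import re
import pandas as pd

ids = pd.Series([
    "Cust=abc,Region Info=xyz,Data=123",
    "Cust=acc hit,Data=125",
    "Cust=evc,Region Info=atz",
], name="Id Column")

# In verbose mode unescaped whitespace is ignored and '#' starts a comment,
# so the literal space in "Region Info" must be escaped.
pattern = r"""
    ^Cust=([^,]+)                  # group 1: value of Cust
    (?:,Region\ Info=([^,]*))?     # group 2: optional value of Region Info
    (?:,Data=(.*))?$               # group 3: optional value of Data
"""

extracted = ids.str.extract(pattern, flags=re.VERBOSE)
extracted.columns = ["Cust", "Region Info", "Data"]
print(extracted)
```

Rows that lack an optional key get NaN in the corresponding column, because the unmatched group yields no value.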

Answer 2 (score: 1)

Use apply() on the Id Column column and get the value by splitting.

df['Cust'] = df['Id Column'].apply(lambda x: x.split(',')[0].split('=')[-1])
# print(df)

                               Id Column  Val1   Val2     Cust
0      Cust=abc,Region Info=xyz,Data=123   0.0    NaN      abc
1      Cust=abd,Region Info=xyz,Data=124   1.0  750.0      abd
2                  Cust=acc hit,Data=125   3.0  400.0  acc hit
3      Cust=abc,Region Info=xyz,Data=126   NaN  200.0      abc
4  Cust=abg nss,Region Info=xaz,Data=127  -1.0  420.0  abg nss
5               Cust=evc,Region Info=atz   2.0    NaN      evc
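The positional split above returns only the first pair. Picking the second or third pair by position would mislabel rows where a key is missing (row 2 has no Region Info). One possible alternative is to look values up by key name, as sketched below. The helper value_for is a hypothetical name introduced for illustration; it is not part of pandas.

```python
import pandas as pd

ids = pd.Series([
    "Cust=abc,Region Info=xyz,Data=123",
    "Cust=acc hit,Data=125",
    "Cust=evc,Region Info=atz",
], name="Id Column")

def value_for(text, key):
    """Return the value stored under `key` in a 'k=v,k=v' string, or None."""
    for pair in text.split(","):
        k, sep, v = pair.partition("=")
        if sep and k == key:
            return v
    return None

# One apply() per key; the key list is an assumption taken from the question.
out = pd.DataFrame({
    key: ids.apply(lambda s, k=key: value_for(s, k))
    for key in ["Cust", "Region Info", "Data"]
})
print(out)
```

This keeps the apply() style of the answer but requires the key names to be known in advance, so it does not by itself cover the arbitrary-key case from the question's edit.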

Answer 3 (score: 1)

Use a list comprehension that splits on , and then on = to build a list of dictionaries, which can be passed to the DataFrame constructor:

L = [dict([y.split('=') for y in x.split(',')]) for x in df['Id Column']]
df = df.join(pd.DataFrame(L, index=df.index))
print(df)

                               Id Column  Val1   Val2     Cust Region Info Data
0      Cust=abc,Region Info=xyz,Data=123   0.0    NaN      abc         xyz  123
1      Cust=abd,Region Info=xyz,Data=124   1.0  750.0      abd         xyz  124
2                  Cust=acc hit,Data=125   3.0  400.0  acc hit         NaN  125
3      Cust=abc,Region Info=xyz,Data=126   NaN  200.0      abc         xyz  126
4  Cust=abg nss,Region Info=xaz,Data=127  -1.0  420.0  abg nss         xaz  127
5               Cust=evc,Region Info=atz   2.0    NaN      evc         atz  NaN
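A caveat with the list comprehension above: y.split('=') yields more than two items if a value itself contains =, and dict() then raises an error. The sketch below assumes such values can occur and uses str.partition, which splits at the first = only. The helper name parse_pairs and the sample row containing a=b are made up for illustration.

```python
import pandas as pd

def parse_pairs(text):
    """Parse 'k=v,k=v' into a dict, splitting each pair at the first '=' only."""
    result = {}
    for piece in text.split(","):
        key, sep, value = piece.partition("=")
        if sep:  # skip pieces that contain no '=' at all
            result[key] = value
    return result

df = pd.DataFrame({"Id Column": [
    "Cust=abc,Region Info=xyz,Data=123",
    "Cust=acc hit,Data=125",
    "Cust=evc,Note=a=b",  # hypothetical row whose value contains '='
]})

parsed = pd.DataFrame([parse_pairs(x) for x in df["Id Column"]], index=df.index)
df = df.join(parsed)
print(df)
```

Values that contain a comma are still split incorrectly here; handling those would need a quoting convention that the question does not describe.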