我有一个这样的数据框:
Id Column Val1 Val2
0 Cust=abc,Region Info=xyz,Data=123 0.0 NaN
1 Cust=abd,Region Info=xyz,Data=124 1.0 750.0
2 Cust=acc hit,Data=125 3.0 400.0
3 Cust=abc,Region Info=xyz,Data=126 NaN 200.0
4 Cust=abg nss,Region Info=xaz,Data=127 -1.0 420.0
5 Cust=evc,Region Info=atz 2.0 NaN
我想将数据帧转换成这样:
Id Column Val1 Val2 Cust Region Info Data
0 Cust=abc,Region Info=xyz,Data=123 0.0 NaN abc xyz 123.0
1 Cust=abd,Region Info=xyz,Data=124 1.0 750.0 abd xyz 124.0
2 Cust=acc hit,Data=125 3.0 400.0 acc hit NaN 125.0
3 Cust=abc,Region Info=xyz,Data=126 NaN 200.0 abc xyz 126.0
4 Cust=abg nss,Region Info=xaz,Data=127 -1.0 420.0 abg nss xaz 127.0
5 Cust=evc,Region Info=atz 2.0 NaN evc atz NaN
从另一个 question 中,我得到了部分答案。
但是我如何处理键和值中的空格?
编辑:可能有多个键值对(示例中显示的除外)。所以我需要处理任意 'n' 个列的情况。
答案 0 :(得分:3)
Series.str.findall
我们可以使用带有正则表达式捕获组的 str.findall
从 key-value
列中提取 Id Column
对
df.join(pd.DataFrame(map(dict, df['Id Column'].str.findall(r'([^=,]+)=([^,]+)'))))
Id Column Val1 Val2 Cust Region Info Data
0 Cust=abc,Region Info=xyz,Data=123 0.0 NaN abc xyz 123
1 Cust=abd,Region Info=xyz,Data=124 1.0 750.0 abd xyz 124
2 Cust=acc hit,Data=125 3.0 400.0 acc hit NaN 125
3 Cust=abc,Region Info=xyz,Data=126 NaN 200.0 abc xyz 126
4 Cust=abg nss,Region Info=xaz,Data=127 -1.0 420.0 abg nss xaz 127
5 Cust=evc,Region Info=atz 2.0 NaN evc atz NaN
Regex
详情
([^=,]+)
:第一个捕获组
[^=,]+
:匹配列表中不存在的任何字符 [=,]
一次或多次=
:逐字匹配 =
字符([^,]+)
:第二个捕获组
[^,]+
:匹配列表中不存在的任何字符 [,]
一次或多次查看在线regex demo
答案 1 :(得分:2)
仅使用您显示的示例,请尝试以下操作。
import pandas as pd
df[["Cust Region","Info","Data"]] = df["IdColumn"].str.extract(r'^Cust=([^,]+)(?:,Region Info=([^,]*))?(?:,Data=(.*))?$', expand=True)
df
Here is the Online demo for used regex
输出如下:
IdColumn Val1 Val2 Cust Region Info Data
0 Cust=abc,Region Info=xyz,Data=123 0.0 NaN abc xyz 123
1 Cust=abd,Region Info=xyz,Data=124 1.0 750.0 abd xyz 124
2 Cust=acc hit,Data=125 3.0 400.0 acc hit NaN 125
3 Cust=abc,Region Info=xyz,Data=126 NaN 200.0 abc xyz 126
4 Cust=abg nss,Region Info=xaz,Data=127 -1.0 420.0 abg nss xaz 127
5 Cust=evc,Region Info=atz 2.0 NaN evc atz NaN
说明: 为上述正则表达式添加详细说明。
^Cust= ##Checking if value starts from Cust= here.
([^,]+) ##Creating 1st capturing group which has all values till , here.
(?:,Region Info= ##Starting a non-capturing group , Region Info= here.
([^,]*) ##Creating 2nd capturing group which has all values till , here.
)? ##Closing non-capturing group here.
(?:,Data= ##Creating non-capturing group which has ,Data= here.
(.*) ##Creating 3rd capturing group which has all values till end of value here.
)?$ ##Closing non-capturing group here at the end of line.
答案 2 :(得分:1)
在列 apply()
上使用 Id Column
并通过拆分获得值。
df['Cust Region'] = df['Id Column'].apply(lambda x: x.split(',')[0].split('=')[-1])
# print(df)
Id Column Val1 Val2 Cust Region
0 Cust=abc,Region Info=xyz,Data=123 0.0 NaN abc
1 Cust=abd,Region Info=xyz,Data=124 1.0 750.0 abd
2 Cust=acc hit,Data=125 3.0 400.0 acc hit
3 Cust=abc,Region Info=xyz,Data=126 NaN 200.0 abc
4 Cust=abg nss,Region Info=xaz,Data=127 -1.0 420.0 abg nss
5 Cust=evc,Region Info=atz 2.0 NaN evc
答案 3 :(得分:1)
使用列表推导式拆分 ,
然后 =
用于字典列表,因此可以传递给 DataFrame
构造函数:
L = [dict([y.split('=') for y in x.split(',')]) for x in df['Id Column']]
df = df.join(pd.DataFrame(L, index=df.index))
print (df)
Id Column Val1 Val2 Cust Region Info \
0 Cust=abc,Region Info=xyz,Data=123 0.0 NaN abc xyz
1 Cust=abd,Region Info=xyz,Data=124 1.0 750.0 abd xyz
2 Cust=acc hit,Data=125 3.0 400.0 acc hit NaN
3 Cust=abc,Region Info=xyz,Data=126 NaN 200.0 abc xyz
4 Cust=abg nss,Region Info=xaz,Data=127 -1.0 420.0 abg nss xaz
5 Cust=evc,Region Info=atz 2.0 NaN evc atz
Data
0 123
1 124
2 125
3 126
4 127
5 NaN