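I have a DataFrame with a column of boolean expressions, and I want to create another column that is simply a list of the operands in each expression.
Example: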
Name Exp
A DDDD | LLLL & AAAA
D HHHH | DDDD | JJJJ
O UUUU & FFFF & RRRR
Resulting df:
Name Exp Exp List
A DDDD | LLLL & AAAA ['DDDD','LLLL','AAAA']
D HHHH | DDDD | JJJJ ['HHHH','DDDD','JJJJ']
O UUUU & FFFF & RRRR ['UUUU','FFFF','RRRR']
Answer 0 (score: 5)
Use Series.str.findall with the regular expression [a-zA-Z]+ to extract the words:
df['Exp List'] = df['Exp'].str.findall(r'[a-zA-Z]+')
# alternative: \w+ also matches digits and underscores
# df['Exp List'] = df['Exp'].str.findall(r'\w+')
print(df)
Name Exp Exp List
0 A DDDD | LLLL & AAAA [DDDD, LLLL, AAAA]
1 D HHHH | DDDD | JJJJ [HHHH, DDDD, JJJJ]
2 O UUUU & FFFF & RRRR [UUUU, FFFF, RRRR]
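For readers who want to run this end to end, here is a minimal sketch that rebuilds the question's data and applies Series.str.findall. The `mixed` series and the names `letters_only` and `word_chars` are hypothetical additions, included only to show how the two patterns differ once operands contain digits or underscores.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Name": ["A", "D", "O"],
    "Exp": ["DDDD | LLLL & AAAA", "HHHH | DDDD | JJJJ", "UUUU & FFFF & RRRR"],
})

# Each run of ASCII letters becomes one element of the list.
df["Exp List"] = df["Exp"].str.findall(r"[a-zA-Z]+")
print(df)

# Hypothetical operands (not in the question) that contain digits and underscores.
mixed = pd.Series(["A1 | B_2"])
letters_only = mixed.str.findall(r"[a-zA-Z]+").tolist()  # digits and "_" are dropped
word_chars = mixed.str.findall(r"\w+").tolist()          # digits and "_" are kept
print(letters_only, word_chars)
```

If the operands are plain letters, as in the question, either pattern should give the same lists; the choice only matters for names like the hypothetical ones above.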
A solution using Series.str.split, with the separators escaped and optional whitespace around them:
df['Exp List'] = df['Exp'].str.split(r'\s*\|\s*|\s*&\s*')
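One thing to watch with the split approach: the pattern only consumes whitespace next to a separator, so whitespace at the very start or end of a cell stays attached to the first or last operand. The sketch below uses a hypothetical input with uneven spacing and assumes that stripping the string first is acceptable for your data.

```python
import pandas as pd

# Hypothetical input with uneven spacing (not from the question).
s = pd.Series(["  HHHH|DDDD |JJJJ "])

# Same separator pattern as above: | or &, with optional whitespace on both sides.
pattern = r"\s*\|\s*|\s*&\s*"

# Without stripping, outer whitespace survives in the first and last elements.
raw = s.str.split(pattern).tolist()

# Stripping the whole string first removes it.
clean = s.str.strip().str.split(pattern).tolist()

print(raw)
print(clean)
```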
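Answer 1 (score: 1)
If the Exp column contains other special characters, the findall patterns in @jezrael's answer will break the operands apart.
If you know the boolean operators are always | or &, this implementation works: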
>>> import pandas as pd
>>> df = pd.DataFrame({'Name': ['A', 'D', 'O'],
...                    'Exp': ['DDDD | L-LL & AAAA', 'HHHH | DDDD | JJJJ', 'UUUU& FFFF & RRRR']})
>>> df
Name Exp
0 A DDDD | L-LL & AAAA
1 D HHHH | DDDD | JJJJ
2 O UUUU& FFFF & RRRR
>>> # optional whitespace on both sides of | or &, so no operand keeps a leading space
>>> df['Exp List'] = df['Exp'].str.split(r'\s*[|&]\s*')
>>> df
Name Exp Exp List
0 A DDDD | L-LL & AAAA [DDDD, L-LL, AAAA]
1 D HHHH | DDDD | JJJJ [HHHH, DDDD, JJJJ]
2 O UUUU& FFFF & RRRR [UUUU, FFFF, RRRR]
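To see the difference between the approaches on a single string, the sketch below uses Python's re module directly. The pandas documentation describes Series.str.findall as equivalent to applying re.findall to each element, so the same patterns should carry over. The pattern `[^|&\s]+` is an extra suggestion of mine, not part of either answer, and it assumes operands never contain spaces.

```python
import re

expr = "DDDD | L-LL & AAAA"

# Letters-only matching breaks the hyphenated operand into two pieces.
by_letters = re.findall(r"[a-zA-Z]+", expr)

# Splitting on | or & (with optional surrounding whitespace) keeps it whole.
by_split = re.split(r"\s*[|&]\s*", expr)

# Assumed alternative: match runs of anything that is not |, & or whitespace.
by_negated = re.findall(r"[^|&\s]+", expr)

# Missing whitespace before a separator is handled by the optional \s*.
tight = re.split(r"\s*[|&]\s*", "UUUU& FFFF & RRRR")

print(by_letters, by_split, by_negated, tight)
```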