Question

I have a pandas DataFrame that looks like df, and I want to add a column so that it looks like df2.

import pandas as pd
df = pd.DataFrame({'Alternative': ['a_x_17MAR2016_Collectedran30dom', 'b_17MAR2016_CollectedStuff', 'c_z_k_17MAR2016_Collectedan3dom'], 'Values': [34, 65, 7]})

df2 = pd.DataFrame({'Alternative' : ['a_x_17MAR2016_Collectedran30dom', 'b_17MAR2016_CollectedStuff', 'c_z_k_17MAR2016_Collectedan3dom'], 'Values': [34, 65, 7], 'Alts': ['a x 17MAR2016', 'b 17MAR2016', 'c z k 17MAR2016']})

df
Out[4]: 
                       Alternative  Values
0  a_x_17MAR2016_Collectedran30dom      34
1       b_17MAR2016_CollectedStuff      65
2  c_z_k_17MAR2016_Collectedan3dom       7

df2
Out[5]: 
                       Alternative             Alts  Values
0  a_x_17MAR2016_Collectedran30dom    a x 17MAR2016      34
1       b_17MAR2016_CollectedStuff      b 17MAR2016      65
2  c_z_k_17MAR2016_Collectedan3dom  c z k 17MAR2016       7

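In other words, I have a string that underscore delimiters split into a varying number of parts. I want to split it and rejoin the parts with spaces, but drop everything from the start of the part that contains the substring 'Collected' onward.

I can find the index of the string containing the substring 'Collected' in a single list, as I found here, and then combine the other strings, but I can't seem to do this in a very 'pythonic' way across the whole data frame.

Thanks in advance.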

Answer 1

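I believe this technically answers the question, but it does not match the desired output, because the date does not contain the word 'Collected'.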

df.Alternative.str.replace('_[^_]*Collected.*', '', regex=True).str.replace('_', ' ')

Output

0      a x 17MAR2016
1        b 17MAR2016
2    c z k 17MAR2016
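
As a minimal sketch, the expression above could be assigned to a new column like this. It assumes a recent pandas version, where `str.replace` treats the pattern as a literal string unless `regex=True` is passed; the column name `Alts` is taken from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'Alternative': ['a_x_17MAR2016_Collectedran30dom',
                    'b_17MAR2016_CollectedStuff',
                    'c_z_k_17MAR2016_Collectedan3dom'],
    'Values': [34, 65, 7],
})

# Remove the underscore-delimited part that contains 'Collected' and
# everything after it, then turn the remaining underscores into spaces.
df['Alts'] = (
    df['Alternative']
    .str.replace(r'_[^_]*Collected.*', '', regex=True)
    .str.replace('_', ' ', regex=False)
)

print(df['Alts'].tolist())
```

The `[^_]*` in the pattern lets 'Collected' appear anywhere inside its part, not only at the start of it.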

Answer 2

Use str.split:

alts = df.Alternative.str.split('_').str[:-1].str.join(' ')
df.insert(1, 'Alts', alts)
df
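
The slice `.str[:-1]` drops only the last underscore-separated part, which fits the sample data because the 'Collected' part always comes last. If that part could be followed by further parts, a rough sketch along the following lines might be closer to what the question describes. The helper name `keep_before_marker` and the extra test value are hypothetical, introduced here only for illustration.

```python
import pandas as pd

def keep_before_marker(value, marker='Collected'):
    """Join the '_'-separated parts that come before the first part containing marker."""
    kept = []
    for part in value.split('_'):
        if marker in part:
            break  # stop at the first part that contains the marker
        kept.append(part)
    return ' '.join(kept)

df = pd.DataFrame({
    'Alternative': ['a_x_17MAR2016_Collectedran30dom',
                    'b_17MAR2016_CollectedStuff',
                    'c_z_k_17MAR2016_Collectedan3dom'],
    'Values': [34, 65, 7],
})

# Apply the helper to every value in the column.
df['Alts'] = df['Alternative'].map(keep_before_marker)

# Hypothetical value with an extra part after the 'Collected' part.
extra = keep_before_marker('a_x_17MAR2016_Collectedran_extra')
```

This version runs a Python loop per row, so it trades some speed for being explicit about where the cut happens.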

Answer 3

import re
x = df.Alternative.apply(lambda x : re.sub("_Collected.*","",x))
# x
#0      a_x_17MAR2016
#1        b_17MAR2016
#2    c_z_k_17MAR2016

y = x.str.split("_")
#0       [a, x, 17MAR2016]
#1          [b, 17MAR2016]
#2    [c, z, k, 17MAR2016] 

df['newcol'] = y.apply(lambda z: ' '.join(z))
#                       Alternative  Values           newcol
#0  a_x_17MAR2016_Collectedran30dom      34    a x 17MAR2016
#1       b_17MAR2016_CollectedStuff      65      b 17MAR2016
#2  c_z_k_17MAR2016_Collectedan3dom       7  c z k 17MAR2016

All in one line:

import re
df['newcol'] = df.Alternative.apply(lambda x : re.sub("_Collected.*","",x)).str.split("_").apply(lambda z: ' '.join(z))

#                       Alternative  Values           newcol
#0  a_x_17MAR2016_Collectedran30dom      34    a x 17MAR2016
#1       b_17MAR2016_CollectedStuff      65      b 17MAR2016
#2  c_z_k_17MAR2016_Collectedan3dom       7  c z k 17MAR2016
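
One point worth keeping in mind: the pattern "_Collected.*" only matches when 'Collected' sits at the very start of an underscore-separated part. The small sketch below compares it with a pattern that allows leading characters in that part; the input value is hypothetical and not from the question's data.

```python
import re

# Hypothetical value where 'Collected' is not at the start of its part.
value = 'a_17MAR2016_xCollected_y'

# Requires '_' immediately before 'Collected', so nothing is removed here.
strict = re.sub(r'_Collected.*', '', value)

# Allows other non-underscore characters before 'Collected' within the part.
loose = re.sub(r'_[^_]*Collected.*', '', value)

print(strict)
print(loose)
```

For the three sample values in the question both patterns give the same result, since every 'Collected' there directly follows an underscore.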

Return a pandas dataframe column containing a substring of another column
