Slicing a pandas column using values from another column

Time: 2017-09-20 13:54:06

Tags: python python-2.7 pandas substring

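So I have a DataFrame with some text in one column. I'm trying to find two strings in each row of that column, and then slice the row's text between those two strings to get a substring. Like this: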

startinds = df[column].str.find("First Event = ")
endinds   = df[column].str.find("\nLast Event = ")

df["first_timestamp"] = df[column].str.slice(startinds,endinds)

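Now this doesn't work, because startinds and endinds are Series, so I can't use them as indices to slice the strings in column.

Does anyone know a way to access the per-row values so I can take the substring on each row?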

Sample input:

    Data
0   "Blahblah
     First Event = 09/20/2017 12:00:00
     Last Event = 09/20/2017 13:00:00
     Blahblahblah"
1   "Blahblahblahblah
     Blahablahblah
     First Event = 09/20/2017 12:30:00
     Last Event = 09/20/2017 12:45:00
     Blahblahblah"

Output:

    first_timestamp
0   "First Event = 09/20/2017 12:00:00"
1   "First Event = 09/20/2017 12:30:00"

2 answers:

Answer 0: (score: 3)

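To make the slicing approach work you can use a lambda: store startinds and endinds as columns in df, then slice each row's string with a lambda that reads those columns. (Note that the backslash is escaped below, so "\\n" searches for a literal backslash followed by n in the text. If your data contains real newline characters, search for "\n" instead; otherwise find returns -1 and the slice just drops the last character.)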

df['startinds'] = df['Data'].str.find("First Event = ")
df['endinds']  = df['Data'].str.find("\\nLast Event = ")

df.apply(lambda x: str(x['Data'])[x['startinds']:x['endinds']], axis=1)

Output:

0    First Event = 09/20/2017 12:00:00
1    First Event = 09/20/2017 12:30:00
dtype: object
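
For reference, here is a self-contained sketch of the same row-wise slicing without the helper columns. It assumes the text contains real newline characters (so it searches for "\n", not "\\n") and that a row missing either marker should get None; both are assumptions, not something the question states. The names starts and ends are illustrative, and the frame is rebuilt from the sample input in the question.

```python
import pandas as pd

# Sample frame rebuilt from the question's example (real newlines assumed).
df = pd.DataFrame({
    "Data": [
        "Blahblah\nFirst Event = 09/20/2017 12:00:00\n"
        "Last Event = 09/20/2017 13:00:00\nBlahblahblah",
        "Blahblahblahblah\nBlahablahblah\nFirst Event = 09/20/2017 12:30:00\n"
        "Last Event = 09/20/2017 12:45:00\nBlahblahblah",
    ]
})

# Per-row positions of the two markers; str.find returns -1 when not found.
starts = df["Data"].str.find("First Event = ")
ends = df["Data"].str.find("\nLast Event = ")

# Slice each string with its own start/end; skip rows where a marker is missing.
df["first_timestamp"] = [
    text[start:end] if start != -1 and end != -1 else None
    for text, start, end in zip(df["Data"], starts, ends)
]

print(df["first_timestamp"].tolist())
```

Zipping the three Series keeps the per-row indices paired with their strings, which is the part that str.slice cannot do, since it only accepts a single start and stop for the whole column.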

Answer 1: (score: 2)

Unlike the answer in the comments, this approach using Series.str.extract should work:

# expand=False returns a Series rather than a one-column DataFrame
df['first_timestamp'] = df['Data'].str.extract('(First Event = .+)', expand=False)

#                                                 Data  \
# 0  Blahblah\nFirst Event = 09/20/2017 12:00:00\nL...   
# 1  Blahblahblahblah\nFirst Event = 09/20/2017 12:...   
# 
#                      first_timestamp  
# 0  First Event = 09/20/2017 12:00:00  
# 1  First Event = 09/20/2017 12:30:00

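The pattern '(First Event = .+)' captures one group (the parentheses) in which "First Event = " is followed by one or more characters (the .+), stopping at the newline (the . character matches anything except a newline).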
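
A small sketch of that newline behaviour, assuming the text holds real newline characters. The names data, first and greedy are illustrative only. The second call passes re.DOTALL to show what changes when . is allowed to match newlines too.

```python
import re

import pandas as pd

# Sample strings rebuilt from the question's example (real newlines assumed).
data = pd.Series([
    "Blahblah\nFirst Event = 09/20/2017 12:00:00\n"
    "Last Event = 09/20/2017 13:00:00\nBlahblahblah",
    "Blahblahblahblah\nBlahablahblah\nFirst Event = 09/20/2017 12:30:00\n"
    "Last Event = 09/20/2017 12:45:00\nBlahblahblah",
])

# Default flags: "." does not match "\n", so the capture ends at the line break.
first = data.str.extract(r"(First Event = .+)", expand=False)

# With re.DOTALL, "." also matches "\n", so the greedy ".+" runs to the end
# of the string and the capture includes the "Last Event" line as well.
greedy = data.str.extract(r"(First Event = .+)", flags=re.DOTALL, expand=False)

print(first.tolist())
print(greedy.str.contains("Last Event").tolist())
```

So the pattern only works as intended because each event sits on its own line; if the two events were on the same line, a bounded pattern would be needed instead.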