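I have a DataFrame with a column that contains some text. I am trying to find two strings in each row of that column and then slice the row's text between those two strings to get a substring. Something like this: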
startinds = df[column].str.find("First Event = ")
endinds = df[column].str.find("\nLast Event = ")
df["first_timestamp"] = df[column].str.slice(startinds,endinds)
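This doesn't work because `startinds` and `endinds` are Series, so I can't use them as indices for slicing the strings in `column`.
Does anyone know a way to access the per-row values so that I can take the substring on each row?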
Sample input:
Data
0 "Blahblah
First Event = 09/20/2017 12:00:00
Last Event = 09/20/2017 13:00:00
Blahblahblah"
1 "Blahblahblahblah
Blahablahblah
First Event = 09/20/2017 12:30:00
Last Event = 09/20/2017 12:45:00
Blahblahblah"
Desired output:
first_timestamp
0 "First Event = 09/20/2017 12:00:00"
1 "First Event = 09/20/2017 12:30:00"
Answer 0 (score: 3)
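To make the slicing approach work, you can use a lambda: store `startinds` and `endinds` in the DataFrame, then slice each row's string with a row-wise `apply`. Note that you need an escape character to match a literal `\n`: the search string `"\\nLast Event = "` looks for a backslash followed by `n`. If your column holds actual newline characters instead, search for `"\nLast Event = "`.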
df['startinds'] = df['Data'].str.find("First Event = ")
df['endinds'] = df['Data'].str.find("\\nLast Event = ")
df.apply(lambda x: str(x['Data'])[x['startinds']:x['endinds']], axis=1)
Output:
0    First Event = 09/20/2017 12:00:00
1    First Event = 09/20/2017 12:30:00
dtype: object
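For reference, here is a self-contained sketch of the same per-row slicing. It assumes the column holds real newline characters, so it searches for `"\nLast Event = "` rather than the escaped form. The sample frame is rebuilt from the question, and the list comprehension over `zip` is an alternative to `apply`, not part of the original answer.

```python
import pandas as pd

# Sample data reconstructed from the question (assumed to contain real newlines).
df = pd.DataFrame({
    "Data": [
        "Blahblah\nFirst Event = 09/20/2017 12:00:00\n"
        "Last Event = 09/20/2017 13:00:00\nBlahblahblah",
        "Blahblahblahblah\nBlahablahblah\nFirst Event = 09/20/2017 12:30:00\n"
        "Last Event = 09/20/2017 12:45:00\nBlahblahblah",
    ]
})

# Per-row start/end positions; str.find returns -1 when the text is absent.
starts = df["Data"].str.find("First Event = ")
ends = df["Data"].str.find("\nLast Event = ")

# Slice each string with its own pair of indices; rows missing either marker get None.
df["first_timestamp"] = [
    text[start:end] if start != -1 and end != -1 else None
    for text, start, end in zip(df["Data"], starts, ends)
]
```

The explicit `-1` check matters because a slice such as `text[5:-1]` would silently return almost the whole string rather than fail.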
Answer 1 (score: 2)
Unlike the answer in the comments, this approach using `Series.str.extract` should work:
df['first_timestamp'] = df['Data'].str.extract('(First Event = .+)')
# Data \
# 0 Blahblah\nFirst Event = 09/20/2017 12:00:00\nL...
# 1 Blahblahblahblah\nFirst Event = 09/20/2017 12:...
#
# first_timestamp
# 0 First Event = 09/20/2017 12:00:00
# 1 First Event = 09/20/2017 12:30:00
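The pattern `'(First Event = .+)'` captures one group (the `()`) consisting of "First Event = " followed by one or more characters (the `.+`), stopping at the newline (the `.` character matches anything except a newline).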
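The newline behaviour can be checked with Python's standard `re` module, which pandas string methods build on. This is only an illustrative sketch: the `text` value is copied from the sample input, and the `re.DOTALL` variant is included purely for contrast.

```python
import re

# One cell of the sample data, assumed to contain real newline characters.
text = (
    "Blahblah\nFirst Event = 09/20/2017 12:00:00\n"
    "Last Event = 09/20/2017 13:00:00\nBlahblahblah"
)

# By default "." does not match "\n", so ".+" stops at the end of the line.
first_line = re.search(r"(First Event = .+)", text).group(1)

# With re.DOTALL, "." also matches newlines, so the capture runs to the end of the string.
greedy = re.search(r"(First Event = .+)", text, flags=re.DOTALL).group(1)
```

Here `first_line` holds only the "First Event" line, while `greedy` continues through the remaining lines, which is why the default flags are the right choice for this extraction.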