Question

Given the following DataFrame:

import pandas as pd

d = {'col1':['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}

df = pd.DataFrame.from_dict(d)

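My end goal is to extract the letter a, b, or c (as a string) into a pandas Series. To do that, I am using the .findall() method from the re module, as shown below: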

# import the module
import re
# define the patterns
pat = 'a|b|c'

# extract the patterns from the elements in the specified column
df['col1'].str.findall(pat)

The problem is that the output for each row (i.e. the letter a, b, or c) ends up inside a single-element list, like this:

Out[301]: 
0    [a]
1    [b]
2    [c]
3    [a]

I would like to get the letters a, b, or c as plain strings instead, like this:

0    a
1    b
2    c
3    a

I know that combining re.search() with .group() would give me a string, but if I do:

df['col1'].str.search(pat).group()

I get the following error message:

AttributeError: 'StringMethods' object has no attribute 'search'

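Using .str.split() won't do the job because, in my original DataFrame, I want to capture strings that may contain the delimiter (for example, I might want to capture a-b).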

Does anyone know a simple solution, ideally one that avoids iterative operations such as a for loop or a list comprehension?

Answer 1

Use str.extract with a capture group:

import pandas as pd

d = {'col1':['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}

df = pd.DataFrame.from_dict(d)

result = df['col1'].str.extract('(a|b|c)')

print(result)

Output:

   0
0  a
1  b
2  c
3  a
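
Since the question asks for a Series of strings rather than a DataFrame, here is a minimal sketch of the same idea with expand=False. With a single capture group, that flag should make str.extract return a Series. The variable name letters is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']})

# With one capture group, expand=False returns a Series
# instead of a one-column DataFrame.
letters = df['col1'].str.extract('(a|b|c)', expand=False)

print(type(letters).__name__)
print(letters.tolist())
```

Rows with no match come back as NaN, so no explicit loop or list comprehension is needed.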

Answer 2

Fixing your code:

pat = 'a|b|c'
df['col1'].str.findall(pat).str[0]
Out[309]: 
0    a
1    b
2    c
3    a
Name: col1, dtype: object
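
One thing worth checking with this approach is what happens when a row has no match. The sketch below uses a made-up row ('x-999', not part of the original data) to illustrate it: findall returns an empty list for that row, and .str[0] then yields NaN rather than raising an error.

```python
import pandas as pd

# 'x-999' is a hypothetical row that matches none of a, b, c.
s = pd.Series(['a-1524112-124', 'x-999', 'c-584854'])

# findall gives a list per row; .str[0] picks the first item,
# or NaN when the list is empty.
first = s.str.findall('a|b|c').str[0]

print(first.tolist())
```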

Answer 3

Just try str.split(), like this: df["col1"].str.split("-", n = 1, expand = True)

import pandas as pd
d = {'col1':['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}
df = pd.DataFrame.from_dict(d)
# expand=True returns one column per piece; keep only the first piece
df['col1'] = df["col1"].str.split("-", n = 1, expand = True)[0]
print(df.head())

Output:

  col1
0    a
1    b
2    c
3    a
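
The question mentions that some tokens may themselves contain the delimiter (for example a-b). Here is a rough sketch of how splitting and a regex differ in that case. The data and the pattern '^(a-b|a|b|c)' are assumptions made up for illustration; the longer alternative is listed first so it is tried before the plain a.

```python
import pandas as pd

# Hypothetical data: the first row starts with a token containing the delimiter.
s = pd.Series(['a-b-1524112', 'b-1515', 'c-584854'])

# Splitting on the first '-' cuts the 'a-b' token in half.
by_split = s.str.split('-', n=1, expand=True)[0]

# Regex alternatives are tried left to right, so 'a-b' wins over 'a'.
by_regex = s.str.extract(r'^(a-b|a|b|c)', expand=False)

print(by_split.tolist())
print(by_regex.tolist())
```

So str.split() is fine for the sample data shown in the question, while the regex version is the one that can keep a-b intact.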

Extracting elements from a pandas DataFrame using regular expressions
