Question

I have the following DataFrame in pandas:

ID     text
1      T7MS1
2      T5HS2
3      T3XP1
4      Tank_3
5      TANK 5
6      System

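I want to extract the digits from the text column that follow these patterns:

the digits after MS, HS and XP, the digits after TANK, and the digits after Tank_

Desired DataFrame: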

ID     text       new_text
1      T7MS1      1
2      T5HS2      2 
3      T3XP1      1
4      Tank_3     3
5      TANK 5     5
6      System     System

I can do this for a single pattern:

import re

m = re.search(r'TANK (\d+)', 'TANK 5', re.IGNORECASE)
m.group(1)

But how do I combine all the patterns into one and apply it to the column?

Answer 1

Combine all the prefixes with the following regular expression:

(?:MS|HS|XP|TANK |Tank_)(\d+)

Because the prefixes are wrapped in a non-capturing group (?: ), the number you want is still in group 1, just as in your code.
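As a minimal sketch of how the combined pattern behaves, the snippet below runs it over the strings from the question using only the standard `re` module. The names `PATTERN`, `samples` and `results` are illustrative and not part of the original answer.

```python
import re

# Combined pattern from the answer: the non-capturing group lists the prefixes,
# so the only capturing group (group 1) holds the digits.
PATTERN = re.compile(r"(?:MS|HS|XP|TANK |Tank_)(\d+)")

samples = ["T7MS1", "T5HS2", "T3XP1", "Tank_3", "TANK 5", "System"]

results = {}
for text in samples:
    m = PATTERN.search(text)
    # Store the digit run, or None when no prefix is followed by digits.
    results[text] = m.group(1) if m else None

print(results)
```

Only "System" has no match here, so the caller can decide what to put in its place, for example the original text.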

Answer 2

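A simple function with an if statement solves your problem:

import re
import pandas as pd

s = pd.Series(["T7MS1","Tank_3","TANK 5", "System"])

# Use a non-capturing group for the alternation; square brackets would define a character class
pattern = r"(?:MS|HS|XP|TANK |Tank_)(\d+)"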
def fetch_num(txt):
    result = re.findall(pattern,txt)
    if result: # if matched
        return result[0]
    else:
        return txt

s.apply(fetch_num)

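Alternatively, if you don't need the digits to follow a specific word, you can use the pattern r"\d+$".
The $ in the pattern matches the end of the string.

With the sample Series, it returns: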

0         1
1         3
2         5
3    System
dtype: object
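To illustrate the end-anchored alternative mentioned above, here is a small sketch using the standard `re` module. The helper `trailing_number` and the sample "Pump12" are hypothetical and only show that this pattern ignores the prefix entirely.

```python
import re

# One or more digits anchored at the end of the string, whatever precedes them.
TRAILING_DIGITS = re.compile(r"\d+$")

def trailing_number(text):
    """Return the trailing digit run, or the text itself if there is none."""
    m = TRAILING_DIGITS.search(text)
    return m.group(0) if m else text

# "Pump12" is a made-up value with a prefix that is not in the question's list.
samples = ["T7MS1", "Tank_3", "TANK 5", "System", "Pump12"]
trailing = [trailing_number(s) for s in samples]
print(trailing)
```

Because the prefix is not checked, "Pump12" yields "12" here, whereas the prefix-based pattern would leave it unchanged.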

Answer 3

If the number is always the last character of the string, you can simply use the pandas Series string methods, like this:

df['new_text'] = df.text.str.slice(-1)

Otherwise, since the digits in the middle of the string are not wanted, you can use a regex solution once you know more about the pattern.
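The caveat matters in practice. A brief sketch, assuming a hypothetical multi-digit value "Tank_12" alongside values from the question, shows where taking the last character goes wrong:

```python
import pandas as pd

# "Tank_12" is a hypothetical multi-digit value; the others come from the question.
s = pd.Series(["T7MS1", "Tank_12", "System"])

# slice(-1) keeps only the final character of each string.
last_char = s.str.slice(-1)
print(last_char.tolist())
```

The multi-digit value is truncated to "2" and "System" yields "m", so this shortcut is only safe when every row ends in exactly one digit.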

Answer 4

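If the number is always the last character, just use Series.str[-1], as in the first example below.

If you only want the digits after MS, HS, XP, TANK and Tank_, use Series.str.extract with those prefixes, as in the second example: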

df= pd.DataFrame({'id': [1, 2, 3, 4, 5],
                 'text': ['T7MS1', 'T5HS2', 'T3XP1', 'Tank_3', 'TANK 5']})
df

    id  text
0   1   T7MS1
1   2   T5HS2
2   3   T3XP1
3   4   Tank_3
4   5   TANK 5


df['new_text'] = df.text.str[-1]
df

   id   text    new_text
0   1   T7MS1    1
1   2   T5HS2    2
2   3   T3XP1    1
3   4   Tank_3   3
4   5   TANK 5   5

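Rows that match none of the prefixes come back as NaN from str.extract; you can fill those nulls with the original text:

df['new_text'] = df.text.str.extract(r'(?:MS|HS|XP|TANK |Tank_)(\d+)', expand=False).fillna(df.text)
df

   id   text    new_text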
0   1   T7MS1    1
1   2   T5HS2    2
2   3   T3XP1    1
3   4   Tank_3   3
4   5   TANK 5   5
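The question's own attempt used re.IGNORECASE. As a sketch of how that could carry over to the column-wise approach, `Series.str.extract` accepts a `flags` argument. The mixed-case sample values and the shortened pattern `TANK[ _]` are assumptions for illustration; note that with this flag the pattern also accepts spellings such as "tank 5" or "ms1".

```python
import re
import pandas as pd

# Mixed-case values are hypothetical; they are not in the original data.
df = pd.DataFrame({"text": ["T7MS1", "tank_3", "Tank 5", "System"]})

# With IGNORECASE, "TANK[ _]" covers both "TANK " and "Tank_" in any casing.
df["new_text"] = df["text"].str.extract(
    r"(?:MS|HS|XP|TANK[ _])(\d+)", flags=re.IGNORECASE, expand=False
)
print(df)
```

Rows without a match, such as "System", are left as NaN in this sketch.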

Answer 5

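I want to extract the digits from the text column that follow the patterns below

the digits after MS, HS and XP, the digits after TANK, and the digits after Tank_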

l = ['MS','HS','XP','TANK','Tank_']
t['new_text'] = t['text'].apply(lambda x: re.findall(r'(?<=[{}\s])\d'.format( [d for d in l if d in x][0]),x)[0])

Output

   ID    text new_text
0   1   T7MS1        1
1   2   T5HS2        2
2   3   T3XP1        1
3   4  Tank_3        3
4   5  TANK 5        5
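The one-liner above raises an IndexError for a value such as "System", because no prefix from the list is found in it. A more defensive sketch, assuming the same prefix list, builds a single alternation with `re.escape` and falls back to the original text. The names `prefixes`, `pattern` and `number_after_prefix` are illustrative.

```python
import re

prefixes = ["MS", "HS", "XP", "TANK", "Tank_"]

# Join the escaped prefixes into one alternation; \s? allows the space in "TANK 5".
pattern = re.compile(r"(?:" + "|".join(re.escape(p) for p in prefixes) + r")\s?(\d+)")

def number_after_prefix(text):
    """Return the digits after a known prefix, or the text itself if none match."""
    m = pattern.search(text)
    return m.group(1) if m else text

values = ["T7MS1", "T5HS2", "T3XP1", "Tank_3", "TANK 5", "System"]
extracted = [number_after_prefix(v) for v in values]
print(extracted)
```

The same function could be passed to `t['text'].apply(...)`, so adding a prefix only requires extending the list.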

Update

Using Alexis's regex, and falling back to the original text when nothing matches:

t['new_text'] = t['text'].apply(lambda x: re.findall(r'(?:MS|HS|XP|TANK |Tank_)(\d+)', x)[0] if re.findall(r'(?:MS|HS|XP|TANK |Tank_)(\d+)', x) else x)

Output

    ID    text new_text
0   1   T7MS1        1
1   2   T5HS2        2
2   3   T3XP1        1
3   4  Tank_3        3
4   5  TANK 5        5
5   6  System   System

How to extract a number after a string pattern in pandas

5 answers: