Question

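I have a DataFrame in which one column contains strings such as 'weak=30'. I want to extract the digits after the = character and store them in a new column named digits.

I used re.search to find the digits, but so far it raises an error.

Sample data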

import pandas as pd
import re

raw_data = {'patient': [1, 2, 3, 4, 6],
            'treatment': [0, 1, 0, 1, 0],
            'score': ['strong=42', 'weak=30', 'weak=12', 'pitt=12', 'strong=42']}

df = pd.DataFrame(raw_data, columns=['patient', 'treatment', 'score'])

df

   patient  treatment      score
0        1          0  strong=42
1        2          1    weak=30
2        3          0    weak=12
3        4          1    pitt=12
4        6          0  strong=42

So I tried

df=df.assign(digits=[int(re.search(r'\d+', x)) for x in df.score])

TypeError: int() argument must be a string, a bytes-like object or a number, not 're.Match'

In R, you can do

mutate(digits = as.numeric(gsub(".*=", "", score)))

What is the equivalent in Python pandas?

Expected output

   patient  treatment      score  digits
0        1          0  strong=42      42
1        2          1    weak=30      30
2        3          0    weak=12      12
3        4          1    pitt=12      12
4        6          0  strong=42      42

Answer 1

You can use str.replace with your R regex:

df['digits'] = df['score'].str.replace(r'.*=', '', regex=True).astype(int)

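The .*= pattern matches zero or more characters other than line breaks, as many as possible, up to and including the last =. Replacing the match with '' removes that text and leaves only the digits. In pandas 2.0 and later, str.replace treats the pattern as a literal string by default, so regex=True is required for this to work.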
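To illustrate the "as many as possible" part, here is a minimal sketch using the standard re module. The input string with two = signs is a made-up example, not part of the question's data.

```python
import re

# Hypothetical input with two '=' signs, chosen to show the greedy match.
sample = 'group=weak=30'

# Greedy: '.*=' runs up to the LAST '=', so only the trailing digits remain.
greedy = re.sub(r'.*=', '', sample)

# Lazy: '.*?=' stops at the FIRST '='; count=1 limits it to one replacement.
lazy = re.sub(r'.*?=', '', sample, count=1)

print(greedy)  # 30
print(lazy)    # weak=30
```

For the question's data, where each value has a single =, both forms give the same result.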

Alternatively, you can use the str.extract method to pull out the digits that follow = at the end of the string:

df['digits'] = df['score'].str.extract(r'=(\d+)$', expand=False).astype(int)

Here, =(\d+)$ matches =, captures one or more digits into group 1, and then asserts the position at the end of the string.

The output in both cases is:

>>> df
   patient  treatment      score  digits
0        1          0  strong=42      42
1        2          1    weak=30      30
2        3          0    weak=12      12
3        4          1    pitt=12      12
4        6          0  strong=42      42
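One caveat, sketched under the assumption that some rows might not match the pattern: str.extract returns a missing value for those rows, and .astype(int) then fails. The 'unknown' value below is hypothetical, and pd.to_numeric with the nullable Int64 dtype is one possible way to handle it rather than the only one.

```python
import pandas as pd

# Hypothetical data: the last value has no '=<digits>' suffix.
scores = pd.Series(['strong=42', 'weak=30', 'unknown'])

# Rows that do not match the pattern come back as missing values.
extracted = scores.str.extract(r'=(\d+)$', expand=False)

# errors='coerce' maps anything non-numeric to NaN; 'Int64' is pandas'
# nullable integer dtype, so the missing entry becomes <NA> instead of
# forcing the whole column to float.
digits = pd.to_numeric(extracted, errors='coerce').astype('Int64')
print(digits)
```

If every row is guaranteed to match, as in the question's sample data, the plain .astype(int) shown above is enough.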

Answer 2

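re.search returns a match object (re.Match), not the matched string itself, which is why int() raises the TypeError. See https://docs.python.org/3.7/library/re.html#match-objects

If you want the string, you can try the following: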

re.search(r'\d+', x).group(0)
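Applied to the question's own attempt, this might look as follows. It is a sketch that assumes every score contains at least one digit; otherwise re.search returns None and .group(0) raises an AttributeError.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'patient': [1, 2, 3, 4, 6],
    'treatment': [0, 1, 0, 1, 0],
    'score': ['strong=42', 'weak=30', 'weak=12', 'pitt=12', 'strong=42'],
})

# .group(0) returns the matched text, which int() can convert.
# Assumes every score contains digits; re.search returns None otherwise.
df = df.assign(digits=[int(re.search(r'\d+', x).group(0)) for x in df['score']])
print(df)
```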

Extracting digits after certain characters in a pandas DataFrame