My DataFrame is:
import pandas as pd
df = pd.DataFrame({"id": [1,2,3,4,5],
"text": ["This is a ratio of 13.4/10","Favorate rate of this id is 11/9","It may not be a good looking person. But he is vary popular (15/10)","Ratio is 12/10","very popular 17/10"],
"name":["Joe","Adam","Sara","Jose","Bob"]})
I want to extract the numbers into two columns to get the following result:
df = pd.DataFrame({"id": [1,2,3,4,5],
"text": ["This is a ratio of 13.4/10","Favorate rate of this id is 11/9","It may not be a good looking person. But he is vary popular (15/10)","Ratio is 12/10","very popular 17/10"],
"name":["Joe","Adam","Sara","Jose","Bob"],
"rating_nominator":[13.4,11,15,12,17],
"rating_denominator":[10,9,10,10,10]})
Thanks for your help.
Answer 0 (score: 2)
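The general pattern you want to match is (some number)/(other number). Matching floats is not trivial, and there are plenty of answers on SO that address it, so you can reuse one here.
A fairly robust expression, adapted from this question, is ([+-]?(?:[0-9]*[.])?[0-9]+). You can use it together with Series.str.extract and an f-string: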
fpr = r'([+-]?(?:[0-9]*[.])?[0-9]+)'
res = df.text.str.extract(fr'{fpr}\/{fpr}').astype(float)
0 1
0 13.4 10.0
1 11.0 9.0
2 15.0 10.0
3 12.0 10.0
4 17.0 10.0
To assign it to your DataFrame:
df[['rating_nominator', 'rating_denominator']] = res
id text name rating_nominator rating_denominator
0 1 This is a ratio of 13.4/10 Joe 13.4 10.0
1 2 Favorate rate of this id is 11/9 Adam 11.0 9.0
2 3 It may not be a good looking person. But he is... Sara 15.0 10.0
3 4 Ratio is 12/10 Jose 12.0 10.0
4 5 very popular 17/10 Bob 17.0 10.0
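As a variation on the assignment step above, Series.str.extract uses the names of named capture groups as column names, so the result can be joined without listing the target columns by hand. This is only a sketch: the two-row sample frame is made up for illustration, and the float pattern is the one above with its outer capturing parentheses dropped so that the named groups are the only captures.

```python
import pandas as pd

# Hypothetical two-row sample mirroring the question's "text" column.
df = pd.DataFrame({"text": ["This is a ratio of 13.4/10", "very popular 17/10"]})

# Same float pattern as above, without the outer capture group.
fpr = r'[+-]?(?:[0-9]*[.])?[0-9]+'

# Named groups become the column names of the extracted frame.
pattern = fr'(?P<rating_nominator>{fpr})/(?P<rating_denominator>{fpr})'
res = df['text'].str.extract(pattern).astype(float)

# Join on the index to append both columns to the original frame.
df = df.join(res)
print(df)
```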
Answer 1 (score: 2)
You can use
df[['rating_nominator', 'rating_denominator']] = df['text'].str.extract(r'(-?\d+(?:\.\d+)?)/(-?\d+(?:\.\d+)?)').astype(float)
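The regular expression (-?\d+(?:\.\d+)?)/(-?\d+(?:\.\d+)?) captures an integer or a float as the numerator and as the denominator.
(Edit: the regular expression in this answer covers more cases. I made some assumptions, for example that you won't find a unary + sign in your numbers.)
Demo: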
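To make the difference between the two patterns concrete, here is a small comparison using only the standard re module. The sample strings and the helper name extract are made up for illustration; they are not from the question.

```python
import re

# Float sub-patterns from the two answers.
FLOAT_A = r'([+-]?(?:[0-9]*[.])?[0-9]+)'   # allows a unary + and a leading-dot float
FLOAT_B = r'(-?\d+(?:\.\d+)?)'             # assumes no unary + and a digit before the dot

pat_a = re.compile(fr'{FLOAT_A}/{FLOAT_A}')
pat_b = re.compile(fr'{FLOAT_B}/{FLOAT_B}')

def extract(pattern, text):
    """Return the two captured strings of the first match, or None."""
    m = pattern.search(text)
    return m.groups() if m else None

# Hypothetical inputs chosen to hit the edge cases.
samples = ["13.4/10", "+3/4", ".5/2", "-12.24/-13.5"]
results = {s: (extract(pat_a, s), extract(pat_b, s)) for s in samples}

for text, (a, b) in results.items():
    print(f"{text!r}: A={a} B={b}")
```

The two patterns agree on ordinary inputs. On "+3/4" the second pattern drops the sign, and on ".5/2" it captures 5 instead of .5, which silently changes the value, so the stricter assumptions are only safe if such inputs cannot occur.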
>>> df
id text
0 1 foo 14.12/10.123 bar
1 2 10/12
2 3 13.4/14.5
3 4 -12.24/-13.5
4 5 1/-1.2
>>>
>>> df[['rating_nominator', 'rating_denominator']] = df['text'].str.extract(r'(-?\d+(?:\.\d+)?)/(-?\d+(?:\.\d+)?)').astype(float)
>>> df
id text rating_nominator rating_denominator
0 1 foo 14.12/10.123 bar 14.12 10.123
1 2 10/12 10.00 12.000
2 3 13.4/14.5 13.40 14.500
3 4 -12.24/-13.5 -12.24 -13.500
4 5 1/-1.2 1.00 -1.200
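The demo only contains rows that match. As a sketch of what happens otherwise: str.extract returns NaN in both columns for a row with no fraction, and astype(float) keeps those NaN values, so unmatched rows can be found afterwards. The three-row frame and the name matched are made up for illustration.

```python
import pandas as pd

# Hypothetical sample: the middle row contains no "number/number" fraction.
df = pd.DataFrame({"text": ["Ratio is 12/10", "no rating here", "very popular 17/10"]})

pattern = r'(-?\d+(?:\.\d+)?)/(-?\d+(?:\.\d+)?)'
res = df['text'].str.extract(pattern).astype(float)

# True where both numerator and denominator were found.
matched = res.notna().all(axis=1)

df[['rating_nominator', 'rating_denominator']] = res
print(df[matched])
```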