Question

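I have the following CSV file with sample data:

Now I want to replace the values of the "SIFT" and "PolyPhen" columns with the data inside the parentheses in those columns. So for row 1 the SIFT value would become 0.82, and for row 2 it would become 0.85. I would also like the text part that goes with the parentheses (tolerated/deleterious) to go into a new column named "SIFT_prediction".

This is what I have tried so far: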

import pandas as pd
import re

testfile = 'test_sift_columns.csv'
df = pd.read_csv(testfile)  
df['SIFT'].re.search(r'\((.*?)\)',s).group(1)

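This code is meant to grab everything inside the parentheses in the SIFT column, but it does not replace anything. I probably need a for loop to read and replace each row, but I don't know how to do that correctly. I'm also not sure whether a regular expression is needed at all with pandas. Maybe there is a smarter way to solve my problem.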

Answer 1

Use Series.str.extract:

df = pd.DataFrame({'SIFT':['tol(0.82)','tol(0.85)','tol(1.42)'],
                   'PolyPhen':['beg(0)','beg(0)','beg(0)']})

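# Group 1 captures the label before the parentheses, group 2 the value inside them
pat = r'(.*?)\((.*?)\)'
# Each capture group becomes one column, so the result can be assigned to two columns at once
df[['SIFT_prediction','SIFT']] = df['SIFT'].str.extract(pat)
df[['PolyPhen_prediction','PolyPhen']] = df['PolyPhen'].str.extract(pat)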

print(df)
  SIFT_prediction  SIFT PolyPhen_prediction PolyPhen
0             tol  0.82                 beg        0
1             tol  0.85                 beg        0
2             tol  1.42                 beg        0
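
One thing to keep in mind: str.extract returns strings, so the SIFT and PolyPhen columns still hold text after the assignment above. If you need real numbers, a conversion step can be added. The following is only a sketch under assumptions: the sample values (including the "deleterious" and "probably_damaging" labels) are made up for illustration and are not taken from the asker's file.

```python
import pandas as pd

# Hypothetical sample data, not the asker's actual CSV contents
df = pd.DataFrame({'SIFT': ['tolerated(0.82)', 'tolerated(0.85)', 'deleterious(0.01)'],
                   'PolyPhen': ['benign(0)', 'benign(0.003)', 'probably_damaging(0.98)']})

pat = r'(.*?)\((.*?)\)'
for col in ['SIFT', 'PolyPhen']:
    # extract() returns a DataFrame with one column per capture group (0 and 1)
    extracted = df[col].str.extract(pat)
    df[col + '_prediction'] = extracted[0]
    # Convert the text inside the parentheses to float
    df[col] = extracted[1].astype(float)

print(df.dtypes)
```

The loop only runs over column names, not over rows, so the work per column is still vectorized.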

Alternative:

df[['SIFT_prediction','SIFT']] = df['SIFT'].str.rstrip(')').str.split('(', expand=True)
df[['PolyPhen_prediction','PolyPhen']] = df['PolyPhen'].str.rstrip(')').str.split('(', expand=True)
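
For reference, here is the split-based variant as a self-contained sketch on the same sample values as above. Passing n=1 and converting with astype(float) are my additions, not part of the lines above: n=1 limits the split to the first "(" so the result always has exactly two columns, and the conversion turns the extracted text into numbers.

```python
import pandas as pd

df = pd.DataFrame({'SIFT': ['tol(0.82)', 'tol(0.85)', 'tol(1.42)']})

# Drop the trailing ')' and split once on the first '('
parts = df['SIFT'].str.rstrip(')').str.split('(', n=1, expand=True)

df['SIFT_prediction'] = parts[0]          # text before the parenthesis
df['SIFT'] = parts[1].astype(float)       # value that was inside the parentheses

print(df)
```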

Answer 2

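You can do something like replacing every character that is not a digit or a dot with an empty string to get the float value, and conversely removing every non-letter character to get the prediction label.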

import pandas as pd

df = pd.DataFrame({'ID': [1,2,3,4], 'SIFT': ['tolerated(0.82)', 'tolerated(0.85)', 'tolerated(0.25)', 'dedicated(0.5)']})
df['SIFT_formatted'] = df.SIFT.str.replace('[^0-9.]', '', regex=True).astype(float)
df['SIFT_prediction'] = df.SIFT.str.replace('[^a-zA-Z]', '', regex=True)
df

This gives you:

    ID  SIFT            SIFT_formatted  SIFT_prediction
0   1   tolerated(0.82) 0.82             tolerated
1   2   tolerated(0.85) 0.85             tolerated
2   3   tolerated(0.25) 0.25             tolerated
3   4   dedicated(0.5)  0.50             dedicated
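
A caveat with the character-class approach: '[^a-zA-Z]' also removes underscores and digits that belong to the label itself. The sketch below uses made-up labels (I am assuming labels such as "probably_damaging" can occur in your data, which may not be the case) to show the difference from a pattern that only removes the parenthesized part.

```python
import pandas as pd

# Hypothetical values chosen to show the difference between the two patterns
s = pd.Series(['probably_damaging(0.98)', 'benign(0)'])

# Keeps letters only, so the underscore inside the label is lost
label_stripped = s.str.replace('[^a-zA-Z]', '', regex=True)

# Removes only the trailing "(...)" part, leaving the label untouched
label_anchored = s.str.replace(r'\(.*\)$', '', regex=True)

print(label_stripped.tolist())
print(label_anchored.tolist())
```

If your labels consist of plain letters only, both patterns give the same result.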

Replace values in a CSV column with the value inside the parentheses of the same column using Python pandas
