Question

Is there a way to split the string in a pandas DataFrame column

coordinates(gDNA)
chr10:g.89711916T>A

into tab-separated fields

chr\tstart\tref\talt

chr10\t89711916\tT\tA

in pandas?

So far I have tried

df[['chr','others']] = df['coordinates(gDNA)'].str.split(':',expand=True)

which extracts the first part, but I am not sure how to handle the rest.

Answer 1

Use:

df[['chr','start', 'alt']] = df['coordinates(gDNA)'].str.split(r':g\.|>',expand=True)
df[['start','ref']] = df['start'].str.extract(r'(\d+)(\D+)')
print(df)
     coordinates(gDNA)    chr     start alt ref
0  chr10:g.89711916T>A  chr10  89711916   A   T
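Since the question asks for tab-separated output, here is a minimal self-contained sketch of the same two-step approach that ends by writing the fields out with `to_csv(sep="\t")`. The intermediate names `parts`, `pos`, `fields` and `tsv` are only illustrative, and the sample frame is rebuilt from the single row shown in the question.

```python
import pandas as pd

# Sample data taken from the question.
df = pd.DataFrame({"coordinates(gDNA)": ["chr10:g.89711916T>A"]})

# Step 1: split on ":g." or ">" (the dot is escaped so it matches literally).
parts = df["coordinates(gDNA)"].str.split(r":g\.|>", expand=True)
parts.columns = ["chr", "start_ref", "alt"]

# Step 2: separate the digits (position) from the trailing reference base.
pos = parts["start_ref"].str.extract(r"(\d+)(\D+)")
pos.columns = ["start", "ref"]

# Reassemble in the requested order: chr, start, ref, alt.
fields = pd.concat([parts["chr"], pos, parts["alt"]], axis=1)

# With no path given, to_csv returns the text instead of writing a file.
tsv = fields.to_csv(sep="\t", index=False)
print(tsv)
```

Passing a file name as the first argument of `to_csv` would write the same content to disk instead.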

Answer 2

Try this:

df[['chr','start','ref','alt']] = df['coordinates(gDNA)'].str.extract(r'(\w+).*?(\d+)(\w+).*?(\w+)')
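The pattern above is fairly permissive, so it may also match strings that are not in the expected format. A stricter variant could look like the sketch below. It assumes every value is a single-base substitution written as `chrN:g.<position><ref>><alt>` with bases drawn from A, C, G, T, which the question does not confirm. The second row is made-up input added only to show what happens on a non-match.

```python
import pandas as pd

df = pd.DataFrame(
    {"coordinates(gDNA)": ["chr10:g.89711916T>A", "not a variant"]}
)

# Named groups become the column names of the extracted frame.
# Assumption: ref and alt are each exactly one of A, C, G, T.
pattern = r"^(?P<chr>\w+):g\.(?P<start>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
parsed = df["coordinates(gDNA)"].str.extract(pattern)

# Rows that do not match are returned as missing values, so use a
# nullable integer type for the position.
parsed["start"] = pd.to_numeric(parsed["start"]).astype("Int64")
print(parsed)
```

Rows that fail to match come back as missing values in every column, which makes malformed input easy to find with `parsed["chr"].isna()`.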

Answer 3

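import pandas as pd

df = pd.DataFrame(
    columns=['coordinates(gDNA)'],
    data=[['chr10:g.89711916T>A']]
)

def parser(x):
    # Split 'chr10:g.89711916T>A' into 'chr10' and '89711916T>A'
    ch, x = x.split(':g.')
    # The last three characters are ref, '>' and alt; the rest is the position
    start = int(x[:-3])
    ref = x[-3]
    alt = x[-1]
    return dict(chr=ch, start=start, ref=ref, alt=alt)

# Build a new frame from the parsed dicts, reusing the original index
pd.DataFrame([*map(parser, df['coordinates(gDNA)'])], df.index)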


  alt    chr ref     start
0   A  chr10   T  89711916
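The alphabetical column order in the output above probably comes from an older pandas or Python version; recent versions keep the dict's insertion order. To get `chr, start, ref, alt` regardless, and to attach the parsed fields to the original frame, something like the following sketch should work. The names `parsed` and `result` are illustrative, and the slicing still assumes single-base `ref` and `alt` values.

```python
import pandas as pd

df = pd.DataFrame({"coordinates(gDNA)": ["chr10:g.89711916T>A"]})

def parser(x):
    # 'chr10:g.89711916T>A' -> 'chr10' and '89711916T>A'
    ch, rest = x.split(":g.")
    # Assumes the value ends in '<ref>><alt>' with one character each.
    return {"chr": ch, "start": int(rest[:-3]), "ref": rest[-3], "alt": rest[-1]}

parsed = pd.DataFrame(
    [parser(s) for s in df["coordinates(gDNA)"]], index=df.index
)

# Select the columns explicitly so the order does not depend on the
# pandas version, then join them onto the original frame by index.
result = df.join(parsed[["chr", "start", "ref", "alt"]])
print(result)
```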

Split a string on multiple delimiters
