Remove the characters after the whitespace in every row of a Pandas column

Date: 2016-05-12 04:16:07

Tags: python pandas

I have the following data frame:

import pandas as pd
import io
temp=u"""probe,genes,sample1
1415777_at,Pnliprp1 +OX(M6),20
1415777_at,Pllk +C6,20
1415884_at,Cela3b,47"""
df = pd.read_csv(io.StringIO(temp))
df

It looks like this:

Out[23]:
        probe             genes  sample1
0  1415777_at  Pnliprp1 +OX(M6)       20
1  1415777_at          Pllk +C6       20
2  1415884_at            Cela3b       47

What I want to do is remove every character after the whitespace in the genes column, in every row, so that it looks like this:

        probe             genes  sample1
0  1415777_at           Pnliprp1      20
1  1415777_at               Pllk      20
2  1415884_at             Cela3b      47

How can I achieve this?

4 answers:

Answer 0 (score: 5)

I would use the vectorized str methods:

>>> df["genes"] = df["genes"].str.split().str[0]
>>> df
        probe     genes  sample1
0  1415777_at  Pnliprp1       20
1  1415777_at      Pllk       20
2  1415884_at    Cela3b       47

Answer 1 (score: 3)

Use split():

>>> df['genes'] = df['genes'].map(lambda x: x.split()[0])
>>> df
        probe     genes  sample1
0  1415777_at  Pnliprp1       20
1  1415777_at      Pllk       20
2  1415884_at    Cela3b       47

Answer 2 (score: 2)

You can use str.extract with a regular expression to capture the first group before the whitespace:

In [26]: df['genes'].str.extract(r'(\w*)\s*', expand=False)
Out[26]:
0    Pnliprp1
1        Pllk
2      Cela3b
Name: genes, dtype: object


df['genes'] = df['genes'].str.extract(r'(\w*)\s*', expand=False)
In [29]: df
Out[29]:
        probe     genes  sample1
0  1415777_at  Pnliprp1       20
1  1415777_at      Pllk       20
2  1415884_at    Cela3b       47
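
One caveat worth illustrating: `\w*` only matches word characters, so a gene symbol that contains a hyphen or another punctuation mark would be cut short before the whitespace is reached. The sketch below uses a hypothetical hyphenated symbol (`H2-Ab1`, not part of the original data) and compares the pattern above with `^(\S+)`, which is an alternative suggested here as an assumption rather than part of the original answer.

```python
import pandas as pd

# Hypothetical gene symbols; the hyphenated one is not in the original data.
genes = pd.Series(["Pnliprp1 +OX(M6)", "H2-Ab1 +C6", "Cela3b"])

# \w* stops at the first non-word character, so the hyphen cuts the name short.
word_only = genes.str.extract(r"(\w*)\s*", expand=False)

# ^(\S+) captures everything up to the first whitespace character.
non_space = genes.str.extract(r"^(\S+)", expand=False)

print(word_only.tolist())
print(non_space.tolist())
```

For the data in the question both patterns give the same result; they only differ once a symbol contains non-word characters.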

Timing:

In [35]: %timeit df["genes"].str.split().str[0]
1000 loops, best of 3: 586 us per loop

In [36]: %timeit df['genes'].map(lambda x: x.split()[0])
10000 loops, best of 3: 153 us per loop

In [37]: %timeit df['genes'].str.extract(r'(\w*)\s*', expand=False)
1000 loops, best of 3: 173 us per loop

Answer 3 (score: 2)

The fastest solution is to use a list comprehension with the Series constructor:

print(pd.Series([ x.split()[0] for x in df['genes'].tolist() ]))
0    Pnliprp1
1        Pllk
2      Cela3b
dtype: object

Timings with len(df)=3k:

df = pd.concat([df]*1000).reset_index(drop=True)

In [21]: %timeit pd.Series([ x.split()[0] for x in df['genes'].tolist() ])
1000 loops, best of 3: 946 µs per loop

In [22]: %timeit df['genes'].map(lambda x: x.split()[0])
1000 loops, best of 3: 1.27 ms per loop

In [23]: %timeit df['genes'].str.extract(r'(\w*)\s*', expand=False)
The slowest run took 4.31 times longer than the fastest. This could mean that an intermediate result is being cached 
100 loops, best of 3: 5.08 ms per loop

In [24]: %timeit df["genes"].str.split().str[0]
100 loops, best of 3: 2.52 ms per loop
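
The comparison can also be reproduced outside IPython. The following is only a rough sketch using the standard `timeit` module; the `candidates` dictionary and its labels are names made up for this example, and the absolute numbers will depend on the machine and the pandas version.

```python
import io
import timeit

import pandas as pd

temp = u"""probe,genes,sample1
1415777_at,Pnliprp1 +OX(M6),20
1415777_at,Pllk +C6,20
1415884_at,Cela3b,47"""
df = pd.read_csv(io.StringIO(temp))
# Repeat the three rows 1000 times to get len(df) == 3000.
df = pd.concat([df] * 1000).reset_index(drop=True)

# Hypothetical labels mapped to the four approaches from the answers.
candidates = {
    "list comprehension": lambda: pd.Series([x.split()[0] for x in df["genes"].tolist()]),
    "map + split": lambda: df["genes"].map(lambda x: x.split()[0]),
    "str.extract": lambda: df["genes"].str.extract(r"(\w*)\s*", expand=False),
    "str.split + str[0]": lambda: df["genes"].str.split().str[0],
}

reference = candidates["str.split + str[0]"]()
for name, func in candidates.items():
    # All approaches should agree on this data before timings are compared.
    assert func().tolist() == reference.tolist()
    seconds = timeit.timeit(func, number=20) / 20
    print(f"{name}: {seconds * 1e3:.3f} ms per call")
```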

Explanation:

split()[0] is faster, but it fails if the genes column contains NaN values.

I think DSM's solution is safer, because it works very well with NaN.
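
To make the NaN point concrete, here is a small sketch on a hypothetical column with one missing value (the original data has none). It shows the vectorized str accessor passing NaN through, the plain `split()` inside `map` raising, and `na_action="ignore"` as one possible way to keep `map` usable; that last option is an addition here, not something the answers above proposed.

```python
import numpy as np
import pandas as pd

# Hypothetical column with a missing value; the original data has none.
genes = pd.Series(["Pnliprp1 +OX(M6)", np.nan, "Cela3b"])

# The vectorized string accessor propagates the missing value.
safe = genes.str.split().str[0]

# Plain Python split() raises on NaN because NaN is a float.
try:
    genes.map(lambda x: x.split()[0])
    map_failed = False
except AttributeError:
    map_failed = True

# na_action="ignore" makes map skip missing values instead of calling the function.
guarded = genes.map(lambda x: x.split()[0], na_action="ignore")

print(safe)
print(map_failed)
print(guarded)
```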