I have the following data frame:
import pandas as pd
import io
temp=u"""probe,genes,sample1
1415777_at,Pnliprp1 +OX(M6),20
1415777_at,Pllk +C6,20
1415884_at,Cela3b,47"""
df = pd.read_csv(io.StringIO(temp))
df
It looks like this:
Out[23]:
probe genes sample1
0 1415777_at Pnliprp1 +OX(M6) 20
1 1415777_at Pllk +C6 20
2 1415884_at Cela3b 47
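What I want to do is remove, in every row of the genes column, every character from the first whitespace onward, so that it looks like this: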
probe genes sample1
0 1415777_at Pnliprp1 20
1 1415777_at Pllk 20
2 1415884_at Cela3b 47
How can I do that?
Answer 0 (score: 5)
I would use the vectorized str functions:
>>> df["genes"] = df["genes"].str.split().str[0]
>>> df
probe genes sample1
0 1415777_at Pnliprp1 20
1 1415777_at Pllk 20
2 1415884_at Cela3b 47
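As a self-contained variation on the approach above, here is a minimal sketch that writes the result to a new column instead of overwriting genes. The column name first_token and the use of n=1 are illustrative choices, not part of the original answer; n=1 only tells str.split to stop after the first split, since nothing past the first token is needed.

```python
import io

import pandas as pd

csv_text = """probe,genes,sample1
1415777_at,Pnliprp1 +OX(M6),20
1415777_at,Pllk +C6,20
1415884_at,Cela3b,47"""
df = pd.read_csv(io.StringIO(csv_text))

# split() with no pattern splits on any run of whitespace;
# n=1 limits the work to a single split per row.
# "first_token" is a hypothetical column name used for illustration.
df["first_token"] = df["genes"].str.split(n=1).str[0]

print(df[["genes", "first_token"]])
```

Keeping the original column around can make it easier to check the result before replacing genes.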
Answer 1 (score: 3)
Use split():
>>> df['genes'] = df['genes'].map(lambda x: x.split()[0])
>>> df
probe genes sample1
0 1415777_at Pnliprp1 20
1 1415777_at Pllk 20
2 1415884_at Cela3b 47
Answer 2 (score: 2)
You can use str.extract to capture the first group before the whitespace:
In [26]: df['genes'].str.extract('(\w*)\s*', expand=False)
Out[26]:
0 Pnliprp1
1 Pllk
2 Cela3b
Name: genes, dtype: object
df['genes'] = df['genes'].str.extract('(\w*)\s*', expand=False)
In [29]: df
Out[29]:
probe genes sample1
0 1415777_at Pnliprp1 20
1 1415777_at Pllk 20
2 1415884_at Cela3b 47
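One thing worth checking with the pattern above: \w matches only letters, digits and underscores, so a label containing another character before the first space would be cut short. The sketch below uses a made-up hyphenated label (H2-Ab1 is an assumption for illustration, it is not in the question's data) and compares it with a pattern that captures everything up to the first whitespace.

```python
import pandas as pd

# Hypothetical labels; the hyphenated one is not from the original data.
genes = pd.Series(["Pnliprp1 +OX(M6)", "H2-Ab1 +C6", "Cela3b"])

# Pattern from the answer: stops at the first non-word character.
word_chars = genes.str.extract(r"(\w*)\s*", expand=False)

# Alternative: capture the leading run of non-whitespace characters.
non_space = genes.str.extract(r"^(\S+)", expand=False)

print(word_chars.tolist())
print(non_space.tolist())
```

For the data in the question both patterns give the same result; the difference only shows up when a name contains punctuation.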
Timing:
In [35]: %timeit df["genes"].str.split().str[0]
1000 loops, best of 3: 586 us per loop
In [36]: %timeit df['genes'].map(lambda x: x.split()[0])
10000 loops, best of 3: 153 us per loop
In [37]: %timeit df['genes'].str.extract('(\w*)\s*', expand=False)
1000 loops, best of 3: 173 us per loop
Answer 3 (score: 2)
The fastest solution is to use a list comprehension with the Series constructor:
print(pd.Series([x.split()[0] for x in df['genes'].tolist()]))
0 Pnliprp1
1 Pllk
2 Cela3b
dtype: object
Timings for len(df)=3k:
df = pd.concat([df]*1000).reset_index(drop=True)
In [21]: %timeit pd.Series([ x.split()[0] for x in df['genes'].tolist() ])
1000 loops, best of 3: 946 µs per loop
In [22]: %timeit df['genes'].map(lambda x: x.split()[0])
1000 loops, best of 3: 1.27 ms per loop
In [23]: %timeit df['genes'].str.extract('(\w*)\s*', expand=False)
The slowest run took 4.31 times longer than the fastest. This could mean that an intermediate result is being cached
100 loops, best of 3: 5.08 ms per loop
In [24]: %timeit df["genes"].str.split().str[0]
100 loops, best of 3: 2.52 ms per loop
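The timings above come from IPython's %timeit. Outside IPython, a rough equivalent can be sketched with the standard timeit module. The names big and candidates, and the repeat counts, are assumptions chosen for illustration; absolute numbers will differ by machine and pandas version, so no expected output is shown.

```python
import io
import timeit

import pandas as pd

csv_text = """probe,genes,sample1
1415777_at,Pnliprp1 +OX(M6),20
1415777_at,Pllk +C6,20
1415884_at,Cela3b,47"""
df = pd.read_csv(io.StringIO(csv_text))

# Repeat the 3 rows 1000 times to get 3000 rows, as in the timings above.
big = pd.concat([df] * 1000).reset_index(drop=True)

candidates = {
    "list comprehension": lambda: pd.Series(
        [x.split()[0] for x in big["genes"].tolist()]
    ),
    "map + lambda": lambda: big["genes"].map(lambda x: x.split()[0]),
    "str.extract": lambda: big["genes"].str.extract(r"(\w*)\s*", expand=False),
    "str.split": lambda: big["genes"].str.split().str[0],
}

for name, func in candidates.items():
    # Best of 3 runs, each averaging over 10 calls.
    best = min(timeit.repeat(func, number=10, repeat=3)) / 10
    print(f"{name}: {best * 1e3:.3f} ms per call")
```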
Caveat:
split()[0] is faster, but it fails if the genes column contains NaN values.
I think DSM's solution is the safer choice, because it works very well with NaN.
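To make the NaN point concrete, here is a small sketch on a made-up Series with one missing value. It compares the vectorized str accessor with a plain split()[0] inside map, and shows one possible guard (an isinstance check, which is an illustrative choice rather than something from the answers above).

```python
import numpy as np
import pandas as pd

# Hypothetical data with a missing value in the middle.
genes = pd.Series(["Pnliprp1 +OX(M6)", np.nan, "Cela3b"])

# The .str accessor skips missing values and leaves them missing.
vectorized = genes.str.split().str[0]

# A plain split() call fails on the missing value, which is not a string.
try:
    genes.map(lambda x: x.split()[0])
    map_failed = False
except AttributeError:
    map_failed = True

# One way to guard the per-element version: only split actual strings.
guarded = genes.map(lambda x: x.split()[0] if isinstance(x, str) else x)

print(vectorized)
print("map raised AttributeError:", map_failed)
print(guarded)
```

The guard keeps the per-element approach usable, at the cost of an extra check on every row.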