Given the following DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A' : ['a', 'b','c', 'd'],
'B' : ['Y>`abcd', 'abcd','efgh', 'Y>`efgh']
})
df
   A        B
0  a  Y>`abcd
1  b     abcd
2  c     efgh
3  d  Y>`efgh
I would like to split column B on '>`' into 2 columns (C and D) so that my data frame looks like this:
   A  C     D
0  a  Y  abcd
1  b     abcd
2  c     efgh
3  d  Y  efgh
Thanks in advance!
Answer 0 (score: 2)
Running str.split followed by an apply that returns a pd.Series creates the new columns:
>>> df.B.str.split('>').apply(
lambda l: pd.Series({'C': l[0], 'D': l[1][1: ]}) if len(l) == 2 else \
pd.Series({'C': '', 'D': l[0]}))
   C     D
0  Y  abcd
1     abcd
2     efgh
3  Y  efgh
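To see what the lambda receives, it may help to look at the intermediate result of str.split on its own. This is a minimal sketch on the sample data from the question; the name `pieces` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a', 'b', 'c', 'd'],
    'B': ['Y>`abcd', 'abcd', 'efgh', 'Y>`efgh'],
})

# Splitting on '>' alone yields one list per row. The backtick stays
# attached to the second piece, which is why the lambda slices l[1][1:].
pieces = df['B'].str.split('>')
print(pieces.tolist())
# [['Y', '`abcd'], ['abcd'], ['efgh'], ['Y', '`efgh']]
```

Rows without the delimiter give a one-element list, which the len(l) == 2 check routes to the second branch.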
So you can concat this to the DataFrame and del the original column:
df = pd.concat([df, df.B.str.split('>').apply(
lambda l: pd.Series({'C': l[0], 'D': l[1][1: ]}) if len(l) == 2 else \
pd.Series({'C': '', 'D': l[0]}))],
axis=1)
del df['B']
>>> df
   A  C     D
0  a  Y  abcd
1  b     abcd
2  c     efgh
3  d  Y  efgh
Answer 1 (score: 2)
You can use str.extract with fillna, and finally remove column B with drop:
df[['C','D']] = df['B'].str.extract('(.*)>`(.*)', expand=True)
df['D'] = df['D'].fillna(df['B'])
df['C'] = df['C'].fillna('')
df = df.drop('B', axis=1)
print(df)
   A  C     D
0  a  Y  abcd
1  b     abcd
2  c     efgh
3  d  Y  efgh
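The two fillna calls make more sense once the raw result of str.extract is visible. This is a small sketch on the question's sample data; the variable name `parts` is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a', 'b', 'c', 'd'],
    'B': ['Y>`abcd', 'abcd', 'efgh', 'Y>`efgh'],
})

# Rows that do not contain the delimiter do not match the pattern,
# so both capture groups come back as missing values for them.
parts = df['B'].str.extract('(.*)>`(.*)', expand=True)
print(parts)

# True marks the rows where the pattern did not match at all.
print(parts.isna().all(axis=1).tolist())
```

For those unmatched rows, D is then filled from the original column B and C is filled with an empty string.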
The next solution uses str.split together with a mask and numpy.where:
df[['C','D']] = df['B'].str.split('>`', expand=True)
mask = pd.notnull(df['D'])
df['D'] = df['D'].fillna(df['C'])
df['C'] = np.where(mask, df['C'], '')
df = df.drop('B', axis=1)
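The mask is needed because str.split with expand=True puts the whole string into the first column when the delimiter is missing. Here is a small sketch of that intermediate state on the question's sample data (`parts` is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a', 'b', 'c', 'd'],
    'B': ['Y>`abcd', 'abcd', 'efgh', 'Y>`efgh'],
})

parts = df['B'].str.split('>`', expand=True)
parts.columns = ['C', 'D']

# For rows without the delimiter, C holds the full value and D is missing.
print(parts)

# The mask records which rows really had two pieces, before D is filled.
mask = parts['D'].notna()
print(mask.tolist())
```

Computing the mask before the fillna matters: once D has been filled from C, the information about which rows were actually split is gone.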
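Timings:
On a large DataFrame (4k rows) the extract solution (b) is about 100 times faster than the concat/apply solution (a), while the str.split one-liner with apply (c) is only about 1.5 times faster than a.
len(df)=4: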
In [438]: %timeit a(df)
100 loops, best of 3: 2.96 ms per loop
In [439]: %timeit b(df1)
1000 loops, best of 3: 1.86 ms per loop
In [440]: %timeit c(df2)
The slowest run took 4.44 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 1.89 ms per loop
In [441]: %timeit d(df3)
The slowest run took 4.62 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 1.82 ms per loop
len(df)=4k:
In [443]: %timeit a(df)
1 loops, best of 3: 799 ms per loop
In [444]: %timeit b(df1)
The slowest run took 4.19 times longer than the fastest. This could mean that an intermediate result is being cached
100 loops, best of 3: 7.37 ms per loop
In [445]: %timeit c(df2)
1 loops, best of 3: 552 ms per loop
In [446]: %timeit d(df3)
100 loops, best of 3: 9.55 ms per loop
Code for the timings:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'A' : ['a', 'b','c', 'd'],
    'B' : ['Y>`abcd', 'abcd','efgh', 'Y>`efgh']
})
#for test 4k
df = pd.concat([df]*1000).reset_index(drop=True)
# b, c and d add columns to the frame they receive, so each gets its own copy
df1,df2,df3 = df.copy(),df.copy(),df.copy()
def b(df):
    # str.extract, then fill the rows that did not match
    df[['C','D']] = df['B'].str.extract('(.*)>`(.*)', expand=True)
    df['D'] = df['D'].fillna(df['B'])
    df['C'] = df['C'].fillna('')
    df = df.drop('B', axis=1)
    return df
def a(df):
    # str.split plus a row-wise apply that builds a pd.Series per row
    df = pd.concat([df, df.B.str.split('>').apply(
        lambda l: pd.Series({'C': l[0], 'D': l[1][1: ]}) if len(l) == 2 else \
        pd.Series({'C': '', 'D': l[0]}))], axis=1)
    del df['B']
    return df
def c(df):
    # one-liner: left-pad each split list with '' to length 2
    df[['C','D']] = df['B'].str.split('>`').apply(lambda x: pd.Series(['']*(2-len(x)) + x))
    df = df.drop('B', axis=1)
    return df
def d(df):
    # str.split with expand=True, then a mask and numpy.where
    df[['C','D']] = df['B'].str.split('>`', expand=True)
    mask = pd.notnull(df['D'])
    df['D'] = df['D'].fillna(df['C'])
    df['C'] = np.where(mask, df['C'], '')
    df = df.drop('B', axis=1)
    return df
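The %timeit calls are IPython magics. Outside IPython, a rough equivalent can be sketched with the standard timeit module. The function names extract_version and split_version are hypothetical stand-ins for b and d, and each works on a copy so repeated runs start from the same input. Absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({
    'A': ['a', 'b', 'c', 'd'],
    'B': ['Y>`abcd', 'abcd', 'efgh', 'Y>`efgh'],
})
# 4 rows repeated 1000 times gives 4000 rows, as in the 4k test.
big = pd.concat([base] * 1000).reset_index(drop=True)


def extract_version(df):
    # Same steps as the str.extract solution, applied to a copy.
    df = df.copy()
    df[['C', 'D']] = df['B'].str.extract('(.*)>`(.*)', expand=True)
    df['D'] = df['D'].fillna(df['B'])
    df['C'] = df['C'].fillna('')
    return df.drop('B', axis=1)


def split_version(df):
    # Same steps as the str.split + mask solution, applied to a copy.
    df = df.copy()
    df[['C', 'D']] = df['B'].str.split('>`', expand=True)
    mask = df['D'].notna()
    df['D'] = df['D'].fillna(df['C'])
    df['C'] = np.where(mask, df['C'], '')
    return df.drop('B', axis=1)


for func in (extract_version, split_version):
    # Average over 10 calls; the copy is included in the measured time.
    seconds = timeit.timeit(lambda: func(big), number=10) / 10
    print(f'{func.__name__}: {seconds * 1000:.2f} ms per call')
```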
Answer 2 (score: 1)
I would use a one-liner:
df['B'].str.split('>`').apply(lambda x: pd.Series(['']*(2-len(x)) + x))
#   0     1
#0  Y  abcd
#1     abcd
#2     efgh
#3  Y  efgh
Answer 3 (score: 0)
The simplest and most memory-efficient way is:
df[['C', 'D']] = df.B.str.split('>`', expand=True)
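Note that for rows without the delimiter, this call leaves the whole value in C and a missing value in D, so on its own it does not match the layout requested in the question. One possible alternative, not taken from the answers here, is str.rpartition, which returns empty strings in front when the separator is absent. This is a hedged sketch on the sample data, with `parts` as an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a', 'b', 'c', 'd'],
    'B': ['Y>`abcd', 'abcd', 'efgh', 'Y>`efgh'],
})

# rpartition always yields three columns: the text before the last
# separator, the separator itself, and the text after it. When the
# separator is absent, the first two are empty strings and the third
# is the whole value.
parts = df['B'].str.rpartition('>`')
df['C'] = parts[0]
df['D'] = parts[2]
df = df.drop('B', axis=1)
print(df)
```

This avoids the fillna and mask steps, at the cost of building one extra column that is thrown away.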