Pandas split column

Time: 2016-03-17 05:23:12

Tags: python-3.x pandas split

Given the following data frame:

import pandas as pd
import numpy as np
df = pd.DataFrame({
       'A' : ['a', 'b','c', 'd'],
       'B' : ['Y>`abcd', 'abcd','efgh', 'Y>`efgh']
    })
df

    A   B
0   a   Y>`abcd
1   b   abcd
2   c   efgh
3   d   Y>`efgh

I'd like to split column B on '>`' into two columns (C and D), so that my data frame looks like this:

        A   C  D
    0   a   Y  abcd
    1   b      abcd
    2   c      efgh
    3   d   Y  efgh

Thanks in advance!

4 answers:

Answer 0: (score: 2)

Doing a str.split followed by an apply that returns a pd.Series will create the new columns:

>>> df.B.str.split('>').apply(
    lambda l: pd.Series({'C': l[0], 'D': l[1][1: ]}) if len(l) == 2 else \
        pd.Series({'C': '', 'D': l[0]}))
    C   D
0   Y   abcd
1       abcd
2       efgh
3   Y   efgh

So you can concat this onto the DataFrame and del the original column:

>>> df = pd.concat([df, df.B.str.split('>').apply(
    lambda l: pd.Series({'C': l[0], 'D': l[1][1: ]}) if len(l) == 2 else \
        pd.Series({'C': '', 'D': l[0]}))], axis=1)
>>> del df['B']
>>> df
   A  C     D
0  a  Y  abcd
1  b     abcd
2  c     efgh
3  d  Y  efgh
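The same per-row logic can be sketched without building a pd.Series for every row, which is likely where most of the time goes in the apply version. This is only an illustration: split_value is a hypothetical helper name, and it splits on the full '>`' delimiter rather than on '>' followed by a slice.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a', 'b', 'c', 'd'],
    'B': ['Y>`abcd', 'abcd', 'efgh', 'Y>`efgh'],
})

def split_value(value, sep='>`'):
    """Return (C, D) for one cell; C is '' when sep is absent (hypothetical helper)."""
    head, found, tail = value.partition(sep)
    return (head, tail) if found else ('', head)

# Build both new columns in one pass, reusing the original index.
new_cols = pd.DataFrame(
    [split_value(v) for v in df['B']],
    columns=['C', 'D'],
    index=df.index,
)
result = pd.concat([df.drop(columns='B'), new_cols], axis=1)
print(result)
```

Passing index=df.index keeps the new columns aligned with the original rows even when the index is not the default 0..n-1.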


Answer 1: (score: 2)

You can use str.extract together with fillna, and finally remove column B with drop:

df[['C','D']] = df['B'].str.extract('(.*)>`(.*)', expand=True)
df['D'] = df['D'].fillna(df['B'])
df['C'] = df['C'].fillna('')
df = df.drop('B', axis=1)

print(df)

   A  C     D
0  a  Y  abcd
1  b     abcd
2  c     efgh
3  d  Y  efgh
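To see why the two fillna calls are needed, it may help to look at the intermediate result of str.extract on its own. The sketch below uses a standalone Series s (a name chosen here for illustration) holding the same values as column B.

```python
import pandas as pd

s = pd.Series(['Y>`abcd', 'abcd', 'efgh', 'Y>`efgh'])

# Rows without the '>`' delimiter do not match the pattern,
# so both capture groups come back as missing values for them.
extracted = s.str.extract('(.*)>`(.*)', expand=True)
print(extracted)

# Fall back to '' for C and to the original string for D.
c = extracted[0].fillna('')
d = extracted[1].fillna(s)
print(pd.DataFrame({'C': c, 'D': d}))
```

Because the pattern has unnamed groups, the extracted columns are labelled 0 and 1; assigning them to df[['C', 'D']] as in the answer renames them in the same step.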

The next solution uses str.split with a mask and numpy.where:

df[['C','D']] =  df['B'].str.split('>`', expand=True) 
mask = pd.notnull(df['D'])
df['D'] = df['D'].fillna(df['C'])
df['C'] = np.where(mask, df['C'], '')
df = df.drop('B', axis=1) 

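Timings

On a large DataFrame the extract solution (b) is roughly 100 times faster than the apply solution (a); on a small one it is only about 1.5 times faster.

len(df)=4: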

In [438]: %timeit a(df)
100 loops, best of 3: 2.96 ms per loop

In [439]: %timeit b(df1)
1000 loops, best of 3: 1.86 ms per loop

In [440]: %timeit c(df2)
The slowest run took 4.44 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 1.89 ms per loop

In [441]: %timeit d(df3)
The slowest run took 4.62 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 1.82 ms per loop

len(df)=4k:

In [443]: %timeit a(df)
1 loops, best of 3: 799 ms per loop

In [444]: %timeit b(df1)
The slowest run took 4.19 times longer than the fastest. This could mean that an intermediate result is being cached
100 loops, best of 3: 7.37 ms per loop

In [445]: %timeit c(df2)
1 loops, best of 3: 552 ms per loop

In [446]: %timeit d(df3)
100 loops, best of 3: 9.55 ms per loop

Code used for the timings:

import pandas as pd
import numpy as np  # needed by d()
df = pd.DataFrame({
       'A' : ['a', 'b','c', 'd'],
       'B' : ['Y>`abcd', 'abcd','efgh', 'Y>`efgh']
    })
#for test 4k    
df = pd.concat([df]*1000).reset_index(drop=True)
df1,df2,df3 = df.copy(),df.copy(),df.copy()

def b(df):
    df[['C','D']] = df['B'].str.extract('(.*)>`(.*)', expand=True)
    df['D'] = df['D'].fillna(df['B'])
    df['C'] = df['C'].fillna('')
    df = df.drop('B', axis=1)
    return df

def a(df):
    df = pd.concat([df, df.B.str.split('>').apply(
    lambda l: pd.Series({'C': l[0], 'D': l[1][1: ]}) if len(l) == 2 else \
        pd.Series({'C': '', 'D': l[0]}))], axis=1)
    del df['B']
    return df

def c(df):
    df[['C','D']] = df['B'].str.split('>`').apply(lambda x: pd.Series(['']*(2-len(x)) + x))
    df = df.drop('B', axis=1)    
    return df   

def d(df):
    df[['C','D']] =  df['B'].str.split('>`', expand=True) 
    mask = pd.notnull(df['D'])
    df['D'] = df['D'].fillna(df['C'])
    df['C'] = np.where(mask, df['C'], '')
    df = df.drop('B', axis=1) 
    return df  
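The %timeit lines only work inside IPython. Outside it, a rough equivalent can be sketched with the standard timeit module. The function below repeats the steps of d() on a copy of its input; split_where, base and big are names made up for this sketch, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({
    'A': ['a', 'b', 'c', 'd'],
    'B': ['Y>`abcd', 'abcd', 'efgh', 'Y>`efgh'],
})
big = pd.concat([base] * 1000).reset_index(drop=True)  # 4k rows

def split_where(df):
    """Same steps as d(), run on a copy so the input is left untouched."""
    df = df.copy()
    df[['C', 'D']] = df['B'].str.split('>`', expand=True)
    mask = df['D'].notnull()
    df['D'] = df['D'].fillna(df['C'])
    df['C'] = np.where(mask, df['C'], '')
    return df.drop('B', axis=1)

# Best of 3 repeats, 10 calls each; report time per call.
times = timeit.repeat(lambda: split_where(big), repeat=3, number=10)
per_call = min(times) / 10
print(f'{per_call * 1e3:.2f} ms per call')
```

Copying inside the function matters because the column assignment would otherwise add C and D to the caller's frame on the first call, which is presumably why the benchmark prepares df1, df2 and df3 as separate copies.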


Answer 2: (score: 1)

I would use a one-liner:

df['B'].str.split('>`').apply(lambda x: pd.Series(['']*(2-len(x)) + x))

#   0     1
#0  Y  abcd
#1     abcd
#2     efgh
#3  Y  efgh
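The core of the one-liner is the expression ['']*(2-len(x)) + x, which left-pads the split result with empty strings so that every row ends up with two items. A plain-Python sketch of just that step, with pad_left as a hypothetical name:

```python
def pad_left(parts, width=2, fill=''):
    """Left-pad a list with `fill` so it has at least `width` items."""
    # A zero or negative repeat count yields an empty list, so lists
    # that are already long enough are returned unchanged.
    return [fill] * (width - len(parts)) + parts

for value in ['Y>`abcd', 'abcd']:
    parts = value.split('>`')
    print(parts, '->', pad_left(parts))
# ['Y', 'abcd'] -> ['Y', 'abcd']
# ['abcd'] -> ['', 'abcd']
```

The resulting columns are labelled 0 and 1; assigning the result to df[['C', 'D']], as function c in the timing code does, gives them their final names.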

Answer 3: (score: 0)

The simplest and most memory-efficient way is:

df[['C', 'D']] = df.B.str.split('>`', expand=True)
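One caveat worth checking on the sample data: for rows without the delimiter, a plain split puts the text in the first column and leaves the second one missing, which is not the layout the question asks for. A possible alternative, offered here as a sketch rather than as part of the answer, is str.rpartition, which places unsplit text in the last part.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a', 'b', 'c', 'd'],
    'B': ['Y>`abcd', 'abcd', 'efgh', 'Y>`efgh'],
})

# Plain split: rows without the delimiter keep their text in column 0
# and get a missing value in column 1.
plain = df['B'].str.split('>`', expand=True)
print(plain)

# rpartition returns (before, separator, after); when the separator is
# absent, the first two parts are empty and the text lands in the last.
parts = df['B'].str.rpartition('>`')
result = df[['A']].assign(C=parts[0], D=parts[2])
print(result)
```

Note that rpartition splits at the last occurrence of the delimiter, so values containing '>`' more than once would behave differently from a plain split.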