Apply a custom function to an existing column to output multiple columns

Time: 2018-02-03 23:33:10

Tags: python pandas

Here is my starting df:

import numpy as np
import pandas as pd

df = pd.DataFrame(['alpha', 'beta'], columns = ['text'])
df
    text
0   alpha
1   beta

This is the final result I want:

    text    first           second          third
0   alpha   alpha-first     alpha-second    alpha-third
1   beta    beta-first      beta-second     beta-third

I wrote a custom function parse(), which works fine on its own:

def parse(text):
    return [text + '-first', text + '-second', text + '-third']

Now I try to apply parse() to the initial df, and this is where the errors occur:

1) If I try the following:

df = df.reindex(columns = list(df.columns) + ['first', 'second', 'third']) # Create empty columns    
df[['first', 'second', 'third']] = df.text.apply(parse)

I get:

ValueError: Must have equal len keys and value when setting with an ndarray

2) A slightly different version:

df = df.reindex(columns = list(df.columns) + ['first', 'second', 'third']).astype(object) # Create empty columns of "object" type    
df[['first', 'second', 'third']] = df.text.apply(parse)

I get:

ValueError: shape mismatch: value array of shape (2,) could not be broadcast 
to indexing result of shape (3,2)

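Where did I go wrong?

Edit

I should clarify that parse() itself is a more complex function in the real problem I am trying to solve. (It takes a paragraph, finds 3 specific kinds of strings in it, and outputs those strings as a list of length 3.) In my code above, I used a somewhat arbitrary, simple definition of parse() as a stand-in, to avoid getting bogged down in details unrelated to the two errors I am getting.

4 answers: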

Answer 0: (score: 2)

There is no need for apply:

import pandas as pd

df = pd.DataFrame(['alpha', 'beta'], columns = ['text'])

for i in ['first', 'second', 'third']:
    df[i] = df.text + '-' + i

#     text       first       second       third
# 0  alpha  alpha-first  alpha-second  alpha-third
# 1   beta   beta-first   beta-second   beta-third

In general, the order of preference for the kind of processing you choose for a computation should be:

  1. Vectorized computation, as above.
  2. pd.Series.apply
  3. pd.DataFrame.apply
  4. pd.DataFrame.iterrows
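
When parse() cannot be vectorized, as in the real problem described in the question, the second tier (pd.Series.apply) still works if the result is expanded explicitly. The following is only a sketch: the parse() here is a stand-in I am assuming for the real parser, and the names parsed, expanded and result are mine. It also shows why the original assignment fails: apply returns a single Series of lists with shape (2,), not a two-row, three-column table.

```python
import pandas as pd

def parse(text):
    # Stand-in for the real, more complex parser: returns a list of length 3.
    return [text + '-first', text + '-second', text + '-third']

df = pd.DataFrame(['alpha', 'beta'], columns=['text'])

# Series.apply gives ONE column whose cells are lists, so its shape is (2,).
parsed = df['text'].apply(parse)

# Turn the lists into real columns, keeping the original row index.
new_cols = ['first', 'second', 'third']
expanded = pd.DataFrame(parsed.tolist(), index=df.index, columns=new_cols)

# Attach the new columns to the original frame by index.
result = df.join(expanded)
print(result)
```

Building the frame from parsed.tolist() avoids calling pd.Series on every row, which tends to be slower on larger inputs.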

Answer 1: (score: 1)

This can be done in several ways:

Option 1:

def f(s):
    return pd.DataFrame(np.repeat(s, 3).values.reshape(len(s), -1),
                        columns=['first','second','third']) \
             .apply(lambda c: c+'-'+c.name)


In [183]: df[['first','second','third']] = f(df.text)

In [184]: df
Out[184]:
    text        first        second        third
0  alpha  alpha-first  alpha-second  alpha-third
1   beta   beta-first   beta-second   beta-third
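
To make the intermediate steps inside f easier to follow, here is a small sketch of what the repeat and reshape produce before the column names are appended. The variable names repeated and grid are mine and are not part of the answer above.

```python
import numpy as np
import pandas as pd

s = pd.Series(['alpha', 'beta'], name='text')

# Repeat every value three times, one copy per target column.
repeated = np.repeat(s.to_numpy(), 3)

# Reshape into one row per original value and three columns.
grid = repeated.reshape(len(s), -1)
print(grid.shape)
print(grid)
```

Each row of grid holds three copies of the same text, so appending '-' plus the column name to each column yields the desired values.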

Answer 2: (score: 1)

Here is a one-liner using pd.DataFrame.assign:

df.assign(**{x: df['text']+'-'+x for x in ['first', 'second', 'third']})

#     text        first        second        third
# 0  alpha  alpha-first  alpha-second  alpha-third
# 1   beta   beta-first   beta-second   beta-third
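
One thing to keep in mind with this approach: assign returns a new DataFrame and leaves the original untouched, so the result has to be bound to a name. A minimal sketch, where suffixes and result are names I am introducing for illustration:

```python
import pandas as pd

df = pd.DataFrame(['alpha', 'beta'], columns=['text'])
suffixes = ['first', 'second', 'third']

# assign builds a new frame; df itself is not modified.
result = df.assign(**{x: df['text'] + '-' + x for x in suffixes})

print(list(df.columns))
print(list(result.columns))
```

Writing df = df.assign(...) instead would overwrite the original frame with the extended one.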

Answer 3: (score: 0)

Check this out:

istr.ignore(std::numeric_limits<std::streamsize>::max());
max = istr.gcount();