Question

Given a dataframe df (in real life it has 1000+ rows). The elements of ColB are lists of lists.

  ColA    ColB
0  'A'    [['a','b','c'],['d','e','f']]
1  'B'    [['f','g','h'],['i','j','k']]
2  'A'    [['l','m','n'],['o','p','q']]
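
The frame shown above can be reproduced with a short snippet. This is a minimal sketch, assuming pandas is imported as pd and that ColB holds real Python lists rather than their string representation:

```python
import pandas as pd

# Example frame from the question: ColB holds a list of two three-element lists
df = pd.DataFrame({
    'ColA': ['A', 'B', 'A'],
    'ColB': [
        [['a', 'b', 'c'], ['d', 'e', 'f']],
        [['f', 'g', 'h'], ['i', 'j', 'k']],
        [['l', 'm', 'n'], ['o', 'p', 'q']],
    ],
})

print(df)
```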

How can I efficiently create a string column ColC from the elements of the different columns, as shown below:

      ColC
'A>+a b:c,+d e:f'
'B>+f g:h,+i j:k'
'A>+l m:n,+o p:q'

I tried using df.apply on the rows, inspired by this:

df['ColC'] = df.apply(lambda x:'%s>' % (x['ColA']),axis=1)

This works for the first 2 elements of the string. I am having a hard time with the rest.

Answer 1

Something like this?

df['ColC']  = df.ColA + '>+' + df.ColB.str[0].str[0] + \
              ' ' + df.ColB.str[0].str[1] + ':' + \
              df.ColB.str[0].str[2] + ',+' + \
              df.ColB.str[1].str[0] + ' ' + \
              df.ColB.str[1].str[1] + ':' + \
              df.ColB.str[1].str[2]

Output:

  ColA                    ColB             ColC
0    A  [[a, b, c], [d, e, f]]  A>+a b:c,+d e:f
1    B  [[f, g, h], [i, j, k]]  B>+f g:h,+i j:k
2    A  [[l, m, n], [o, p, q]]  A>+l m:n,+o p:q

Timings

df = pd.concat([df] * 333)

Wen's method

%%timeit
df[['t1','t2']] = df['ColB'].apply(pd.Series).applymap(lambda x: ('{} {}:{}'.format(x[0],x[1],x[2])))
df.ColA + '>+' + df.t1 + ',+' + df.t2

1 loop, best of 3: 363 ms per loop

miradulo's method

%%timeit
df.apply(lambda r: '{}>+{} {}:{},+{} {}:{}'.format(*flatten(r)), axis=1)

10 loops, best of 3: 74.9 ms per loop

ScottBoston's method

%%timeit
df.ColA + '>+' + df.ColB.str[0].str[0] + \
' ' + df.ColB.str[0].str[1] + ':' + \
df.ColB.str[0].str[2] + ',+' + \
df.ColB.str[1].str[0] + ' ' + \
df.ColB.str[1].str[1] + ':' + \
df.ColB.str[1].str[2]

100 loops, best of 3: 12.4 ms per loop
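
The %%timeit cells above require IPython. A comparable measurement can be sketched in plain Python with the standard timeit module. This is only an illustration: the helper names str_accessor_version and apply_version are hypothetical, the frame is rebuilt from the question's example, and the numbers will depend on the machine and the pandas version, so none are quoted here.

```python
import timeit

import pandas as pd

df = pd.DataFrame({
    'ColA': ['A', 'B', 'A'],
    'ColB': [
        [['a', 'b', 'c'], ['d', 'e', 'f']],
        [['f', 'g', 'h'], ['i', 'j', 'k']],
        [['l', 'm', 'n'], ['o', 'p', 'q']],
    ],
})
# Enlarge the frame the same way as in the timings above (999 rows)
big = pd.concat([df] * 333, ignore_index=True)


def str_accessor_version(frame):
    # Column-wise concatenation via the .str accessor (Answer 1)
    b = frame.ColB
    return (frame.ColA + '>+' + b.str[0].str[0] + ' ' + b.str[0].str[1]
            + ':' + b.str[0].str[2] + ',+' + b.str[1].str[0] + ' '
            + b.str[1].str[1] + ':' + b.str[1].str[2])


def apply_version(frame):
    # Row-wise apply with %-formatting (Answer 4)
    return frame.apply(
        lambda r: '%s>+%s %s:%s,+%s %s:%s'
        % tuple([r['ColA']] + r['ColB'][0] + r['ColB'][1]),
        axis=1,
    )


timings = {}
for name, func in [('str accessor', str_accessor_version),
                   ('apply', apply_version)]:
    # Best of 3 repeats, 10 calls each, reported as seconds per call
    best = min(timeit.repeat(lambda: func(big), number=10, repeat=3)) / 10
    timings[name] = best
    print('{}: {:.4f} s per call'.format(name, best))
```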

Answer 2

You are right to use apply

df[['t1','t2']]=df['ColB'].apply(pd.Series).applymap(lambda x : ('{} {}:{}'.format(x[0],x[1],x[2])))
df.ColA+'>+'+df.t1+',+'+df.t2
Out[648]: 
0    A>+a b:c,+d e:f
1    B>+f g:h,+i j:k
2    A>+l m:n,+o p:q

Answer 3

If we use a flatten function as follows

from collections.abc import Iterable

def flatten(l):
    for el in l:
        # Recurse into nested iterables, but treat strings and bytes as atoms
        if isinstance(el, Iterable) and not isinstance(el, (str, bytes)):
            yield from flatten(el)
        else:
            yield el

as shown in this answer, we can easily use apply to do the string formatting with the flattened elements.

>>> df.apply(lambda r:'{}>+{} {}:{},+{} {}:{}'.format(*flatten(r.values)), axis=1)
0    A>+a b:c,+d e:f
1    B>+f g:h,+i j:k
2    A>+l m:n,+o p:q
dtype: object

This hopefully generalizes well.

>>> row_formatter = lambda r: '{}>+{} {}:{},+{} {}:{}'.format(*flatten(r.values))
>>> df.apply(row_formatter, 1)
0    A>+a b:c,+d e:f
1    B>+f g:h,+i j:k
2    A>+l m:n,+o p:q
dtype: object
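
The format string above is fixed to exactly two inner lists of three elements. One way the idea might be generalized to any number of inner lists is sketched below. The helper name format_row is hypothetical and not part of the original answer, and the sketch assumes each inner list still has exactly three elements.

```python
def format_row(label, groups):
    """Build 'label>+x y:z,+x y:z,...' from any number of three-element groups."""
    # One '+x y:z' piece per inner list
    parts = ['+{} {}:{}'.format(*group) for group in groups]
    # Join the pieces with commas and prefix the label
    return '{}>{}'.format(label, ','.join(parts))


print(format_row('A', [['a', 'b', 'c'], ['d', 'e', 'f']]))  # A>+a b:c,+d e:f
print(format_row('B', [['f', 'g', 'h']]))                   # B>+f g:h
```

On the frame from the question it could be used as df.apply(lambda r: format_row(r['ColA'], r['ColB']), axis=1).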

Answer 4

Yet another answer:

df['ColC'] = df.apply(lambda x: '%s>+%s %s:%s,+%s %s:%s'% tuple([x['ColA']]+x['ColB'][0]+x['ColB'][1]),axis=1)

Answer 5

Here are my 2 cents, also using apply

Define a function that can be applied to the dataframe, and use string formatting to parse the columns

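def get_string(x):
    # Assumes ColB is stored as a string such as "[['a','b','c'],['d','e','f']]";
    # iterating over it yields characters, and only the alphanumeric ones are kept.
    col_a = x.ColA
    col_b = (ch for ch in x.ColB if ch.isalnum())
    string = '{0}>+{1} {2}:{3},+{4} {5}:{6}'.format(col_a.strip("\'"), *col_b)
    return string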

df['ColC'] = df.apply(get_string, axis=1)
df.ColC

0    A>+a b:c,+d e:f
1    B>+f g:h,+i j:k
2    A>+l m:n,+o p:q

I like this because the format is easy to modify, although using apply this way can be slow
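
Since row-wise apply can be slow, one commonly used alternative (not taken from the answers above) is a plain list comprehension over the two columns. This is a minimal sketch, assuming ColB holds real Python lists with exactly two three-element inner lists; build_colc is a hypothetical helper name. Whether it is actually faster is worth timing on your own data.

```python
import pandas as pd

df = pd.DataFrame({
    'ColA': ['A', 'B', 'A'],
    'ColB': [
        [['a', 'b', 'c'], ['d', 'e', 'f']],
        [['f', 'g', 'h'], ['i', 'j', 'k']],
        [['l', 'm', 'n'], ['o', 'p', 'q']],
    ],
})


def build_colc(label, groups):
    # Unpack the two inner lists and fill the same template as the answers above
    first, second = groups
    return '{}>+{} {}:{},+{} {}:{}'.format(label, *first, *second)


# Iterate over both columns in lockstep instead of calling apply per row
df['ColC'] = [build_colc(a, b) for a, b in zip(df['ColA'], df['ColB'])]
print(df['ColC'].tolist())
```

The format string stays in one place, so it remains as easy to modify as in the apply-based version.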

How to create a string from the elements of dataframe columns in Python?
