Question

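I have several pandas DataFrames with varying numbers of columns, typically between 50 and 100. I need to create a final column that is simply all of the columns concatenated. Basically, the string in the first row of the new column should be the concatenation of the strings in the first row of all the other columns. I wrote the loop below, but I feel there may be a better way to do this. Any ideas on how to do this?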

num_columns = df.columns.shape[0]
col_names = df.columns.values.tolist()
df.loc[:, 'merged'] = ""
for each_col_ind in range(num_columns):
    print('Concatenating', col_names[each_col_ind])
    df.loc[:, 'merged'] = df.loc[:, 'merged'] + df[col_names[each_col_ind]]

Answer 1

df = pd.DataFrame({'A': ['1', '2', '3'], 'B': ['4', '5', '6'], 'C': ['7', '8', '9']})

df['concat'] = pd.Series(df.fillna('').values.tolist()).str.join('')

This gives us:

df
Out[6]: 
   A  B  C concat
0  1  4  7    147
1  2  5  8    258
2  3  6  9    369

To select a given set of columns:

df['concat'] = pd.Series(df[['A', 'B']].fillna('').values.tolist()).str.join('')

df
Out[8]: 
   A  B  C concat
0  1  4  7     14
1  2  5  8     25
2  3  6  9     36

However, I have noticed that this approach can sometimes leave NaN values where they should not be, so here is another way:

>>> from functools import reduce
>>> cols = ['A', 'B', 'C']  # columns to concatenate (inferred from the output below)
>>> df['concat'] = df[cols].apply(lambda x: reduce(lambda a, b: a + b, x), axis=1)
>>> df
   A  B  C concat
0  1  4  7    147
1  2  5  8    258
2  3  6  9    369

It should be noted, though, that this approach is much slower:

$ python3 -m timeit 'import pandas as pd;from functools import reduce; df=pd.DataFrame({"a": ["this", "is", "a", "string"] * 5000, "b": ["this", "is", "a", "string"] * 5000});[df[["a", "b"]].apply(lambda x: reduce(lambda a, b: a + b, x)) for _ in range(10)]'
10 loops, best of 3: 451 msec per loop

versus

$ python3 -m timeit 'import pandas as pd;from functools import reduce; df=pd.DataFrame({"a": ["this", "is", "a", "string"] * 5000, "b": ["this", "is", "a", "string"] * 5000});[pd.Series(df[["a", "b"]].fillna("").values.tolist()).str.join(" ") for _ in range(10)]'
10 loops, best of 3: 98.5 msec per loop

Answer 2

A solution using sum, but the output is float, so it must be converted to int and then to str:

df['new'] = df.sum(axis=1).astype(int).astype(str)

Another solution uses apply with join, but it is the slowest:

df['new'] = df.apply(''.join, axis=1)

Finally, a very fast numpy solution: convert to a numpy array and then sum:

df['new'] = df.values.sum(axis=1)

Timings:

df = pd.DataFrame({'A': ['1', '2', '3'], 'B': ['4', '5', '6'], 'C': ['7', '8', '9']})
#[30000 rows x 3 columns]
df = pd.concat([df]*10000).reset_index(drop=True)
#print (df)

cols = list('ABC')

#not_a_robot solution
In [259]: %timeit df['concat'] = pd.Series(df[cols].fillna('').values.tolist()).str.join('')
100 loops, best of 3: 17.4 ms per loop

In [260]: %timeit df['new'] = df[cols].astype(str).apply(''.join, axis=1)
1 loop, best of 3: 386 ms per loop

In [261]: %timeit df['new1'] = df[cols].values.sum(axis=1)
100 loops, best of 3: 6.5 ms per loop

In [262]: %timeit df['new2'] = df[cols].astype(str).sum(axis=1).astype(int).astype(str)
10 loops, best of 3: 68.6 ms per loop

Edit: if the dtypes of some columns are not object (that is, not strings), cast them first with DataFrame.astype:

df['new'] = df.astype(str).values.sum(axis=1)
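
As a minimal sketch of why the cast matters, the example below builds a small frame with string, integer, and float columns. The column names and values are hypothetical, and `to_numpy(dtype=object)` is used here in place of `.values` only to make the object dtype explicit; that substitution is an assumption of this sketch, not part of the answer.

```python
import pandas as pd

# Hypothetical frame with mixed dtypes: strings, integers and floats.
df = pd.DataFrame({'A': ['x', 'y'], 'B': [1, 2], 'C': [0.5, 1.5]})

# Cast every column to str first, so that summing across each row
# concatenates strings instead of trying to add a str to a number.
as_text = df.astype(str).to_numpy(dtype=object)
df['new'] = as_text.sum(axis=1)

print(df)
```

Without the cast, adding a string to a number would raise a TypeError, which is why the edit above casts before summing.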

Answer 3

I don't have enough reputation to comment, so I am building my answer on blacksite's response.

For clarity, LunchBox commented that it failed on Python 3.7.0. It also failed for me on Python 3.6.3. Here is blacksite's original answer:

df['concat'] = pd.Series(df.fillna('').values.tolist()).str.join('')

Here is my modification for Python 3.6.3:

df['concat'] = pd.Series(df.fillna('').values.tolist()).map(lambda x: ''.join(map(str,x)))
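
One plausible explanation for the failure, offered as an assumption rather than a diagnosis: the pandas documentation for Series.str.join states that if any item in a list is not a string, the result for that row is NaN. The sketch below uses a hypothetical two-column frame with an integer column to compare the two calls.

```python
import pandas as pd

# Hypothetical frame where one column holds integers rather than strings.
df = pd.DataFrame({'A': ['a', 'b'], 'B': [1, 2]})

# One Python list per row, e.g. ['a', 1].
rows = pd.Series(df.values.tolist(), index=df.index)

# .str.join expects every list item to be a string; per the pandas docs,
# rows containing a non-string item are expected to come back as NaN.
joined_plain = rows.str.join('')

# Converting each item with str() before joining avoids the problem.
joined_cast = rows.map(lambda row: ''.join(map(str, row)))

print(joined_plain)
print(joined_cast)
```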

Answer 4

The numpy-array solutions given above worked very well for me.

However, one thing to watch out for is indexing when you get a numpy.ndarray from df.values, because the axis labels are stripped from df.values.

So, taking one of the solutions provided above (the one I use most often) as an example:

df['concat'] = pd.Series(df.fillna('').values.tolist()).str.join('')

This part:

df.fillna('').values

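does not retain the index of the original DataFrame. That is not a problem when the DataFrame has the default 0, 1, 2, ... row index, but the solution will not work when the DataFrame is indexed in any other way. You can fix this by adding the index= argument to pd.Series():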

df['concat'] = pd.Series(df.fillna('').values.tolist(), 
                         index=df.index).str.join('')

To be safe, I always add the index= argument, even when I am sure the DataFrame's row index is 0, 1, 2, ...
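
To illustrate the alignment issue, here is a small sketch on a hypothetical frame with a string row index. The column names `bad` and `concat` and the index labels are made up for the example.

```python
import pandas as pd

# Hypothetical frame whose row index is not the default 0, 1, 2, ...
df = pd.DataFrame({'A': ['1', '2'], 'B': ['4', '5']}, index=['x', 'y'])
rows = df.fillna('').values.tolist()

# Without index=, the new Series is labelled 0, 1, so assigning it to a
# frame labelled 'x', 'y' finds no matching labels and fills in NaN.
misaligned = pd.Series(rows).str.join('')

# Passing index=df.index keeps the labels, so the assignment lines up.
aligned = pd.Series(rows, index=df.index).str.join('')

df['bad'] = misaligned
df['concat'] = aligned
print(df)
```

The `bad` column ends up entirely NaN because pandas aligns on index labels during column assignment, not on position.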

Answer 5

As a solution to the issue @Gary Dorman raised in the comments,
"I would want to have a delimiter in place so when you're looking at your overall column, you can see how it's broken out."

you could use

df_tmp=df.astype(str) + ','
df_tmp.sum(axis=1).str.rstrip(',')

Before:

1.2.3.480tcp
6.6.6.680udp
7.7.7.78080tcp
8.8.8.88080tcp
9.9.9.98080tcp

After:

1.2.3.4,80,tcp
6.6.6.6,80,udp
7.7.7.7,8080,tcp
8.8.8.8,8080,tcp
9.9.9.9,8080,tcp

This looks better (like CSV :) This extra separator step was about 30% slower on my machine.
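
A possible alternative, sketched here under the assumption that a row-wise apply is fast enough for the data at hand, is to join each row with the delimiter directly. This avoids appending and then stripping a trailing comma, so a value that itself ends in a comma is left intact. The column names below are hypothetical, chosen to match the sample output above.

```python
import pandas as pd

# Hypothetical frame mirroring the sample rows shown above.
df = pd.DataFrame({
    'ip': ['1.2.3.4', '6.6.6.6'],
    'port': [80, 80],
    'proto': ['tcp', 'udp'],
})

# Cast to str, then join the values of each row with a comma.
df['merged'] = df.astype(str).apply(','.join, axis=1)

print(df['merged'].tolist())
```

Row-wise apply was the slowest option in the timings from Answer 2, so this trades speed for a simpler expression.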

Concatenate all columns in a pandas DataFrame

5 answers: