Python: Pandas: two columns holding the same values, sorted alphabetically and stored

Time: 2016-10-22 02:56:40

Tags: python pandas

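Question
"The df has two columns that are sometimes populated with the same values. We need to restate them into two new columns, but in alphabetical order."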

Context
We have a pandas df like this:

import pandas as pd
df = pd.DataFrame([{"name_A": "john", "name_B": "mac"}, {"name_A": "mac", "name_B": "john"}, {"name_A": "Trump", "name_B": "Clinton"}])

which looks like this:

name_A | name_B
john   |  mac 
mac    |  john 
Trump  |  Clinton


Desired output

name_A | name_B   | restated_A  | restated_B
john   |  mac     |  john       |  mac
mac    |  john    |  john       |  mac
Trump  |  Clinton |  Clinton    |  Trump

In words: for each row, we want the values of columns name_A and name_B sorted alphabetically and stored in restated_A and restated_B.

Tried so far
A bunch of lambdas, but couldn't get any of them to work.

Specs
Python: 3.5.2
pandas: 0.18.1

3 answers:

Answer 0 (score: 4)

As an alternative, vectorized solution, you can use numpy.minimum() and numpy.maximum():

import numpy as np
df['restated_A'] = np.minimum(df['name_A'], df['name_B'])
df['restated_B'] = np.maximum(df['name_A'], df['name_B'])
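
For reference, here is a self-contained sketch of the element-wise minimum/maximum idea. It assumes the three-row sample from the question, and it converts each column to an object-dtype NumPy array first; that conversion is an assumption made here so the comparison does not depend on how a particular pandas version stores strings. The comparison is by Unicode code point, so uppercase letters sort before lowercase ones.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name_A": ["john", "mac", "Trump"],
    "name_B": ["mac", "john", "Clinton"],
})

# Work on plain object arrays of Python strings.
a = df["name_A"].to_numpy(dtype=object)
b = df["name_B"].to_numpy(dtype=object)

# Element-wise smaller / larger string of each pair.
df["restated_A"] = np.minimum(a, b)
df["restated_B"] = np.maximum(a, b)
print(df)
```

This only generalizes to exactly two columns; with three or more columns a per-row sort is needed instead.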


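Or use the apply method, restricted to the two name columns so that any columns added earlier are not included in the sort:

df[['restated_A', 'restated_B']] = df[['name_A', 'name_B']].apply(lambda r: sorted(r), axis=1)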
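
On newer pandas releases, a lambda that returns a list from apply yields a single Series of lists unless told otherwise, so the one-liner above may not assign cleanly. The sketch below is one possible adaptation, assuming pandas 0.23 or later (where the result_type parameter of DataFrame.apply is available); it is not what the answer was written against.

```python
import pandas as pd

df = pd.DataFrame({
    "name_A": ["john", "mac", "Trump"],
    "name_B": ["mac", "john", "Clinton"],
})

# result_type="expand" turns each returned list into separate columns (0 and 1).
restated = df[["name_A", "name_B"]].apply(
    lambda r: sorted(r), axis=1, result_type="expand"
)
restated.columns = ["restated_A", "restated_B"]

# join aligns on the index, so rows stay matched to their originals.
df = df.join(restated)
print(df)
```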


Answer 1 (score: 2)

Simply convert df.values to a list, then sort the list for each row. Then reassign the elements of each pair accordingly.

>>> df = pd.DataFrame([{"name_A": "john", "name_B": "mac"}, {"name_A": "mac", "name_B": "john"}])
>>> restated_values = [sorted(pair) for pair in df.values.tolist()]
>>> restated_values
[['john', 'mac'], ['john', 'mac']]
>>> df['restated_A'] = [pair[0] for pair in restated_values]
>>> df
  name_A name_B restated_A
0   john    mac       john
1    mac   john       john
>>> df['restated_B'] = [pair[1] for pair in restated_values]
>>> df
  name_A name_B restated_A restated_B
0   john    mac       john        mac
1    mac   john       john        mac

Alternatively, you can do this with a dict and a new pandas.DataFrame object:

>>> df = pd.DataFrame([{"name_A": "john", "name_B": "mac"}, {"name_A": "mac", "name_B": "john"}])
>>> restated_values = [sorted(pair) for pair in df.values.tolist()]
>>> restated_values
[['john', 'mac'], ['john', 'mac']]
>>> new_col_rows = {'restated_A': [pair[0] for pair in restated_values], 'restated_B': [pair[1] for pair in restated_values]}
>>> new_col_rows
{'restated_A': ['john', 'john'], 'restated_B': ['mac', 'mac']}
>>> new_df = pd.DataFrame(new_col_rows)
>>> new_df
  restated_A restated_B
0       john        mac
1       john        mac
>>> df = df.join(new_df)
>>> df
  name_A name_B restated_A restated_B
0   john    mac       john        mac
1    mac   john       john        mac
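
One caveat with sorted() is that it compares strings by code point, so "Mac" sorts before "john". If a case-insensitive alphabetical order is what is wanted (an assumption; the question does not say), a key function can be passed. The mixed-case sample data below is made up for illustration.

```python
import pandas as pd

# Hypothetical mixed-case sample, not from the question.
df = pd.DataFrame({"name_A": ["john", "mac"], "name_B": ["Mac", "John"]})
pairs = df[["name_A", "name_B"]].values.tolist()

# Default ordering: uppercase letters come before lowercase ones.
default_order = [sorted(pair) for pair in pairs]

# Case-insensitive ordering: compare lowercased copies, keep original spelling.
folded_order = [sorted(pair, key=str.lower) for pair in pairs]

restated = pd.DataFrame(
    folded_order, columns=["restated_A", "restated_B"], index=df.index
)
df = df.join(restated)
print(df)
```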

Answer 2 (score: -1)

You can use NumPy's sort() method to sort "in place":

In [57]: df
Out[57]:
  name_A   name_B
0   john      mac
1    mac     john
2  Trump  Clinton

In [58]: df.values.sort(axis=1)

In [59]: df
Out[59]:
    name_A name_B
0     john    mac
1     john    mac
2  Clinton  Trump
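
Note that the in-place sort overwrites name_A and name_B rather than creating the restated columns the question asks for, and it relies on df.values handing back a writable view of the frame's data, which is probably not safe to count on across pandas versions. A non-mutating variant, sketched here under the assumption that both columns hold plain strings, sorts a copy with numpy.sort() and stores the result in new columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name_A": ["john", "mac", "Trump"],
    "name_B": ["mac", "john", "Clinton"],
})

# np.sort returns a sorted copy; the original columns are left untouched.
values = df[["name_A", "name_B"]].to_numpy(dtype=object)
sorted_values = np.sort(values, axis=1)

df["restated_A"] = sorted_values[:, 0]
df["restated_B"] = sorted_values[:, 1]
print(df)
```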

Timing against a 30K-row DF:

In [69]: %%timeit
    ...: big = pd.concat([df.copy()] * 10**4, ignore_index=True)
    ...: big.values.sort(axis=1)
    ...:
1 loop, best of 3: 2.25 s per loop

In [70]: %%timeit
    ...: big = pd.concat([df.copy()] * 10**4, ignore_index=True)
    ...: big.apply(lambda r: sorted(r), axis = 1)
    ...:
1 loop, best of 3: 15.9 s per loop

In [71]: %%timeit
    ...: big = pd.concat([df.copy()] * 10**4, ignore_index=True)
    ...: pd.DataFrame([sorted(pair) for pair in big.values.tolist()], columns=df.columns)
    ...:
1 loop, best of 3: 2.29 s per loop

Timing against a 300K-row DF:

In [73]: %%timeit
    ...: big = pd.concat([df.copy()] * 10**5, ignore_index=True)
    ...: big.values.sort(axis=1)
    ...:
1 loop, best of 3: 23 s per loop

In [74]: %%timeit
    ...: big = pd.concat([df.copy()] * 10**5, ignore_index=True)
    ...: big.apply(lambda r: sorted(r), axis = 1)
    ...:
1 loop, best of 3: 2min 39s per loop

In [75]: %%timeit
    ...: big = pd.concat([df.copy()] * 10**5, ignore_index=True)
    ...: pd.DataFrame([sorted(pair) for pair in big.values.tolist()], columns=df.columns)
    ...:
1 loop, best of 3: 23.4 s per loop
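
The %%timeit cells above also time the pd.concat call that builds the large frame. To compare only the sorting step outside IPython, something like the following sketch could be used. The function names are made up for illustration, the frame is kept small (3,000 rows) so it runs quickly, and no timings are quoted because they depend on the machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({
    "name_A": ["john", "mac", "Trump"],
    "name_B": ["mac", "john", "Clinton"],
})
# Build the test frame once, outside the timed region.
big = pd.concat([base] * 1000, ignore_index=True)
cols = ["name_A", "name_B"]


def with_numpy_sort(frame):
    # Row-wise sort of a copy of the underlying array.
    return np.sort(frame[cols].to_numpy(dtype=object), axis=1).tolist()


def with_apply(frame):
    # Row-wise Python sort through DataFrame.apply (pandas >= 0.23 assumed).
    out = frame[cols].apply(lambda r: sorted(r), axis=1, result_type="expand")
    return out.values.tolist()


def with_list_comprehension(frame):
    # Row-wise Python sort over plain lists.
    return [sorted(pair) for pair in frame[cols].values.tolist()]


for func in (with_numpy_sort, with_apply, with_list_comprehension):
    seconds = timeit.timeit(lambda: func(big), number=3)
    print(f"{func.__name__}: {seconds / 3:.4f} s per run")
```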