Does df.apply(sorted, axis=1) remove column names?

Time: 2019-05-28 22:35:08

Tags: python python-3.x pandas

I am working through the Pandas Cookbook, counting the total number of flights between cities.

import pandas as pd
import numpy as np
# import matplotlib.pyplot as plt

print('NumPy: {}'.format(np.__version__))
print('Pandas: {}'.format(pd.__version__))
print('-----')

desired_width = 320
pd.set_option('display.width', desired_width)
pd.options.display.max_rows = 50
pd.options.display.max_columns = 14
# pd.options.display.float_format = '{:,.2f}'.format

file = "e:\\packt\\data_analysis_and_exploration_with_pandas\\section07\\data\\flights.csv"
flights = pd.read_csv(file)
print(flights.head(10))
print()

# This returns the total number of rows for each group.
flights_ct = flights.groupby(['ORG_AIR', 'DEST_AIR']).size()
print(flights_ct.head(10))
print()

# Get the number of flights between Atlanta and Houston in both directions.
print(flights_ct.loc[[('ATL', 'IAH'), ('IAH', 'ATL')]])
print()

# Sort the origin and destination cities:
# flights_sort = flights.sort_values(by=['ORG_AIR', 'DEST_AIR'], axis=1)
flights_sort = flights[['ORG_AIR', 'DEST_AIR']].apply(sorted, axis=1)
print(flights_sort.head(10))
print()

# Passing just the first row.
print(sorted(flights.loc[0, ['ORG_AIR', 'DEST_AIR']]))
print()

# Once each row is independently sorted, the column names are no longer correct.
# We will rename them to something generic, then again find the total number of flights between all cities.
rename_dict = {'ORG_AIR': 'AIR1', 'DEST_AIR': 'AIR2'}
flights_sort = flights_sort.rename(columns=rename_dict)
flights_ct2 = flights_sort.groupby(['AIR1', 'AIR2']).size()
print(flights_ct2.head(10))
print()

When I reach this line of code, my output differs from the author's:

```flights_sort = flights[['ORG_AIR', 'DEST_AIR']].apply(sorted, axis=1)```

My output does not contain any column names. As a result, when I reach:

```flights_ct2 = flights_sort.groupby(['AIR1', 'AIR2']).size()```

it raises a KeyError. That makes sense, because I am trying to rename columns when no column names exist.

My question is: why did the column names disappear? All of the other output matches the author's exactly:

Connected to pydev debugger (build 191.7141.48)
NumPy: 1.16.3
Pandas: 0.24.2
-----
   MONTH  DAY  WEEKDAY AIRLINE ORG_AIR DEST_AIR  SCHED_DEP  DEP_DELAY  AIR_TIME  DIST  SCHED_ARR  ARR_DELAY  DIVERTED  CANCELLED
0      1    1        4      WN     LAX      SLC       1625       58.0      94.0   590       1905       65.0         0          0
1      1    1        4      UA     DEN      IAD        823        7.0     154.0  1452       1333      -13.0         0          0
2      1    1        4      MQ     DFW      VPS       1305       36.0      85.0   641       1453       35.0         0          0
3      1    1        4      AA     DFW      DCA       1555        7.0     126.0  1192       1935       -7.0         0          0
4      1    1        4      WN     LAX      MCI       1720       48.0     166.0  1363       2225       39.0         0          0
5      1    1        4      UA     IAH      SAN       1450        1.0     178.0  1303       1620      -14.0         0          0
6      1    1        4      AA     DFW      MSY       1250       84.0      64.0   447       1410       83.0         0          0
7      1    1        4      F9     SFO      PHX       1020       -7.0      91.0   651       1315       -6.0         0          0
8      1    1        4      AA     ORD      STL       1845       -5.0      44.0   258       1950       -5.0         0          0
9      1    1        4      UA     IAH      SJC        925        3.0     215.0  1608       1136      -14.0         0          0

ORG_AIR  DEST_AIR
ATL      ABE         31
         ABQ         16
         ABY         19
         ACY          6
         AEX         40
         AGS         83
         ALB         33
         ANC          2
         ASE          1
         ATW         10
dtype: int64

ORG_AIR  DEST_AIR
ATL      IAH         121
IAH      ATL         148
dtype: int64

*** No column names ***  Why?

0    [LAX, SLC]
1    [DEN, IAD]
2    [DFW, VPS]
3    [DCA, DFW]
4    [LAX, MCI]
5    [IAH, SAN]
6    [DFW, MSY]
7    [PHX, SFO]
8    [ORD, STL]
9    [IAH, SJC]
dtype: object

The author's output. Note that the column names are present.


2 answers:

Answer 0 (score: 1)

sorted returns a list object for each row, so apply gives back a Series of lists and the column labels are lost:

In [11]: df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])

In [12]: df.apply(sorted, axis=1)
Out[12]:
0    [1, 2]
1    [3, 4]
dtype: object

In [13]: type(df.apply(sorted, axis=1).iloc[0])
Out[13]: list

This may not have been the case in earlier versions of pandas... but it would still be bad code.

You can keep the labels by passing the columns explicitly:

In [14]: df.apply(lambda x: pd.Series(sorted(x), df.columns), axis=1)
Out[14]:
   A  B
0  1  2
1  3  4
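As a further option, and assuming pandas 0.23 or later (where `DataFrame.apply` accepts `result_type`), the lists returned by `sorted` can be expanded back into columns. This is only a sketch on a toy frame: the expanded columns come back labelled 0, 1, ..., so the original labels are reassigned by hand.

```python
import pandas as pd

# Toy frame for illustration only.
df = pd.DataFrame([[1, 2], [3, 1]], columns=["A", "B"])

# result_type="expand" turns each returned list into columns
# (labelled with integers), instead of a Series of lists.
expanded = df.apply(sorted, axis=1, result_type="expand")

# Restore the original column labels.
expanded.columns = df.columns
print(expanded)
```

This still calls `sorted` once per row, so it is not expected to be any faster than the lambda version above.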

A more efficient way is to sort the underlying numpy array:

In [21]: df = pd.DataFrame([[1, 2], [3, 1]], columns=["A", "B"])

In [22]: df
Out[22]:
   A  B
0  1  2
1  3  1

In [23]: arr = df[["A", "B"]].values

In [24]: arr.sort(axis=1)

In [25]: df[["A", "B"]] = arr

In [26]: df
Out[26]:
   A  B
0  1  2
1  1  3

As you can see, this sorts each row.
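Applied to the flights problem from the question, the same idea might look like the sketch below. The four rows are made-up stand-ins for the real `flights.csv`, and `np.sort` is used instead of the in-place `arr.sort` so that the original frame is not modified; both of those choices are assumptions for illustration, not part of the book's code.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the flights table (illustrative rows, not the real data).
flights = pd.DataFrame({
    "ORG_AIR": ["ATL", "IAH", "LAX", "ATL"],
    "DEST_AIR": ["IAH", "ATL", "SLC", "IAH"],
})

# np.sort returns a sorted copy, so the original frame is left untouched.
pairs = np.sort(flights[["ORG_AIR", "DEST_AIR"]].to_numpy(dtype=object), axis=1)

# Build a new frame with generic names, since "origin" and "destination"
# no longer describe the sorted columns.
flights_sort = pd.DataFrame(pairs, columns=["AIR1", "AIR2"], index=flights.index)

# Count flights between each pair of cities, regardless of direction.
flights_ct2 = flights_sort.groupby(["AIR1", "AIR2"]).size()
print(flights_ct2)
```

Because the result is a real DataFrame with named columns, the `groupby(['AIR1', 'AIR2'])` step no longer raises a KeyError.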

Answer 1 (score: 0)

A final note. I just applied @AndyHayden's numpy-based solution from above.

flights_sort = flights[["ORG_AIR", "DEST_AIR"]].values
flights_sort.sort(axis=1)
flights[["ORG_AIR", "DEST_AIR"]] = flights_sort

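All I can say is... wow. What an enormous performance difference. I get exactly the same correct answer, and I get it as soon as I click the mouse, whereas the pandas lambda solution also provided by @AndyHayden takes about 20 seconds to perform the sort. The dataset has more than 58,000 rows. The numpy solution returns the sorted result instantly.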
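For anyone who wants to check the difference on their own machine, here is a rough timing sketch. It uses synthetic data (randomly drawn airport codes, not the real flights file) and a much smaller row count so that it finishes quickly; actual timings will vary with hardware and pandas version.

```python
import time

import numpy as np
import pandas as pd

# Synthetic stand-in for the flights data (random codes, not the real file).
rng = np.random.default_rng(0)
codes = np.array(["ATL", "DEN", "DFW", "IAH", "LAX", "ORD", "SFO"])
n_rows = 2_000  # kept small on purpose; the real dataset has 58,000+ rows
pairs = pd.DataFrame({
    "ORG_AIR": rng.choice(codes, size=n_rows),
    "DEST_AIR": rng.choice(codes, size=n_rows),
})

# Row-by-row approach: builds one pd.Series per row.
start = time.perf_counter()
via_apply = pairs.apply(
    lambda row: pd.Series(sorted(row), index=pairs.columns), axis=1
)
apply_seconds = time.perf_counter() - start

# Vectorised approach: sort the whole 2-D array along each row at once.
start = time.perf_counter()
via_numpy = pd.DataFrame(
    np.sort(pairs.to_numpy(dtype=object), axis=1),
    index=pairs.index,
    columns=pairs.columns,
)
numpy_seconds = time.perf_counter() - start

print(f"apply: {apply_seconds:.3f}s  numpy: {numpy_seconds:.3f}s")
```

Both approaches should produce the same sorted pairs; only the time taken differs.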