Question

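Please let me know if this is a duplicate. I believe I have checked most of the similar questions, but unfortunately I have not found an answer. I am new to pandas, so apologies in advance. After a lot of merging and grouping, I have a DataFrame that looks like this: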

0 A B C D E F G H I J K L
1 x 0 1 1 2 1 3 1 2 3 3 4
2 x 1 0 0 0 0 0 0 0 0 0 0
3 y 0 4 5 1 1 2 1 3 4 5 3
4 y 1 0 0 0 0 0 0 0 0 0 0
5 z 1 0 0 0 0 0 0 0 0 0 0

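Where B has a value, the remaining columns have none, and where the remaining columns have values, B has none. The missing values are never NaN; they are always 0.0.

My expected output is: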

0 A B C D E F G H I J K L
1 x 1 1 1 2 1 3 1 2 3 3 4
2 y 1 4 5 1 1 2 1 3 4 5 3
3 z 1 0 0 0 0 0 0 0 0 0 0

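I have followed the answers to several similar questions here. I tried groupby('A').agg('sum'), this answer, and several others. The result is always the same: the returned DataFrame still has duplicates and the values are not summed. Edit: or the values are removed entirely.

An example of the DataFrame I am having trouble with: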

{'Higher managerial administrative and professional occupations': [0.0,
  2332.0,
  0.0,
  240.0,
  0.0],
 'Intermediate occupations': [0.0, 538.0, 0.0, 670.0, 0.0],
 'Lower managerial administrative and professional occupations': [0.0,
  2098.0,
  0.0,
  733.0,
  0.0],
 'Lower supervisory and technical occupations': [0.0, 166.0, 0.0, 321.0, 0.0],
 'MSOA11CD': ['E02000001',
  'E02000001 ',
  'E02000002',
  'E02000002 ',
  'E02000003'],
 'Never worked and long-term unemployed': [0.0, 225.0, 0.0, 503.0, 0.0],
 'Not classified': [0.0, 471.0, 0.0, 410.0, 0.0],
 'Routine occupations': [0.0, 168.0, 0.0, 659.0, 0.0],
 'Semi-routine occupations': [0.0, 290.0, 0.0, 964.0, 0.0],
 'Small employers and own account workers': [0.0, 416.0, 0.0, 478.0, 0.0],
 'number of crimes': [2125.0, 0.0, 517.0, 0.0, 1095.0]}

MSOA11CD is column A above, and number of crimes is column B. This DataFrame was created by merging

{'Higher managerial administrative and professional occupations': [2332.0,
  240.0,
  554.0,
  288.0,
  275.0],
 'Intermediate occupations': [538.0, 670.0, 1294.0, 847.0, 894.0],
 'Lower managerial administrative and professional occupations': [2098.0,
  733.0,
  1408.0,
  875.0,
  927.0],
 'Lower supervisory and technical occupations': [166.0,
  321.0,
  516.0,
  383.0,
  516.0],
 'MSOA11CD': ['E02000001 ',
  'E02000002 ',
  'E02000003 ',
  'E02000004 ',
  'E02000005 '],
 'Never worked and long-term unemployed': [225.0, 503.0, 656.0, 407.0, 560.0],
 'Not classified': [471.0, 410.0, 635.0, 386.0, 542.0],
 'Routine occupations': [168.0, 659.0, 752.0, 603.0, 883.0],
 'Semi-routine occupations': [290.0, 964.0, 1156.0, 714.0, 1145.0],
 'Small employers and own account workers': [416.0,
  478.0,
  741.0,
  442.0,
  583.0]}

and

{'MSOA11CD': ['E02000001', 'E02000002', 'E02000003', 'E02000004', 'E02000005'], 'number of crimes': [2125, 517, 1095, 555, 914]}

, which was created by using groupby on

{'Falls within': ['British Transport Police',
  'City of London Police',
  'Metropolitan Police Service',
  'Metropolitan Police Service',
  'Metropolitan Police Service'],
 'MSOA11CD': ['E02000001', 'E02000001', 'E02000001', 'E02000002', 'E02000003'],
 'number of crimes': [98, 1365, 662, 517, 1095]}

Ideally I would like to keep the Falls within column, but grouping by that column loses all of the numeric data. I hope this helps. Thanks.

Answer 1

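The string values in some cells may contain whitespace. You can use pandas.Series.str.strip to remove it. Below is a DataFrame in which the value in column A of row 0 contains a trailing space: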

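import pandas as pd

# Raw string so that \s is passed to the regex engine unchanged.
df = pd.read_csv('d:/sof/training/file5.csv', sep=r'\s+')
df.at[0, 'A'] = 'x '  # introduce a trailing space in row 0
df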

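Output:

Then I tried df.groupby('A').agg(sum, axis=0), and the output was:

The duplicates are still there because one of the 'x' values contains a trailing space: 'x ' and 'x' are different strings. So make sure that none of the values in column A contain whitespace. Here is the result after stripping all of it: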

df['A'] = df['A'].str.strip()  # remove leading and trailing whitespace
df = df.groupby('A').sum()     # sum the remaining columns per key
df

Output:

=====Edit====

Having seen the spaces in your sample data, look at the groups that groupby produces:

{'E02000001': Int64Index([0], dtype='int64'),
 'E02000001 ': Int64Index([1], dtype='int64'),
 'E02000002': Int64Index([2], dtype='int64'),
 'E02000002 ': Int64Index([3], dtype='int64'),
 'E02000003': Int64Index([4], dtype='int64')}

After applying pandas.Series.str.strip, groupby works as expected:

df.MSOA11CD=df.MSOA11CD.str.strip()
df.groupby('MSOA11CD').groups

Output:

{'E02000001': Int64Index([0, 1], dtype='int64'),
 'E02000002': Int64Index([2, 3], dtype='int64'),
 'E02000003': Int64Index([4], dtype='int64')}
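
The same diagnosis and fix can be tried on a small stand-in for the table in the question. This is only a sketch: the trailing spaces in the toy keys are an assumption made here to reproduce the symptom, and only columns A to D are included for brevity.

```python
import pandas as pd

# Toy data mirroring the question's table; two keys carry a trailing space
# (assumed here to reproduce the duplicate-row symptom).
df = pd.DataFrame({
    "A": ["x", "x ", "y", "y ", "z"],
    "B": [0, 1, 0, 1, 1],
    "C": [1, 0, 4, 0, 0],
    "D": [1, 0, 5, 0, 0],
})

# Flag keys that change when stripped, i.e. keys with stray whitespace.
has_padding = df["A"] != df["A"].str.strip()
print(df.loc[has_padding, "A"].map(repr).tolist())

# Normalise the key, then collapse the duplicates by summing each column.
df["A"] = df["A"].str.strip()
result = df.groupby("A", as_index=False).sum()
print(result)
```

Printing the keys through repr makes the otherwise invisible trailing space show up inside the quotes.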

Hope this helps.

Answer 2

The problem with merging the DataFrames stems from the trailing spaces in the strings in one of them:

 'MSOA11CD': ['E02000001 ',
  'E02000002 ',
  'E02000003 ',
  'E02000004 ',
  'E02000005 '],

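Note that the other DataFrames do not contain these spaces. Pandas therefore (correctly) treats the strings 'E02000001 ' and 'E02000001' as different values. To make them match, strip the whitespace from the strings: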

df1['MSOA11CD'] = df1['MSOA11CD'].str.strip()

For example,

import sys
import pandas as pd
pd.options.display.width = sys.maxsize
pd.options.display.max_columns = None


df1 = pd.DataFrame({'Higher managerial administrative and professional occupations': [2332.0,
      240.0,
      554.0,
      288.0,
      275.0],
     'Intermediate occupations': [538.0, 670.0, 1294.0, 847.0, 894.0],
     'Lower managerial administrative and professional occupations': [2098.0,
      733.0,
      1408.0,
      875.0,
      927.0],
     'Lower supervisory and technical occupations': [166.0,
      321.0,
      516.0,
      383.0,
      516.0],
     'MSOA11CD': ['E02000001 ',
      'E02000002 ',
      'E02000003 ',
      'E02000004 ',
      'E02000005 '],
     'Never worked and long-term unemployed': [225.0, 503.0, 656.0, 407.0, 560.0],
     'Not classified': [471.0, 410.0, 635.0, 386.0, 542.0],
     'Routine occupations': [168.0, 659.0, 752.0, 603.0, 883.0],
     'Semi-routine occupations': [290.0, 964.0, 1156.0, 714.0, 1145.0],
     'Small employers and own account workers': [416.0,
      478.0,
      741.0,
      442.0,
      583.0]})

df2 = pd.DataFrame({'MSOA11CD': ['E02000001', 'E02000002', 'E02000003', 'E02000004', 'E02000005'], 'number of crimes': [2125, 517, 1095, 555, 914]})

df3 = pd.DataFrame({'Falls within': ['British Transport Police',
      'City of London Police',
      'Metropolitan Police Service',
      'Metropolitan Police Service',
      'Metropolitan Police Service'],
     'MSOA11CD': ['E02000001', 'E02000001', 'E02000001', 'E02000002', 'E02000003'],
     'number of crimes': [98, 1365, 662, 517, 1095]})

df1['MSOA11CD'] = df1['MSOA11CD'].str.strip()
df = pd.merge(df1, df2, on=['MSOA11CD'])
df = pd.merge(df, df3, on=['MSOA11CD'])

print(df)

yields

   Higher managerial administrative and professional occupations  Intermediate occupations  Lower managerial administrative and professional occupations  Lower supervisory and technical occupations   MSOA11CD  Never worked and long-term unemployed  Not classified  Routine occupations  Semi-routine occupations  Small employers and own account workers  number of crimes_x                 Falls within  number of crimes_y
0                                             2332.0                                 538.0                                             2098.0                                                   166.0  E02000001                                  225.0           471.0                168.0                     290.0                                    416.0                2125     British Transport Police                  98
1                                             2332.0                                 538.0                                             2098.0                                                   166.0  E02000001                                  225.0           471.0                168.0                     290.0                                    416.0                2125        City of London Police                1365
2                                             2332.0                                 538.0                                             2098.0                                                   166.0  E02000001                                  225.0           471.0                168.0                     290.0                                    416.0                2125  Metropolitan Police Service                 662
3                                              240.0                                 670.0                                              733.0                                                   321.0  E02000002                                  503.0           410.0                659.0                     964.0                                    478.0                 517  Metropolitan Police Service                 517
4                                              554.0                                1294.0                                             1408.0                                                   516.0  E02000003                                  656.0           635.0                752.0                    1156.0                                    741.0                1095  Metropolitan Police Service                1095
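
If one row per MSOA11CD is wanted while still keeping the Falls within information, one possible approach is to aggregate the crime table before merging. The sketch below is not taken from the answers above: joining the force names with "; " is an illustrative choice, and the name per_area is hypothetical.

```python
import pandas as pd

# The per-force crime table from the question.
df3 = pd.DataFrame({
    "Falls within": ["British Transport Police", "City of London Police",
                     "Metropolitan Police Service", "Metropolitan Police Service",
                     "Metropolitan Police Service"],
    "MSOA11CD": ["E02000001", "E02000001", "E02000001", "E02000002", "E02000003"],
    "number of crimes": [98, 1365, 662, 517, 1095],
})

# One row per MSOA: sum the counts and keep the force names as one string.
per_area = df3.groupby("MSOA11CD", as_index=False).agg({
    "number of crimes": "sum",
    "Falls within": lambda names: "; ".join(names),  # illustrative separator
})
print(per_area)
```

The aggregated frame has a unique key per area, so merging it with the occupation table (after stripping its MSOA11CD values) should not multiply rows the way the three-way merge above does.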

How do I remove rows duplicated on one column in pandas while combining the values of the remaining columns?
