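Please let me know if this is a duplicate. I believe I have checked most of the similar questions, but unfortunately I have not found an answer. I am new to pandas, so apologies in advance.
After a lot of merging and grouping, I have a dataframe that looks like this: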
0 A B C D E F G H I J K L
1 x 0 1 1 2 1 3 1 2 3 3 4
2 x 1 0 0 0 0 0 0 0 0 0 0
3 y 0 4 5 1 1 2 1 3 4 5 3
4 y 1 0 0 0 0 0 0 0 0 0 0
5 z 1 0 0 0 0 0 0 0 0 0 0
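Where B has a value, the remaining columns have none, and where the remaining columns have values, B has none. The empty values are never NaN; they are always 0.0.
My expected output is: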
0 A B C D E F G H I J K L
1 x 1 1 1 2 1 3 1 2 3 3 4
2 y 1 4 5 1 1 2 1 3 4 5 3
3 z 1 0 0 0 0 0 0 0 0 0 0
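I have followed the answers to several similar questions here. I tried df.groupby('A').agg('sum'), and I also tried this and several other answers. The result is always the same: the returned dataframe still contains duplicates and the values are not summed. Edit: or the values are dropped entirely.
An example of the dataframe I am having trouble with: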
{'Higher managerial administrative and professional occupations': [0.0,
2332.0,
0.0,
240.0,
0.0],
'Intermediate occupations': [0.0, 538.0, 0.0, 670.0, 0.0],
'Lower managerial administrative and professional occupations': [0.0,
2098.0,
0.0,
733.0,
0.0],
'Lower supervisory and technical occupations': [0.0, 166.0, 0.0, 321.0, 0.0],
'MSOA11CD': ['E02000001',
'E02000001 ',
'E02000002',
'E02000002 ',
'E02000003'],
'Never worked and long-term unemployed': [0.0, 225.0, 0.0, 503.0, 0.0],
'Not classified': [0.0, 471.0, 0.0, 410.0, 0.0],
'Routine occupations': [0.0, 168.0, 0.0, 659.0, 0.0],
'Semi-routine occupations': [0.0, 290.0, 0.0, 964.0, 0.0],
'Small employers and own account workers': [0.0, 416.0, 0.0, 478.0, 0.0],
'number of crimes': [2125.0, 0.0, 517.0, 0.0, 1095.0]}
MSOA11CD is the A column above, and number of crimes is the B column.
This dataframe was created by merging
{'Higher managerial administrative and professional occupations': [2332.0,
240.0,
554.0,
288.0,
275.0],
'Intermediate occupations': [538.0, 670.0, 1294.0, 847.0, 894.0],
'Lower managerial administrative and professional occupations': [2098.0,
733.0,
1408.0,
875.0,
927.0],
'Lower supervisory and technical occupations': [166.0,
321.0,
516.0,
383.0,
516.0],
'MSOA11CD': ['E02000001 ',
'E02000002 ',
'E02000003 ',
'E02000004 ',
'E02000005 '],
'Never worked and long-term unemployed': [225.0, 503.0, 656.0, 407.0, 560.0],
'Not classified': [471.0, 410.0, 635.0, 386.0, 542.0],
'Routine occupations': [168.0, 659.0, 752.0, 603.0, 883.0],
'Semi-routine occupations': [290.0, 964.0, 1156.0, 714.0, 1145.0],
'Small employers and own account workers': [416.0,
478.0,
741.0,
442.0,
583.0]}
and
{'MSOA11CD': ['E02000001', 'E02000002', 'E02000003', 'E02000004', 'E02000005'], 'number of crimes': [2125, 517, 1095, 555, 914]}
The second of these was itself created by using groupby on
{'Falls within': ['British Transport Police',
'City of London Police',
'Metropolitan Police Service',
'Metropolitan Police Service',
'Metropolitan Police Service'],
'MSOA11CD': ['E02000001', 'E02000001', 'E02000001', 'E02000002', 'E02000003'],
'number of crimes': [98, 1365, 662, 517, 1095]}
Ideally I would like to keep the Falls within column, but grouping by that column causes all of the numeric data to be lost.
I hope this helps. Thank you.
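Answer 0: (score: 1)
The string values in your cells may contain whitespace. You can use pandas.Series.str.strip to remove it. Below is a dataframe that has a trailing space in column A of row 0: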
import pandas as pd
df = pd.read_csv('d:/sof/training/file5.csv', sep=r'\s+')  # whitespace-delimited file
df.at[0, 'A'] = 'x '  # add a trailing space to the key in row 0
df
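Output:
Then I tried df.groupby('A').sum(), and the output was:
Duplicates still appear because one of the 'x' values has a trailing space: 'x ' and 'x' are different strings. So make sure that none of the values in column A contain leading or trailing whitespace. Here is the result after stripping it: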
df['A'] = df['A'].str.strip()  # normalise the grouping key
df = df.groupby('A').sum()     # rows that now share a key are summed
df
Output:
===== Edit =====
Now that I have seen the whitespace in your data, look at how groupby splits the keys:
{'E02000001': Int64Index([0], dtype='int64'),
'E02000001 ': Int64Index([1], dtype='int64'),
'E02000002': Int64Index([2], dtype='int64'),
'E02000002 ': Int64Index([3], dtype='int64'),
'E02000003': Int64Index([4], dtype='int64')}
After applying pandas.Series.str.strip, groupby behaves as expected:
df.MSOA11CD=df.MSOA11CD.str.strip()
df.groupby('MSOA11CD').groups
Output:
{'E02000001': Int64Index([0, 1], dtype='int64'),
'E02000002': Int64Index([2, 3], dtype='int64'),
'E02000003': Int64Index([4], dtype='int64')}
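To tie this back to the expected output in the question, here is a minimal sketch that strips the key and then sums the duplicate rows. It assumes an abbreviated copy of the sample frame (only two of the numeric columns are kept for brevity), and the name collapsed is only illustrative.

```python
import pandas as pd

# Abbreviated copy of the sample frame from the question (assumption: only
# two numeric columns are kept so the example stays short).
df = pd.DataFrame({
    'MSOA11CD': ['E02000001', 'E02000001 ', 'E02000002', 'E02000002 ', 'E02000003'],
    'Intermediate occupations': [0.0, 538.0, 0.0, 670.0, 0.0],
    'number of crimes': [2125.0, 0.0, 517.0, 0.0, 1095.0],
})

# Normalise the key first, then collapse rows that share a key by summing.
df['MSOA11CD'] = df['MSOA11CD'].str.strip()
collapsed = df.groupby('MSOA11CD', as_index=False).sum()
print(collapsed)
```

Because the empty cells are 0.0 rather than NaN, a plain sum is enough to combine the two half-filled rows for each code.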
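Hope this helps.
Answer 1: (score: 1)
The problem with the merged dataframe stems from trailing whitespace in the strings of one of the input dataframes: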
'MSOA11CD': ['E02000001 ',
'E02000002 ',
'E02000003 ',
'E02000004 ',
'E02000005 '],
Note that the other dataframe does not contain these spaces. Pandas therefore (correctly) treats the strings 'E02000001 ' and 'E02000001' as different values.
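Trailing spaces are easy to miss when a frame is printed, so it can help to check for them explicitly. The following is only a sketch on a small hypothetical Series of keys; it flags the values that str.strip would change and compares the number of distinct keys before and after stripping.

```python
import pandas as pd

# Hypothetical sample of keys: one clean value and two padded ones.
keys = pd.Series(['E02000001', 'E02000001 ', 'E02000002 '], name='MSOA11CD')

# True wherever stripping changes the value, i.e. where the string has
# leading or trailing whitespace.
has_padding = keys != keys.str.strip()

# repr() makes the otherwise invisible trailing space visible.
print(keys[has_padding].map(repr).tolist())

# The distinct-key count drops once the padding is removed.
print(keys.nunique(), keys.str.strip().nunique())
```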
To combine them, strip the whitespace from the strings:
df1['MSOA11CD'] = df1['MSOA11CD'].str.strip()
For example,
import sys
import pandas as pd
pd.options.display.width = sys.maxsize
pd.options.display.max_columns = None
df1 = pd.DataFrame({'Higher managerial administrative and professional occupations': [2332.0,
240.0,
554.0,
288.0,
275.0],
'Intermediate occupations': [538.0, 670.0, 1294.0, 847.0, 894.0],
'Lower managerial administrative and professional occupations': [2098.0,
733.0,
1408.0,
875.0,
927.0],
'Lower supervisory and technical occupations': [166.0,
321.0,
516.0,
383.0,
516.0],
'MSOA11CD': ['E02000001 ',
'E02000002 ',
'E02000003 ',
'E02000004 ',
'E02000005 '],
'Never worked and long-term unemployed': [225.0, 503.0, 656.0, 407.0, 560.0],
'Not classified': [471.0, 410.0, 635.0, 386.0, 542.0],
'Routine occupations': [168.0, 659.0, 752.0, 603.0, 883.0],
'Semi-routine occupations': [290.0, 964.0, 1156.0, 714.0, 1145.0],
'Small employers and own account workers': [416.0,
478.0,
741.0,
442.0,
583.0]})
df2 = pd.DataFrame({'MSOA11CD': ['E02000001', 'E02000002', 'E02000003', 'E02000004', 'E02000005'], 'number of crimes': [2125, 517, 1095, 555, 914]})
df3 = pd.DataFrame({'Falls within': ['British Transport Police',
'City of London Police',
'Metropolitan Police Service',
'Metropolitan Police Service',
'Metropolitan Police Service'],
'MSOA11CD': ['E02000001', 'E02000001', 'E02000001', 'E02000002', 'E02000003'],
'number of crimes': [98, 1365, 662, 517, 1095]})
df1['MSOA11CD'] = df1['MSOA11CD'].str.strip()
df = pd.merge(df1, df2, on=['MSOA11CD'])
df = pd.merge(df, df3, on=['MSOA11CD'])
print(df)
yields
Higher managerial administrative and professional occupations Intermediate occupations Lower managerial administrative and professional occupations Lower supervisory and technical occupations MSOA11CD Never worked and long-term unemployed Not classified Routine occupations Semi-routine occupations Small employers and own account workers number of crimes_x Falls within number of crimes_y
0 2332.0 538.0 2098.0 166.0 E02000001 225.0 471.0 168.0 290.0 416.0 2125 British Transport Police 98
1 2332.0 538.0 2098.0 166.0 E02000001 225.0 471.0 168.0 290.0 416.0 2125 City of London Police 1365
2 2332.0 538.0 2098.0 166.0 E02000001 225.0 471.0 168.0 290.0 416.0 2125 Metropolitan Police Service 662
3 240.0 670.0 733.0 321.0 E02000002 503.0 410.0 659.0 964.0 478.0 517 Metropolitan Police Service 517
4 554.0 1294.0 1408.0 516.0 E02000003 656.0 635.0 752.0 1156.0 741.0 1095 Metropolitan Police Service 1095
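The question also asks about keeping the Falls within column. One possible approach, sketched here under assumptions, is to merge the occupation data with the per-force frame directly and add the per-area total with groupby(...).transform('sum'), which avoids the duplicated number of crimes_x and number of crimes_y columns. The frames are abbreviated copies of the ones above, and the column name total crimes is made up for illustration.

```python
import pandas as pd

# Abbreviated copies of df1 and df3 from the answer (assumption: a single
# occupation column is enough to show the idea).
df1 = pd.DataFrame({
    'MSOA11CD': ['E02000001 ', 'E02000002 ', 'E02000003 '],
    'Intermediate occupations': [538.0, 670.0, 1294.0],
})
df3 = pd.DataFrame({
    'Falls within': ['British Transport Police',
                     'City of London Police',
                     'Metropolitan Police Service',
                     'Metropolitan Police Service',
                     'Metropolitan Police Service'],
    'MSOA11CD': ['E02000001', 'E02000001', 'E02000001', 'E02000002', 'E02000003'],
    'number of crimes': [98, 1365, 662, 517, 1095],
})

# Strip the padded keys so both frames agree on the join column.
df1['MSOA11CD'] = df1['MSOA11CD'].str.strip()

# One row per (area, police force); the occupation values repeat per force.
merged = pd.merge(df1, df3, on='MSOA11CD')

# transform('sum') broadcasts each area's total back onto its rows, so the
# per-force rows are kept. 'total crimes' is a hypothetical column name.
merged['total crimes'] = (
    merged.groupby('MSOA11CD')['number of crimes'].transform('sum')
)
print(merged)
```

The totals computed this way match the values in df2 for the areas shown, so df2 is not needed in the merge.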