Is pandas groupby not grouping correctly?

Time: 2019-04-04 00:21:11

Tags: python pandas dataframe rename reindex

I have a dataset that looks like this:

              NUM   80000   80001   80002   80003   80010   80011   80013   80023
CUSTOM_SITES  NAME  CC      DD      EE      FF      GG      HH      JJ      KK
              X     0       0       0       181621  0       0       809     67
              Y     0       0       0       1885    0       0       17      0
a             Z     0       0       0       43      0       0       0       0
a             T     0       0       0       324     0       0       2       0
a             W     0       0       0       336     0       0       8       0
a             F     0       0       0       21      0       0       0       0
a             P     0       0       0       253     0       0       0       0
a             D     0       0       0       163     0       0       4       0
a             C     0       0       0       122     0       0       2       0
a             D     0       0       0       122     0       0       1       0
a             PPPP  0       0       0       61      0       0       0       0
a             NN    0       0       0       440     0       0       0       0
              EE    0       0       0       45530   0       0       166     6
E             RR    0       0       0       1726    0       0       4       0
S             KKKK  0       0       0       2398    0       0       4       0
SI            QQQ   0       0       0       286     0       0       0       0
              AAA   0       0       0       13425   0       0       13      1
              DDD   0       0       0       11566   0       0       11      0
C             WWWW  0       0       0       808     0       0       2       0
C             NNN   0       0       0       50      0       0       0       0
C             GGGG  0       0       0       633     0       0       1       0

Output of df.to_dict():

{'Unnamed: 0': {0: 'CUSTOM_SITES', 1: nan, 2: nan, 3: 'a', 4: 'a', 5: 'a', 6: 'a', 7: 'a', 8: 'a', 9: 'a', 10: 'a', 11: 'a', 12: 'a', 13: nan, 14: 'E', 15: 'S', 16: 'SI', 17: nan, 18: nan, 19: 'C', 20: 'C', 21: 'C'}, 'NUM': {0: 'NAME', 1: 'X', 2: 'Y', 3: 'Z', 4: 'T', 5: 'W', 6: 'F', 7: 'P', 8: 'D', 9: 'C', 10: 'D', 11: 'PPPP', 12: 'NN', 13: 'EE', 14: 'RR', 15: 'KKKK', 16: 'QQQ', 17: 'AAA', 18: 'DDD', 19: 'WWWW', 20: 'NNN', 21: 'GGGG'}, '80000': {0: 'CC', 1: '0', 2: '0', 3: '0', 4: '0', 5: '0', 6: '0', 7: '0', 8: '0', 9: '0', 10: '0', 11: '0', 12: '0', 13: '0', 14: '0', 15: '0', 16: '0', 17: '0', 18: '0', 19: '0', 20: '0', 21: '0'}, '80001': {0: 'DD', 1: '0', 2: '0', 3: '0', 4: '0', 5: '0', 6: '0', 7: '0', 8: '0', 9: '0', 10: '0', 11: '0', 12: '0', 13: '0', 14: '0', 15: '0', 16: '0', 17: '0', 18: '0', 19: '0', 20: '0', 21: '0'}, '80002': {0: 'EE', 1: '0', 2: '0', 3: '0', 4: '0', 5: '0', 6: '0', 7: '0', 8: '0', 9: '0', 10: '0', 11: '0', 12: '0', 13: '0', 14: '0', 15: '0', 16: '0', 17: '0', 18: '0', 19: '0', 20: '0', 21: '0'}, '80003': {0: 'FF', 1: '181621', 2: '1885', 3: '43', 4: '324', 5: '336', 6: '21', 7: '253', 8: '163', 9: '122', 10: '122', 11: '61', 12: '440', 13: '45530', 14: '1726', 15: '2398', 16: '286', 17: '13425', 18: '11566', 19: '808', 20: '50', 21: '633'}, '80010': {0: 'GG', 1: '0', 2: '0', 3: '0', 4: '0', 5: '0', 6: '0', 7: '0', 8: '0', 9: '0', 10: '0', 11: '0', 12: '0', 13: '0', 14: '0', 15: '0', 16: '0', 17: '0', 18: '0', 19: '0', 20: '0', 21: '0'}, '80011': {0: 'HH', 1: '0', 2: '0', 3: '0', 4: '0', 5: '0', 6: '0', 7: '0', 8: '0', 9: '0', 10: '0', 11: '0', 12: '0', 13: '0', 14: '0', 15: '0', 16: '0', 17: '0', 18: '0', 19: '0', 20: '0', 21: '0'}, '80013': {0: 'JJ', 1: '809', 2: '17', 3: '0', 4: '2', 5: '8', 6: '0', 7: '0', 8: '4', 9: '2', 10: '1', 11: '0', 12: '0', 13: '166', 14: '4', 15: '4', 16: '0', 17: '13', 18: '11', 19: '2', 20: '0', 21: '1'}, '80023': {0: 'KK', 1: '67', 2: '0', 3: '0', 4: '0', 5: '0', 6: '0', 7: '0', 8: '0', 9: '0', 10: '0', 
11: '0', 12: '0', 13: '6', 14: '0', 15: '0', 16: '0', 17: '1', 18: '0', 19: '0', 20: '0', 21: '0'}}

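The first step in my code is to ignore the first row, then rename the columns of df using the second row, and then group by the "CUSTOM_SITES" column. Here is the code: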

dirpath= "..."
df = pd.read_table("...")
header = df.iloc[0]
df = df[1:]
df = df.rename(columns = header)
df = df.reset_index(drop=True)
df.groupby("CUSTOM_SITES",sort=False).sum().to_csv(os.path.join(dirpath,'collapsed_sites_out.txt'), sep='\t', encoding='utf-8',quoting=0, index=True)

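The problem is that groupby does not group by CUSTOM_SITES; it gives me only a single column as output. What I expect is one collapsed row per CUSTOM_SITES value, with 80000 ... 80023 as the columns. Please help!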

1 Answer:

Answer 0: (score: 0)

Solution to the problem above:

import os
import pandas as pd

dirpath = "..."
df = pd.read_table("...")
# The first data row holds the real column names (CUSTOM_SITES, NAME, CC, ...)
header = df.iloc[0]
# Drop that row from the data
df = df[1:]
# Rename the columns using the extracted row
df = df.rename(columns=header)
# The next three steps collapse the rows into the intended sites:
# move the two label columns into the index so only value columns remain,
df = df.set_index(['CUSTOM_SITES', 'NAME'])
# convert the values, which were read as strings, to numbers,
df = df.apply(pd.to_numeric, errors='coerce')
print(df.head(5))
# then restore the labels and sum within each CUSTOM_SITES group
df = df.reset_index().groupby('CUSTOM_SITES', sort=False).sum()
# Write the DataFrame to a file; index=True keeps the row names
df.to_csv(os.path.join(dirpath, 'out.txt'), sep='\t', encoding='utf-8', quoting=0, index=True)
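
For anyone who wants to see why the conversion step matters, here is a minimal, self-contained sketch. It assumes a small made-up tab-separated sample with the same two-line header layout as the question (the real file path is elided there, so `raw`, `value_cols`, `collapsed` and `collapsed2` are illustrative names, not part of the original code). Because the second header line ends up inside the data, every column is read as text, so nothing numeric is left to sum until the values are converted. The last part shows one possible alternative, again only a sketch: passing `header=1` to the reader so the second line becomes the header and the values are parsed as numbers directly.

```python
import io

import pandas as pd

# Hypothetical sample with the same layout as the question's file
# (only a few rows and columns; not the real data file).
raw = (
    "\tNUM\t80003\t80013\t80023\n"
    "CUSTOM_SITES\tNAME\tFF\tJJ\tKK\n"
    "a\tZ\t43\t0\t0\n"
    "a\tT\t324\t2\t0\n"
    "E\tRR\t1726\t4\t0\n"
    "C\tWWWW\t808\t2\t0\n"
    "C\tNNN\t50\t0\t0\n"
)

# Read as in the question: line 1 becomes the header, so line 2
# (CUSTOM_SITES, NAME, FF, ...) lands in the data and forces every
# column to be parsed as text rather than numbers.
df = pd.read_csv(io.StringIO(raw), sep="\t")
none_numeric = not any(pd.api.types.is_numeric_dtype(t) for t in df.dtypes)

# Same idea as the answer: promote the first data row to column names,
# then convert the value columns to numbers before grouping.
header = df.iloc[0]
df = df[1:].rename(columns=header.to_dict())
value_cols = [c for c in df.columns if c not in ("CUSTOM_SITES", "NAME")]
df[value_cols] = df[value_cols].apply(pd.to_numeric, errors="coerce")
collapsed = df.groupby("CUSTOM_SITES", sort=False)[value_cols].sum()

# Possible alternative: use the second line as the header while reading,
# so the value columns are numeric from the start.
df2 = pd.read_csv(io.StringIO(raw), sep="\t", header=1)
collapsed2 = df2.groupby("CUSTOM_SITES", sort=False)[["FF", "JJ", "KK"]].sum()

print(collapsed)
```

Note that rows with a blank CUSTOM_SITES (X, Y, EE, AAA and DDD in the question's data) are read as NaN, and groupby leaves NaN keys out by default. If those rows should be kept as a group of their own, `dropna=False` can be passed to `groupby` in pandas 1.1 or later.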