I have a dataset that looks like this:
              NUM   80000  80001  80002  80003   80010  80011  80013  80023
CUSTOM_SITES  NAME  CC     DD     EE     FF      GG     HH     JJ     KK
              X     0      0      0      181621  0      0      809    67
              Y     0      0      0      1885    0      0      17     0
a             Z     0      0      0      43      0      0      0      0
a             T     0      0      0      324     0      0      2      0
a             W     0      0      0      336     0      0      8      0
a             F     0      0      0      21      0      0      0      0
a             P     0      0      0      253     0      0      0      0
a             D     0      0      0      163     0      0      4      0
a             C     0      0      0      122     0      0      2      0
a             D     0      0      0      122     0      0      1      0
a             PPPP  0      0      0      61      0      0      0      0
a             NN    0      0      0      440     0      0      0      0
              EE    0      0      0      45530   0      0      166    6
E             RR    0      0      0      1726    0      0      4      0
S             KKKK  0      0      0      2398    0      0      4      0
SI            QQQ   0      0      0      286     0      0      0      0
              AAA   0      0      0      13425   0      0      13     1
              DDD   0      0      0      11566   0      0      11     0
C             WWWW  0      0      0      808     0      0      2      0
C             NNN   0      0      0      50      0      0      0      0
C             GGGG  0      0      0      633     0      0      1      0
Output of "df.to_dict()":
{'Unnamed: 0': {0: 'CUSTOM_SITES', 1: nan, 2: nan, 3: 'a', 4: 'a', 5: 'a', 6: 'a', 7: 'a', 8: 'a', 9: 'a', 10: 'a', 11: 'a', 12: 'a', 13: nan, 14: 'E', 15: 'S', 16: 'SI', 17: nan, 18: nan, 19: 'C', 20: 'C', 21: 'C'}, 'NUM': {0: 'NAME', 1: 'X', 2: 'Y', 3: 'Z', 4: 'T', 5: 'W', 6: 'F', 7: 'P', 8: 'D', 9: 'C', 10: 'D', 11: 'PPPP', 12: 'NN', 13: 'EE', 14: 'RR', 15: 'KKKK', 16: 'QQQ', 17: 'AAA', 18: 'DDD', 19: 'WWWW', 20: 'NNN', 21: 'GGGG'}, '80000': {0: 'CC', 1: '0', 2: '0', 3: '0', 4: '0', 5: '0', 6: '0', 7: '0', 8: '0', 9: '0', 10: '0', 11: '0', 12: '0', 13: '0', 14: '0', 15: '0', 16: '0', 17: '0', 18: '0', 19: '0', 20: '0', 21: '0'}, '80001': {0: 'DD', 1: '0', 2: '0', 3: '0', 4: '0', 5: '0', 6: '0', 7: '0', 8: '0', 9: '0', 10: '0', 11: '0', 12: '0', 13: '0', 14: '0', 15: '0', 16: '0', 17: '0', 18: '0', 19: '0', 20: '0', 21: '0'}, '80002': {0: 'EE', 1: '0', 2: '0', 3: '0', 4: '0', 5: '0', 6: '0', 7: '0', 8: '0', 9: '0', 10: '0', 11: '0', 12: '0', 13: '0', 14: '0', 15: '0', 16: '0', 17: '0', 18: '0', 19: '0', 20: '0', 21: '0'}, '80003': {0: 'FF', 1: '181621', 2: '1885', 3: '43', 4: '324', 5: '336', 6: '21', 7: '253', 8: '163', 9: '122', 10: '122', 11: '61', 12: '440', 13: '45530', 14: '1726', 15: '2398', 16: '286', 17: '13425', 18: '11566', 19: '808', 20: '50', 21: '633'}, '80010': {0: 'GG', 1: '0', 2: '0', 3: '0', 4: '0', 5: '0', 6: '0', 7: '0', 8: '0', 9: '0', 10: '0', 11: '0', 12: '0', 13: '0', 14: '0', 15: '0', 16: '0', 17: '0', 18: '0', 19: '0', 20: '0', 21: '0'}, '80011': {0: 'HH', 1: '0', 2: '0', 3: '0', 4: '0', 5: '0', 6: '0', 7: '0', 8: '0', 9: '0', 10: '0', 11: '0', 12: '0', 13: '0', 14: '0', 15: '0', 16: '0', 17: '0', 18: '0', 19: '0', 20: '0', 21: '0'}, '80013': {0: 'JJ', 1: '809', 2: '17', 3: '0', 4: '2', 5: '8', 6: '0', 7: '0', 8: '4', 9: '2', 10: '1', 11: '0', 12: '0', 13: '166', 14: '4', 15: '4', 16: '0', 17: '13', 18: '11', 19: '2', 20: '0', 21: '1'}, '80023': {0: 'KK', 1: '67', 2: '0', 3: '0', 4: '0', 5: '0', 6: '0', 7: '0', 8: '0', 9: '0', 10: '0', 
11: '0', 12: '0', 13: '6', 14: '0', 15: '0', 16: '0', 17: '1', 18: '0', 19: '0', 20: '0', 21: '0'}}
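The first step in my code is to skip the first row, then rename the columns of df using the second row, and then group by the "CUSTOM_SITES" column. Here is the code: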
import os
import pandas as pd

dirpath = "..."
df = pd.read_table("...")
header = df.iloc[0]
df = df[1:]
df = df.rename(columns = header)
df = df.reset_index(drop=True)
df.groupby("CUSTOM_SITES",sort=False).sum().to_csv(os.path.join(dirpath,'collapsed_sites_out.txt'), sep='\t', encoding='utf-8',quoting=0, index=True)
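The problem is that groupby does not group by CUSTOM_SITES and gives me only a single column as output, whereas my output should have CUSTOM_SITES collapsed, with 80000 ... 80023 as columns. Please help!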
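One quick check that may help narrow this down, shown as a minimal sketch on a made-up three-column sample in the same layout (not the real file; the names raw, values, is_numeric_before and is_numeric_after are only illustrative). Because the file has two header lines, the second one is read as a data row, so each value column holds text such as 'FF' next to the numbers and is not parsed as numeric:

```python
import io
import pandas as pd

# Hypothetical miniature of the file: two header lines, then data rows.
raw = (
    "\tNUM\t80003\n"
    "CUSTOM_SITES\tNAME\tFF\n"
    "a\tZ\t43\n"
    "a\tT\t324\n"
)

df = pd.read_csv(io.StringIO(raw), sep="\t")

# The second header line ("FF") ended up inside the data, so the column
# holds strings rather than integers.
is_numeric_before = pd.api.types.is_numeric_dtype(df["80003"])
print(df["80003"].tolist(), is_numeric_before)

# Once that row is skipped, the remaining values can be converted explicitly.
values = pd.to_numeric(df["80003"].iloc[1:], errors="coerce")
is_numeric_after = pd.api.types.is_numeric_dtype(values)
print(values.tolist(), is_numeric_after)
```

If the same check on the real file shows non-numeric value columns, that would be consistent with sum() not producing the expected numeric totals.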
Answer 0 (score: 0)
Solution to the problem above:
import pandas as pd
import os
dirpath = "..."
df = pd.read_table("...")
# extract the first row of the original DataFrame (the file's second line, with the histo names)
header = df.iloc[0]
# drop that row from the data; all remaining rows and all columns are kept
df = df[1:]
# rename the columns using the extracted row
df = df.rename(columns=header)
# move the label columns into the index so that only the value columns are converted
df = df.set_index(['CUSTOM_SITES', 'NAME'])
# the values were read as strings, so convert them to numbers (anything unparseable becomes NaN)
df = df.apply(pd.to_numeric, errors='coerce')
print(df.head(5))
# collapse the rows into the intended sites
df = df.reset_index().groupby('CUSTOM_SITES', sort=False).sum()
# write the DataFrame to file; keep index=True so the row names (CUSTOM_SITES) are written
df.to_csv(os.path.join(dirpath, 'out.txt'), sep='\t', encoding='utf-8', quoting=0, index=True)
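For comparison, here is a self-contained sketch of the same idea on a small made-up sample (not the real file), assuming the input is tab-separated with two header lines as in the question. It swaps in two things that are not in the code above: header=1 in pd.read_csv, to take the column names from the second line directly, and an explicit list of value columns, so that NAME is left out of the sum. The names raw, value_cols and collapsed are only illustrative.

```python
import io
import pandas as pd

# Made-up sample in the same layout as the question: two header lines,
# then rows whose first field (CUSTOM_SITES) may be empty.
raw = (
    "\tNUM\t80003\t80013\n"
    "CUSTOM_SITES\tNAME\tFF\tJJ\n"
    "\tX\t181621\t809\n"
    "a\tZ\t43\t0\n"
    "a\tT\t324\t2\n"
    "C\tWWWW\t808\t2\n"
    "C\tNNN\t50\t0\n"
)

# header=1 uses the second line as column names and skips the first line.
df = pd.read_csv(io.StringIO(raw), sep="\t", header=1)

# Everything except the two label columns is treated as a value column.
value_cols = [c for c in df.columns if c not in ("CUSTOM_SITES", "NAME")]

# Safeguard: force the value columns to numbers; stray text becomes NaN.
df[value_cols] = df[value_cols].apply(pd.to_numeric, errors="coerce")

# Collapse the rows per site, keeping the order of first appearance.
collapsed = df.groupby("CUSTOM_SITES", sort=False)[value_cols].sum()
print(collapsed)
```

Note that rows whose CUSTOM_SITES is empty (X, Y, EE, AAA and DDD in the question's data) are dropped by groupby by default. In pandas 1.1 or later, passing dropna=False to groupby should keep them as a group of their own, if that is what is wanted.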