I have the following code:
import os
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile

fileName = input("Enter file name here (Case Sensitive) > ")
# sheet_name=None loads every sheet into a dict of DataFrames
df = pd.read_excel(fileName + '.xlsx', sheet_name=None)
xl = pd.ExcelFile(fileName + '.xlsx')
SystemCount = len(xl.sheet_names)
df1 = pd.DataFrame([])
# Sheets are expected to be named "System 1", "System 2", ...
for y in range(1, int(SystemCount) + 1):
    df = pd.read_excel(xl, 'System ' + str(y))
    df['System {0}'.format(y)] = "1"  # flag every email found on this sheet
    df1 = pd.concat([df1, df])  # DataFrame.append was removed in pandas 2.0
df1 = df1.sort_values(['Email'])
df = df1['Email'].value_counts()
# Number of sheets each email appears on
df1['Count'] = df1.groupby('Email')['Email'].transform('count')
print(df1)
This prints something like:
Email             System 1  System 2  System 3  System 4  Count
test_1_@test.com  NaN       1         NaN       NaN       1
test_2_@test.com  NaN       NaN       1         NaN       3
test_2_@test.com  1         NaN       NaN       NaN       3
test_2_@test.com  NaN       NaN       NaN       1         3
test_3_@test.com  NaN       1         NaN       NaN       1
test_4_@test.com  NaN       NaN       1         NaN       1
test_5_@test.com  1         NaN       NaN       NaN       3
test_5_@test.com  NaN       NaN       1         NaN       3
test_5_@test.com  NaN       NaN       NaN       1         3
How can I combine these rows so that each email appears only once, with all of its systems flagged? I would like the output to look like this:
                  System1  System2  System3  System4  Count
Email
test_1_@test.com      0.0      1.0      0.0      0.0      1
test_2_@test.com      1.0      0.0      1.0      1.0      3
test_3_@test.com      0.0      1.0      0.0      0.0      1
test_4_@test.com      0.0      0.0      1.0      0.0      1
test_5_@test.com      1.0      0.0      1.0      1.0      3
Answer 0 (score: 2)
If I understand correctly:
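# Convert the "1" flags to numbers where possible
df1=df1.apply(lambda x : pd.to_numeric(x,errors='ignore'))
# Map every System column to 'sum' and the trailing Count column to 'first'
d=dict(zip(df1.columns[1:],['sum']*df1.columns[1:].str.contains('System').sum()+['first']))
df1.fillna(0).groupby('Email').agg(d)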
Out[95]:
                  System1  System2  System3  System4  Count
Email
test_1_@test.com      0.0      1.0      0.0      0.0      1
test_2_@test.com      1.0      0.0      1.0      1.0      3
test_3_@test.com      0.0      1.0      0.0      0.0      1
test_4_@test.com      0.0      0.0      1.0      0.0      1
test_5_@test.com      1.0      0.0      1.0      1.0      3
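For readers who find the one-line dict construction hard to follow, here is a minimal sketch of the same idea with the aggregation mapping spelled out. The sample frame and the names `system_cols` and `agg_spec` are made up for illustration, and `errors="coerce"` is used here in place of `errors='ignore'` so that anything non-numeric becomes NaN before being filled with 0.

```python
import pandas as pd

# Hypothetical stand-in for the concatenated frame from the question:
# one row per (email, sheet) pair, with the flag stored as the string "1".
df1 = pd.DataFrame({
    "Email": ["test_1_@test.com", "test_2_@test.com",
              "test_2_@test.com", "test_2_@test.com"],
    "System1": [None, None, "1", None],
    "System2": ["1", None, None, None],
    "System3": [None, "1", None, None],
    "System4": [None, None, None, "1"],
    "Count": [1, 3, 3, 3],
})

# Pick out the flag columns by name rather than by position.
system_cols = [c for c in df1.columns if c.startswith("System")]

# Turn the "1" strings into numbers and treat missing flags as 0.
df1[system_cols] = df1[system_cols].apply(pd.to_numeric, errors="coerce").fillna(0)

# Sum the flags per email; Count is identical on every row of a group,
# so taking the first value is enough.
agg_spec = {col: "sum" for col in system_cols}
agg_spec["Count"] = "first"

result = df1.groupby("Email").agg(agg_spec)
print(result)
```

Selecting the columns by name avoids relying on `Count` being the last column, which the positional `df1.columns[1:]` approach assumes.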
Answer 1 (score: 1)
It would be easier to help if you posted code that generates the input data.
But you probably want GroupBy:
df2 = df1.groupby('Email').sum()
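One thing to watch with a plain sum: if the `Count` column already exists, it is summed too, so an email that appears three times ends up with 9 instead of 3. The sketch below builds a small made-up frame (the `rows` list is an assumption, not the asker's data, and the flags are numeric rather than the string "1") and shows one way to sum only the flag columns and recompute the count from the group sizes.

```python
import pandas as pd

# Made-up (email, system number) pairs mirroring the shape of the question's data.
rows = [
    ("test_1_@test.com", 2),
    ("test_2_@test.com", 3),
    ("test_2_@test.com", 1),
    ("test_2_@test.com", 4),
    ("test_3_@test.com", 2),
]

# One row per pair; columns that a row does not mention are filled with NaN.
df1 = pd.DataFrame([{"Email": email, f"System {n}": 1} for email, n in rows])
df1["Count"] = df1.groupby("Email")["Email"].transform("count")

# Plain sum: the flags collapse as wanted, but Count is summed as well.
summed = df1.groupby("Email").sum()

# Sum only the flag columns, then take Count from the number of rows per email.
flags = df1.drop(columns="Count").groupby("Email").sum().sort_index(axis=1)
flags["Count"] = df1.groupby("Email").size()
print(flags)
```

Missing flags come out as 0.0 because `sum` skips NaN values by default.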