I have a CSV file and am currently working with the pandas module, but I haven't found a solution to my problem yet. Below are a sample of the CSV, the problem, and the desired output.
CSV sample:
project, id, sec, code
1, 25, 50, 01
1, 25, 50, 12
1, 25, 45, 07
1, 5, 25, 03
1, 25, 20, 06
The problem:
I want to find the duplicate (id) rows given the other codes (e.g. 12, 07, and 06), but instead of dropping the duplicates I want to add their (sec) values into the code 01 row. I need to know how to set up the condition: if a code 07 row has sec less than 60, it should be excluded from the sum. I used the code below to sort by column, but `.isin` also drops id 5. In a larger file there will be other duplicated ids with similar codes.
df = df.sort_values(by=['id'], ascending=[True])
df2 = df.copy()
sort1 = df2[df2['code'].isin(['01', '07', '06', '12'])]
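For context on why id 5 disappears: `.isin` builds a boolean row mask, so any row whose code is not in the list is filtered out entirely rather than kept aside. A quick self-contained check (the inline CSV mirrors the sample above; note that `read_csv` parses the code column as integers, so the list holds ints here):

```python
import io

import pandas as pd

csv_text = """project,id,sec,code
1,25,50,01
1,25,50,12
1,25,45,07
1,5,25,03
1,25,20,06
"""

df = pd.read_csv(io.StringIO(csv_text))

# isin keeps only rows whose code matches the list; the id-5 row
# (code 03) matches nothing, so it is dropped from the result.
sort1 = df[df['code'].isin([1, 7, 6, 12])]
print(sort1['id'].unique())  # note: only id 25 remains
```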
Desired output:
project, id, sec, code
1, 5, 25, 03
1, 25, 120, 01
1, 25, 50, 12
1, 25, 45, 07
1, 25, 20, 06
I have also considered parsing the file by hand, but I'm stuck on the logic:
def edit_data(df):
    total = 0  # renamed from `sum` to avoid shadowing the built-in
    with open(df) as file:
        next(file)  # skip the header row
        for line in file:
            parts = line.split(',')
            code = float(parts[3])
            id = float(parts[1])
            sec = float(parts[2])
    return ?
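For reference, one way the loop's logic could be completed — a sketch only, where the `csv` module and the dict-based accumulation are my assumptions, not necessarily the asker's plan. It applies the rule stated above: skip code 07 rows with sec below 60, and credit each group's total to its code 01 row.

```python
import csv
from collections import defaultdict

def edit_data(path):
    totals = defaultdict(float)  # (project, id) -> summed sec
    rows = []
    with open(path, newline='') as file:
        # skipinitialspace strips the blank after each comma in the sample.
        reader = csv.reader(file, skipinitialspace=True)
        header = next(reader)
        for project, id_, sec, code in reader:
            rows.append([project, id_, float(sec), int(code)])
            # Exclude code-07 rows whose sec is under 60 from the sum.
            if not (int(code) == 7 and float(sec) < 60):
                totals[(project, id_)] += float(sec)
    # Overwrite sec on each code-01 row with its group total.
    for row in rows:
        if row[3] == 1:
            row[2] = totals[(row[0], row[1])]
    return header, rows
```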
Any help is appreciated, as I'm new to Python with about three months of experience. Thanks!
Answer 0 (score: 1)
Let's try this:
import numpy as np

df = df.sort_values('id')
# Use boolean indexing to eliminate unwanted records, then group by and sum;
# convert the result to a dataframe indexed by the group keys.
sumdf = df[~((df.code == 7) & (df.sec < 60))].groupby(['project','id'])['sec'].sum().to_frame()
# Find the first record of each group using duplicated, and again with
# boolean indexing set the sec column for those records to NaN.
df.loc[~df.duplicated(subset=['project','id']),'sec'] = np.nan
# Set the index of the original dataframe and use combine_first to replace
# those NaN with values from the summed, grouped dataframe.
df_out = df.set_index(['project','id']).combine_first(sumdf).reset_index().astype(int)
df_out
Output:
project id code sec
0 1 5 3 25
1 1 25 1 120
2 1 25 12 50
3 1 25 7 45
4 1 25 6 20
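For readers who want to try this end to end, here is an equivalent self-contained sketch that writes each group's total directly onto the group's first row instead of going through `combine_first`. The inline CSV mirrors the question's sample, and `kind='mergesort'` is used because the default sort is not guaranteed stable:

```python
import io

import pandas as pd

csv_text = """project,id,sec,code
1,25,50,01
1,25,50,12
1,25,45,07
1,5,25,03
1,25,20,06
"""

df = pd.read_csv(io.StringIO(csv_text))
df = df.sort_values('id', kind='mergesort')  # stable: preserves row order per id

# Per-group totals, skipping code-7 rows whose sec is under 60.
keep = ~((df['code'] == 7) & (df['sec'] < 60))
totals = df[keep].groupby(['project', 'id'])['sec'].sum()

# Write each group's total onto its first row (the code-01 row here).
first = ~df.duplicated(subset=['project', 'id'])
df.loc[first, 'sec'] = [
    totals[key] for key in zip(df.loc[first, 'project'], df.loc[first, 'id'])
]
print(df.reset_index(drop=True))
```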