I have 90 CSV files containing data like this:
PID, STARTED,%CPU,%MEM,COMMAND
1,Wed Sep 12 10:10:21 2018, 0.0, 0.0,init
2,Wed Sep 12 10:10:21 2018, 0.0, 0.0,kthreadd
Now I have to compare them in such a way that I check whether file2 contains rows duplicated from file1 (same PID, STARTED, %CPU, %MEM, COMMAND). If file2 has duplicate data, I pick those duplicate rows with all their values (PID, COMMAND, STARTED, %CPU, %MEM) and store them in a separate file. I have to repeat the same process across all 90 files. My code (approach) is here, please have a look:
import pandas as pd

df = pd.read_csv(r"Latest_27_02_2019.csv")
df.columns = df.columns.str.strip()  # the header has stray spaces

pidList = df['PID']
pNameList = df['COMMAND']
memList = df['%MEM']
startTimeList = df['STARTED']
After that I compare the lists one by one.
But since I have a large number of files, this takes a lot of time and many iterations. I found out that it can somehow be done in a simpler way with Python (the pandas library), but I don't know how. Could you please help me?
Answer 0 (score: 0)
Here is a solution for comparing two files:
# read file1 into df1
# your header has stray blanks, so rename the columns
df1 = pd.read_csv('file1', sep=',', header=0, names=['PID','STARTED','%CPU','%MEM','COMMAND'])
#df1 is your first file, df2 the second
df_compare = df1.merge(df2.drop_duplicates(),
                       on=['PID','STARTED','%CPU','%MEM','COMMAND'],
                       how='right', indicator=True)
print(df_compare)
# in the result you'll have a column '_merge' with 'both' or 'right_only'
# 'right_only' means the row is only in df2 and not in df1
# then you just filter:
mask = df_compare['_merge'] == 'both'
df_compare = df_compare[mask].drop(['_merge'], axis=1)
# df_compare now holds the rows repeated between df2 and df1; you could reindex if you want
print(df_compare)
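The indicator approach above can be demonstrated on two tiny in-memory frames (the sample rows below are made up for illustration, mimicking the question's data):

```python
import pandas as pd

cols = ['PID', 'STARTED', '%CPU', '%MEM', 'COMMAND']
df1 = pd.DataFrame([[1, 'Wed Sep 12 10:10:21 2018', 0.0, 0.0, 'init'],
                    [2, 'Wed Sep 12 10:10:21 2018', 0.0, 0.0, 'kthreadd']],
                   columns=cols)
df2 = pd.DataFrame([[1, 'Wed Sep 12 10:10:21 2018', 0.0, 0.0, 'init'],
                    [3, 'Wed Sep 12 10:10:25 2018', 0.1, 0.0, 'bash']],
                   columns=cols)

# right merge with indicator: rows tagged 'both' exist in both frames
df_compare = df1.merge(df2.drop_duplicates(), on=cols,
                       how='right', indicator=True)
dupes = df_compare[df_compare['_merge'] == 'both'].drop(columns='_merge')
print(dupes)  # only the 'init' row appears in both frames
```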
Or another solution (better, in my opinion):
# an inner merge directly returns the rows present in both files
df_compare = df1.merge(df2, how='inner', on=['PID','STARTED','%CPU','%MEM','COMMAND'])
print(df_compare)
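To extend this to all 90 files, one possible pattern (a sketch, assuming the files match a glob like `*.csv` and the helper names `shared_rows`/`compare_all` and the `dupes_...` output naming are my own, not from the question) is to compare each file with the next one and write the shared rows out:

```python
import glob
import pandas as pd

KEY_COLS = ['PID', 'STARTED', '%CPU', '%MEM', 'COMMAND']

def shared_rows(df1, df2):
    """Rows that appear in both frames (inner merge on all key columns)."""
    return df1.merge(df2.drop_duplicates(), on=KEY_COLS, how='inner')

def compare_all(pattern='*.csv'):
    """Compare each file with the next one; hypothetical file layout, adjust to yours."""
    files = sorted(glob.glob(pattern))
    for f1, f2 in zip(files, files[1:]):
        df1 = pd.read_csv(f1, skipinitialspace=True)
        df2 = pd.read_csv(f2, skipinitialspace=True)
        # the headers contain stray spaces, so strip them before merging
        df1.columns = df1.columns.str.strip()
        df2.columns = df2.columns.str.strip()
        shared_rows(df1, df2).to_csv(f'dupes_{f1}_vs_{f2}', index=False)
```

`skipinitialspace=True` deals with the blanks after the commas in the sample data, so values like `" 0.0"` compare equal across files.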