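I am trying to compare two CSV files (and more files later on). I have tried many approaches, using lists, DictReader and more, but none of them gave me the output I need. I want to compare all rows that have the same !Sample_title and !Sample_geo_accession values (these columns sit at different positions in the two files). I have been struggling with this for three days and cannot find a solution. I would greatly appreciate any help.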
CSV1:
!Sample_title,!Sample_geo_accession,!Sample_status,!Sample_type,!Sample_source_name_ch1
body,GSM501443,Public on july 22 2010,ribonucleic acid,FB_50_12wk
foreign,GSM501445,Public on july 22 2010,ribonucleic acid,FB_0_12wk
HJCENV,GSM501446,Public on july 22 2010,ribonucleic acid,FB_50_12wk
AsDW,GSM501444,Public on july 22 2010,ribonucleic acid,FB_0_12wk
CSV2:
!Sample_title,!Sample_type,!Sample_source_name_ch1,!Sample_geo_accession
AsDW,ribonucleic acid,FB_0,GSM501444
foreign,ribonucleic acid,FB,GSM501449
HJCENV,RNA,12wk,GSM501446
Desired output (with respect to CSV2):
Added:
{!Sample_status:{HJCENV:Public on july 22 2010,AsDW:Public on july 22 2010}} #Added columns, not rows.
Deleted:
{} #Since nothing's deleted with respect to CSV2
Changed:
{!Sample_title:AsDW,!Sample_source_name_ch1:(FB_0_12wk,FB_0),!Sample_geo_accession:GSM501444
!Sample_title:HJCENV,!Sample_type:(ribonucleic acid,RNA),!Sample_source_name_ch1:(FB_50_12wk,12wk),!Sample_geo_accession:GSM501446}
#foreign,ribonucleic acid,FB,GSM501449 doesn't come here since the !Sample_geo_accession column values didn't match.
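Edit:
The Added dictionary should, for each !Sample_title in CSV1 (where !Sample_title and !Sample_geo_accession match in CSV1 and CSV2), give any additional columns and their values (if CSV1 has more columns than CSV2).
The Deleted dictionary is similar to Added, except that it looks for deleted columns.
Changed gives the values that differ between the files, together with their headers.
So basically it should compare apples with apples (matching by header name), not apples with oranges (matching by column position).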
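Answer 0: (score: 1)
Your question is still quite unclear. First we have to decode it. You say "diff two CSV files", which usually means a row-wise diff, possibly after first reordering the rows by the index columns ['!Sample_title', '!Sample_geo_accession'].
But what you actually want is a column-wise diff. Specifically, you want to know which columns were added in csv2, which columns were deleted, and, for the common columns, which entries (rows) changed in csv2. Now, do you want these differences computed and displayed column by column, or across all columns at once?
Something like this: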
import pandas as pd
pd.options.display.width = 200
# Index both frames by the key columns so rows line up by value, not by position
keys = ['!Sample_title', '!Sample_geo_accession']
df1 = pd.read_csv('1.csv', index_col=keys)
df2 = pd.read_csv('2.csv', index_col=keys)
# Index set operations (the old `&` and `-` operators no longer do this in current pandas)
cols_common = df1.columns.intersection(df2.columns).tolist()
cols_added = df2.columns.difference(df1.columns).tolist()
cols_deleted = df1.columns.difference(df2.columns).tolist()
# .ix was removed from pandas; .loc does label-based selection
print("\nAdded", df2.loc[:, cols_added])
print("\nDeleted", df1.loc[:, cols_deleted])
print("\nChanged", df2.loc[:, cols_common])
Output:
Added:
[(AsDW, GSM501444), (foreign, GSM501449), (HJCENV, GSM501446)]
Deleted                                      !Sample_status
!Sample_title !Sample_geo_accession
body          GSM501443              Public on july 22 2010
foreign       GSM501445              Public on july 22 2010
HJCENV        GSM501446              Public on july 22 2010
AsDW          GSM501444              Public on july 22 2010
Changed                                  !Sample_type !Sample_source_name_ch1
!Sample_title !Sample_geo_accession
AsDW          GSM501444              ribonucleic acid                    FB_0
foreign       GSM501449              ribonucleic acid                      FB
HJCENV        GSM501446                           RNA                    12wk
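Note that this output still includes rows whose key pair exists in only one file (for example foreign/GSM501449). If the intent is to keep only rows whose (!Sample_title, !Sample_geo_accession) pair appears in both files, a minimal sketch could look like the following. The inline CSV text stands in for 1.csv and 2.csv, and the names `in_both`, `only_in_csv1` and `extra` are my own, not from the question. The question calls this dictionary "Added", while the code above prints the same columns under "Deleted", so the sketch uses a neutral name.

```python
import io
import pandas as pd

# Inline copies of the two sample files (stand-ins for 1.csv and 2.csv).
CSV1 = """!Sample_title,!Sample_geo_accession,!Sample_status,!Sample_type,!Sample_source_name_ch1
body,GSM501443,Public on july 22 2010,ribonucleic acid,FB_50_12wk
foreign,GSM501445,Public on july 22 2010,ribonucleic acid,FB_0_12wk
HJCENV,GSM501446,Public on july 22 2010,ribonucleic acid,FB_50_12wk
AsDW,GSM501444,Public on july 22 2010,ribonucleic acid,FB_0_12wk
"""
CSV2 = """!Sample_title,!Sample_type,!Sample_source_name_ch1,!Sample_geo_accession
AsDW,ribonucleic acid,FB_0,GSM501444
foreign,ribonucleic acid,FB,GSM501449
HJCENV,RNA,12wk,GSM501446
"""

keys = ['!Sample_title', '!Sample_geo_accession']
df1 = pd.read_csv(io.StringIO(CSV1), index_col=keys)
df2 = pd.read_csv(io.StringIO(CSV2), index_col=keys)

# Boolean mask: rows of CSV1 whose (title, accession) pair also occurs in CSV2.
in_both = df1.index.isin(df2.index)
df1_matched = df1[in_both]

# Columns that exist only in CSV1, reported per title for the matched rows.
only_in_csv1 = df1.columns.difference(df2.columns)
extra = {
    col: {title: value for (title, _acc), value in df1_matched[col].items()}
    for col in only_in_csv1
}
print(extra)
```

Swapping the roles of `df1` and `df2` gives the columns that exist only in CSV2, which is empty for this sample data.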
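It also seems you need us to reorder the columns so that df1 and df2 have them in the same order. But you still have not told us how '!Sample_source_name_ch1' should be compared, since 'FB_0_12wk' != '12wk'.
I am not going to take this any further until you clarify what you are asking for.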
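One possible reading, and it is only an assumption, is exact string inequality on the shared columns, for rows whose key pair exists in both files. Under that assumption a rough sketch might be the following; the inline CSV text again stands in for the two files, and `changed` is a hypothetical name.

```python
import io
import pandas as pd

# Inline copies of the two sample files (stand-ins for 1.csv and 2.csv).
CSV1 = """!Sample_title,!Sample_geo_accession,!Sample_status,!Sample_type,!Sample_source_name_ch1
body,GSM501443,Public on july 22 2010,ribonucleic acid,FB_50_12wk
foreign,GSM501445,Public on july 22 2010,ribonucleic acid,FB_0_12wk
HJCENV,GSM501446,Public on july 22 2010,ribonucleic acid,FB_50_12wk
AsDW,GSM501444,Public on july 22 2010,ribonucleic acid,FB_0_12wk
"""
CSV2 = """!Sample_title,!Sample_type,!Sample_source_name_ch1,!Sample_geo_accession
AsDW,ribonucleic acid,FB_0,GSM501444
foreign,ribonucleic acid,FB,GSM501449
HJCENV,RNA,12wk,GSM501446
"""

keys = ['!Sample_title', '!Sample_geo_accession']
df1 = pd.read_csv(io.StringIO(CSV1), index_col=keys)
df2 = pd.read_csv(io.StringIO(CSV2), index_col=keys)

# Shared columns, then the CSV1 rows whose key pair also occurs in CSV2.
common = df1.columns.intersection(df2.columns)
a = df1.loc[df1.index.isin(df2.index), common]
# Put CSV2 into the same row and column order so cells line up one to one.
b = df2.reindex(index=a.index, columns=common)

# For each matched row, record (CSV1 value, CSV2 value) wherever they differ.
changed = {}
for key, row_a, row_b in zip(a.index, a.to_numpy(), b.to_numpy()):
    diffs = {col: (x, y) for col, x, y in zip(common, row_a, row_b) if x != y}
    if diffs:
        changed[key] = diffs

for key, diffs in changed.items():
    print(key, diffs)
```

Exact equality flags 'FB_0_12wk' versus 'FB_0' as a change; if prefixes or suffixes should count as equal, the `x != y` test is the one place that would need a different rule.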