Question

I have two files. file1.txt:

ID  Gene    ShortName   TSS
A   ENS1S   Gm16088 TSS82763
B   ENS2S   Gm26206 TSS81070
C   ENS3S   Rp1 TSS11475
D   ENS4S   Gm22848 TSS18078
E   ENS5S   Sox17   TSS56047,TSS74369

file2.txt:

ID  Type    Condition
B   Normal  2
J   Cancer  1
K   Cancer  2
A   Normal  3

The output I want is file1.txt with the values from file2.txt that match on the first column added:

ID  Gene    ShortName   TSS Type    Condition
A   ENS1S   Gm16088 TSS82763    Normal  3
B   ENS2S   Gm26206 TSS81070    Normal  2
C   ENS3S   Rp1 TSS11475        
D   ENS4S   Gm22848 TSS18078    
E   ENS5S   Sox17   TSS56047,TSS74369

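So the Type and Condition columns from file2.txt are added. If a value is in file1 but not in file2, it gets an empty cell. If a value is in file2 but not in file1, it is ignored. Here is what I have tried so far, which does not work: reading in the two inputs and then trying to merge or join them: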

import csv

with open('file1.txt') as f:
    r = csv.reader(f, delimiter='\t')
    dict1 = {row[0]: row for row in r}

with open('file2.txt') as f:
    r = csv.reader(f, delimiter='\t')
    dict2 = {row[0]: row for row in r}

# I saw this on Stack Overflow; I am not sure why it sorts the keys
# alphabetically, and I am unable to unsort them (any side tip on that?)
keys = set(dict1.keys() + dict2.keys())

with open('output.csv', 'wb') as f:
    w = csv.writer(f, delimiter='\t')
    w.writerows([[key, '\t', dict1.get(key), '\t', dict2.get(key)]
                 for key in keys])

I also tried pd.concat with axis=1, but that did not work either.

Then I tried:

Runtime

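This did not give the desired output either; there were lots of ' characters between the strings. Is there a recommended way to do this? I know how to merge data frames when they have the same number of rows and the same index, but I cannot work out how to do it when I only want the first file to serve as the reference index. I know how to do this in R with the merge function and by.x and by.y, but R messes up all my header names (the ones here are just an example), so I would prefer to do it in Python.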

Answer 1

Reading the files with sep='\t' did not parse them correctly, but sep='\s+' did for your sample lines, and a standard merge then gives the result you want:

import pandas as pd

df1 = pd.read_csv('text1.txt', sep=r'\s+')
df2 = pd.read_csv('text2.txt', sep=r'\s+')
df1.merge(df2, on='ID', how='left')

  ID   Gene ShortName                TSS    Type  Condition
0  A  ENS1S   Gm16088           TSS82763  Normal          3
1  B  ENS2S   Gm26206           TSS81070  Normal          2
2  C  ENS3S       Rp1           TSS11475     NaN        NaN
3  D  ENS4S   Gm22848           TSS18078     NaN        NaN
4  E  ENS5S     Sox17  TSS56047,TSS74369     NaN        NaN
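For anyone who wants to reproduce this without creating the files first, here is a minimal sketch. It assumes the sample rows are fed in through io.StringIO instead of being read from text1.txt and text2.txt; the names FILE1, FILE2, and merged are only illustrative.

```python
import io

import pandas as pd

# Inline copies of the sample files from the question (whitespace-separated).
FILE1 = """\
ID  Gene    ShortName   TSS
A   ENS1S   Gm16088 TSS82763
B   ENS2S   Gm26206 TSS81070
C   ENS3S   Rp1 TSS11475
D   ENS4S   Gm22848 TSS18078
E   ENS5S   Sox17   TSS56047,TSS74369
"""
FILE2 = """\
ID  Type    Condition
B   Normal  2
J   Cancer  1
K   Cancer  2
A   Normal  3
"""

# sep=r"\s+" splits on any run of whitespace, so mixed tabs and spaces parse.
df1 = pd.read_csv(io.StringIO(FILE1), sep=r"\s+")
df2 = pd.read_csv(io.StringIO(FILE2), sep=r"\s+")

# how="left" keeps every row of df1 in df1's order and drops the IDs that
# appear only in df2 (J and K here).
merged = df1.merge(df2, on="ID", how="left")
print(merged)
```

Note that sep=r"\s+" would also split fields that contain spaces, so it only suits data like this sample where no value has internal whitespace.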

You could of course also move 'ID' to the index and use .join(), pd.concat(), or .merge(left_index=True, right_index=True), with the appropriate settings for a left join in each case.
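The index-based variants might look like the sketch below. The frames are cut-down, hand-built versions of the sample data (an assumption to keep the example short), and the variable names are illustrative. Since pd.concat only offers inner and outer joins, the sketch emulates a left join by reindexing to the left frame's index, which assumes the IDs are unique in both frames.

```python
import pandas as pd

# Cut-down versions of the two tables from the question.
df1 = pd.DataFrame({
    "ID": ["A", "B", "C", "D", "E"],
    "Gene": ["ENS1S", "ENS2S", "ENS3S", "ENS4S", "ENS5S"],
})
df2 = pd.DataFrame({
    "ID": ["B", "J", "K", "A"],
    "Type": ["Normal", "Cancer", "Cancer", "Normal"],
    "Condition": [2, 1, 2, 3],
})

left = df1.set_index("ID")
right = df2.set_index("ID")

# 1) join aligns on the index; "left" is also its default.
via_join = left.join(right, how="left")

# 2) merge can use the index on both sides instead of a column.
via_merge = left.merge(right, left_index=True, right_index=True, how="left")

# 3) concat has no left mode, so take the outer result and keep only the
#    rows of the left frame, in the left frame's order.
via_concat = pd.concat([left, right], axis=1).reindex(left.index)

print(via_join)
```

All three should produce the same rows; join is the shortest to write when the key is already the index.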

Answer 2

You can use join to merge on the index:

df1.join(df2, how="left")

Note: you can use fillna to fill the NaN values with empty strings, but I would rather leave them as missing values (see this post).

This gives the following:

In [11]: df1
Out[11]:
     Gene ShortName                TSS
ID
A   ENS1S   Gm16088           TSS82763
B   ENS2S   Gm26206           TSS81070
C   ENS3S       Rp1           TSS11475
D   ENS4S   Gm22848           TSS18078
E   ENS5S     Sox17  TSS56047,TSS74369

In [12]: df2
Out[12]:
      Type  Condition
ID
B   Normal          2
J   Cancer          1
K   Cancer          2
A   Normal          3

In [13]: df1.join(df2, how="outer")
Out[13]:
     Gene ShortName                TSS    Type  Condition
ID
A   ENS1S   Gm16088           TSS82763  Normal          3
B   ENS2S   Gm26206           TSS81070  Normal          2
C   ENS3S       Rp1           TSS11475     NaN        NaN
D   ENS4S   Gm22848           TSS18078     NaN        NaN
E   ENS5S     Sox17  TSS56047,TSS74369     NaN        NaN
J     NaN       NaN                NaN  Cancer          1
K     NaN       NaN                NaN  Cancer          2

In [14]: df1.join(df2, how="left")
Out[14]:
     Gene ShortName                TSS    Type  Condition
ID
A   ENS1S   Gm16088           TSS82763  Normal          3
B   ENS2S   Gm26206           TSS81070  Normal          2
C   ENS3S       Rp1           TSS11475     NaN        NaN
D   ENS4S   Gm22848           TSS18078     NaN        NaN
E   ENS5S     Sox17  TSS56047,TSS74369     NaN        NaN
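If the goal is empty cells in the output file rather than in the DataFrame itself, one option is to leave the NaN values alone and let to_csv write them as empty fields. The sketch below assumes small hand-built frames and writes to an in-memory buffer instead of a real file. It also converts Condition to the nullable Int64 dtype, because the missing values otherwise turn that column into floats (3.0 rather than 3).

```python
import io

import pandas as pd

df1 = pd.DataFrame({"ID": ["A", "B", "C"],
                    "Gene": ["ENS1S", "ENS2S", "ENS3S"]})
df2 = pd.DataFrame({"ID": ["B", "J", "A"],
                    "Type": ["Normal", "Cancer", "Normal"],
                    "Condition": [2, 1, 3]})

merged = df1.merge(df2, on="ID", how="left")

# NaN forces Condition to float64; Int64 keeps whole numbers and still
# allows missing entries.
merged["Condition"] = merged["Condition"].astype("Int64")

# na_rep="" (the default) writes missing values as empty fields.
buffer = io.StringIO()
merged.to_csv(buffer, sep="\t", index=False, na_rep="")
text = buffer.getvalue()
print(text)
```

Keeping the values missing in memory means isna() and numeric operations still work; the empty strings only appear in the written file.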

However, I don't see how you arrived at your expected output (ENS4S comes from D, while Cancer 2 comes from K).
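On the side question about key order: a set has no defined order, so the original row order is lost as soon as the keys go through set(). Adding the results of dict.keys() with + also only works in Python 2. A pandas-free sketch that keeps file1's order is to loop over file1's rows and look each ID up in a dict built from file2. The helper left_join_rows and the inline tab-separated sample data are hypothetical, introduced only for illustration.

```python
import csv
import io
import sys

# Small tab-separated samples standing in for file1.txt and file2.txt.
FILE1 = "ID\tGene\nA\tENS1S\nB\tENS2S\nC\tENS3S\n"
FILE2 = "ID\tType\tCondition\nB\tNormal\t2\nJ\tCancer\t1\nA\tNormal\t3\n"


def left_join_rows(rows1, rows2):
    """Append the non-key columns of rows2 to rows1, matching on column 0."""
    header1, *body1 = rows1
    header2, *body2 = rows2
    lookup = {row[0]: row[1:] for row in body2}
    blanks = [""] * (len(header2) - 1)  # used when an ID is missing in rows2
    joined = [header1 + header2[1:]]
    for row in body1:  # iterating rows1 preserves its original order
        joined.append(row + lookup.get(row[0], blanks))
    return joined


rows1 = list(csv.reader(io.StringIO(FILE1), delimiter="\t"))
rows2 = list(csv.reader(io.StringIO(FILE2), delimiter="\t"))
joined = left_join_rows(rows1, rows2)

csv.writer(sys.stdout, delimiter="\t").writerows(joined)
```

Passing each row as a flat list of fields to the writer also avoids the stray quote characters that appear when whole lists are written as single cells.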

How to combine two data frames if only a common index is present, otherwise keep empty cells
