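I'm fairly new to Python and am trying out the pandas module. Below are my sample files (in each line, the first element is read_name, the second is methylation_state, and the third is position).
My goal is first to extract all the '+' rows from Input_Sample1.txt and Input_Sample2.txt, which I am able to do.
Second, I want to merge the two DataFrames to extract the positions that are in the first DF but not in the second, and then the positions that are in the second DF but not in the first.
This is what I have so far. Both the m1 and m2 steps go wrong and produce the following warning:
UserWarning: Boolean Series key will be reindexed to match DataFrame index. "DataFrame index.", UserWarning)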
#!/usr/bin/env python
from __future__ import print_function
import pandas as pd
import sys
df1=pd.read_csv('Input_Sample1.txt', names=['read_name', 'methylation_state', 'position'], usecols=['position', 'methylation_state'], delimiter=r'\s+')
df1 = df1[(df1.methylation_state == '+')]
# print('df1 %s' % ('-' * 50))
# print(df1)
df2=pd.read_csv('Input_Sample2.txt', names=['read_name','methylation_state','position'], usecols=['position', 'methylation_state'], delimiter=r'\s+')
df2 = df2[(df2.methylation_state == '+')]
#print('df2 %s' % ('-' * 50))
#print(df2)
#get an error for the following merged dataframes m1 and m2:
m1=pd.merge(df1, df2, how='left', on='position')
print('df2 - df1 %s' % ('-' * 50))
print(df2[m1['methylation_state_y'].isnull()])
m2 = pd.merge(df1, df2, how='left', on='position')
print('df1 - df2 %s' % ('-' * 50))
print(df1[m2['methylation_state_y'].isnull()])
Input_Sample1.txt:
SRR1035452.114_CRIRUN_726:7:1101:3884:2095_length=36 + 37151024
SRR1035452.114_CRIRUN_726:7:1101:3884:2095_length=36 + 37151031
SRR1035452.114_CRIRUN_726:7:1101:3884:2095_length=36 + 37151189
SRR1035452.117_CRIRUN_726:7:1101:3789:2132_length=36 + 23189251
SRR1035452.117_CRIRUN_726:7:1101:3789:2132_length=36 + 23189248
SRR1035452.117_CRIRUN_726:7:1101:3789:2132_length=36 + 23189242
SRR1035452.117_CRIRUN_726:7:1101:3789:2132_length=36 + 23189086
SRR1035452.117_CRIRUN_726:7:1101:3789:2132_length=36 + 23189101
SRR1035452.211_CRIRUN_726:7:1101:5833:2115_length=36 + 60644021
SRR1035452.211_CRIRUN_726:7:1101:5833:2115_length=36 + 60644026
SRR1035452.211_CRIRUN_726:7:1101:5833:2115_length=36 + 60644032
SRR1035452.211_CRIRUN_726:7:1101:5833:2115_length=36 + 60644038
SRR1035452.211_CRIRUN_726:7:1101:5833:2115_length=36 + 60644042
SRR1035452.211_CRIRUN_726:7:1101:5833:2115_length=36 + 60644050
SRR1035452.211_CRIRUN_726:7:1101:5833:2115_length=36 + 60644055
SRR1035452.211_CRIRUN_726:7:1101:5833:2115_length=36 + 60644267
SRR1035452.211_CRIRUN_726:7:1101:5833:2115_length=36 + 60644253
SRR1035452.211_CRIRUN_726:7:1101:5833:2115_length=36 + 60644246
SRR1035452.211_CRIRUN_726:7:1101:5833:2115_length=36 + 60644240
SRR1035452.211_CRIRUN_726:7:1101:5833:2115_length=36 + 60644236
SRR1035452.336_CRIRUN_726:7:1101:8029:2240_length=36 + 26775201
SRR1035452.336_CRIRUN_726:7:1101:8029:2240_length=36 + 26775193
SRR1035452.336_CRIRUN_726:7:1101:8029:2240_length=36 + 26775178
SRR1035452.336_CRIRUN_726:7:1101:8029:2240_length=36 + 26775012
SRR1035452.377_CRIRUN_726:7:1101:9240:2160_length=36 + 27851064
SRR1035452.377_CRIRUN_726:7:1101:9240:2160_length=36 + 27851253
Input_Sample2.txt:
SRR1035454.47_CRIRUN_726:7:1101:2618:2094_length=36 - 18921902
SRR1035454.47_CRIRUN_726:7:1101:2618:2094_length=36 + 18921911
SRR1035454.47_CRIRUN_726:7:1101:2618:2094_length=36 + 18921919
SRR1035454.47_CRIRUN_726:7:1101:2618:2094_length=36 + 18921926
SRR1035454.47_CRIRUN_726:7:1101:2618:2094_length=36 + 18922145
SRR1035454.174_CRIRUN_726:7:1101:6245:2159_length=36 + 51460469
SRR1035454.174_CRIRUN_726:7:1101:6245:2159_length=36 + 51460488
SRR1035454.174_CRIRUN_726:7:1101:6245:2159_length=36 + 51460631
SRR1035454.174_CRIRUN_726:7:1101:6245:2159_length=36 + 51460613
SRR1035454.174_CRIRUN_726:7:1101:6245:2159_length=36 + 51460608
SRR1035454.215_CRIRUN_726:7:1101:7106:2100_length=36 - 30309836
SRR1035454.216_CRIRUN_726:7:1101:7129:2116_length=36 + 31856610
SRR1035454.216_CRIRUN_726:7:1101:7129:2116_length=36 + 31856602
SRR1035454.216_CRIRUN_726:7:1101:7129:2116_length=36 + 31856255
SRR1035454.270_CRIRUN_726:7:1101:8134:2171_length=36 + 26078372
SRR1035454.270_CRIRUN_726:7:1101:8134:2171_length=36 + 26078363
SRR1035454.306_CRIRUN_726:7:1101:9223:2098_length=36 + 55329938
SRR1035454.348_CRIRUN_726:7:1101:10157:2107_length=36 + 40179303
SRR1035454.348_CRIRUN_726:7:1101:10157:2107_length=36 + 40179299
SRR1035454.348_CRIRUN_726:7:1101:10157:2107_length=36 + 40179018
Part of df1 as printed:
0 + 37151024
1 + 37151031
2 + 37151189
3 + 23189251
4 + 23189248
5 + 23189242
6 + 23189086
7 + 23189101
8 + 60644021
9 + 60644026
10 + 60644032
11 + 60644038
12 + 60644042
13 + 60644050
14 + 60644055
15 + 60644267
16 + 60644253
17 + 60644246
18 + 60644240
Part of the df2 output:
methylation_state position
1 + 18921911
2 + 18921919
3 + 18921926
4 + 18922145
5 + 51460469
6 + 51460488
7 + 51460631
8 + 51460613
9 + 51460608
11 + 31856610
12 + 31856602
13 + 31856255
14 + 26078372
Note that each text file contains about 80k lines. Any help or suggestions are much appreciated!
Answer 0 (score: 0)
As follows:
df1 = pd.read_csv('Input_Sample1.txt', names=['read_name', 'methylation_state', 'position'], usecols=['position', 'methylation_state'], delimiter=r'\s+')
df2 = pd.read_csv('Input_Sample2.txt', names=['read_name', 'methylation_state', 'position'], usecols=['position', 'methylation_state'], delimiter=r'\s+')
m1 = df1.merge(df2, how='left')  # left join: keeps every row of df1, matched on the shared columns
m2 = df2.merge(df1, how='left')  # left join: keeps every row of df2, matched on the shared columns
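The warning in the question most likely arises because df1 and df2 are indexed with a boolean mask built from the merged frame, whose row index no longer matches theirs. A left merge on its own also does not say which rows found no partner. The following is a minimal sketch of one way around both points, assuming an outer merge on position with the `indicator=True` option of `pd.merge`; the inline sample data and the helper `load_methylated` are hypothetical stand-ins for the real files:

```python
import io
import pandas as pd

# Tiny stand-ins for the two input files (hypothetical data, not the real samples).
sample1 = io.StringIO("r1 + 100\nr1 + 200\nr2 - 300\nr2 + 400\n")
sample2 = io.StringIO("r9 + 200\nr9 + 500\nr9 - 100\n")

cols = ['read_name', 'methylation_state', 'position']

def load_methylated(handle):
    """Read one sample and keep only the rows whose methylation_state is '+'."""
    df = pd.read_csv(handle, names=cols,
                     usecols=['methylation_state', 'position'], sep=r'\s+')
    return df[df.methylation_state == '+']

df1 = load_methylated(sample1)
df2 = load_methylated(sample2)

# Outer merge on position; the extra '_merge' column records where each key came from.
merged = pd.merge(df1[['position']].drop_duplicates(),
                  df2[['position']].drop_duplicates(),
                  how='outer', on='position', indicator=True)

# The mask is built from 'merged' and applied to 'merged', so the indexes line up.
only_in_df1 = merged.loc[merged['_merge'] == 'left_only', 'position']
only_in_df2 = merged.loc[merged['_merge'] == 'right_only', 'position']

print(only_in_df1.tolist())
print(only_in_df2.tolist())
```

Because the mask and the frame it filters are the same object, pandas has nothing to reindex and the warning should not appear.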
Answer 1 (score: -1)
Try this:
#!/usr/bin/env python
from __future__ import print_function
import sys
import pandas as pd
sys.stdout=open('CHG_comparison.txt', 'w')
ESfemale=pd.read_csv('Input_Sample1.txt', names=['read_name', 'methylation_state', 'position'], usecols=['position', 'methylation_state'], delimiter=r'\s+')
ESfemale = ESfemale[(ESfemale.methylation_state == '+')]
# print('ESfemale CHG context of all methylation sites %s' % ('-' * 50))
# print(ESfemale)
EpiSC=pd.read_csv('Input_Sample2.txt', names=['read_name','methylation_state','position'], usecols=['position', 'methylation_state'], delimiter=r'\s+')
EpiSC = EpiSC[(EpiSC.methylation_state == '+')]
#print('EpiSC CHG context of all methylation sites %s' % ('-' * 50))
#print(EpiSC)
#print(ESfemale[['methylation_state', 'position']].isin(EpiSC.to_dict(orient='list')))
diff1 = ESfemale.loc[~ESfemale[['methylation_state', 'position']].isin(EpiSC.to_dict(orient='list')).all(axis=1)]  # .ix was removed in pandas 1.0; .loc accepts the same boolean mask
print(diff1)
diff1.to_csv('diff1.csv')
diff2 = EpiSC.loc[~EpiSC[['methylation_state', 'position']].isin(ESfemale.to_dict(orient='list')).all(axis=1)]  # .ix was removed in pandas 1.0; .loc accepts the same boolean mask
print(diff2)
diff2.to_csv('diff2.csv')
PS: The sample files contain no intersecting entries, so I had to add a few lines from file 1 to file 2, and vice versa, in order to test this.
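Since both frames are already filtered to '+' rows, the comparison effectively comes down to position alone. A shorter sketch of the same idea, assuming a plain `Series.isin` check on the position column is enough; the two small frames below are made-up test data, not the real samples:

```python
import pandas as pd

# Hypothetical, already-filtered frames standing in for the two samples.
df1 = pd.DataFrame({'methylation_state': ['+', '+', '+'],
                    'position': [100, 200, 400]})
df2 = pd.DataFrame({'methylation_state': ['+', '+'],
                    'position': [200, 500]})

# Rows of df1 whose position never occurs in df2, and the reverse.
# Each mask is computed from the frame it filters, so the indexes match.
diff1 = df1.loc[~df1['position'].isin(df2['position'])]
diff2 = df2.loc[~df2['position'].isin(df1['position'])]

print(diff1)
print(diff2)
```

This avoids the merge entirely, which may be convenient for files of around 80k lines, though it only reports positions and not which rows matched each other.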