I have a file in which each line contains a read name, a '+' or '-', and a position given as a number.
I started by opening the file with a Python script:
#!/usr/bin/env python
import sys
file=open('filepath')
dictionary={}
for line in file:
    reads=line.split()
    read_name=reads[0]
    methylation_state=reads[1] #this is a plus or minus
    position=int(reads[2])
I'm having trouble building a dictionary whose {keys: values} are {methylation_state: position}.
If anyone could help me, I'd be very grateful. I hope this is clear.
Sample
input1.txt
SRR1035452.21010_CRIRUN_726:7:1101:4566:6721_length=36 + 59399861
SRR1035452.21010_CRIRUN_726:7:1101:4566:6721_length=36 + 59399728
SRR1035452.21010_CRIRUN_726:7:1101:4566:6721_length=36 + 59399735
SRR1035452.21010_CRIRUN_726:7:1101:4566:6721_length=36 + 59399752
SRR1035452.21044_CRIRUN_726:7:1101:5464:6620_length=36 + 31107092
input2.txt
SRR1035454.47_CRIRUN_726:7:1101:2618:2094_length=36 + 18922145
SRR1035454.174_CRIRUN_726:7:1101:6245:2159_length=36 + 51460469
SRR1035454.174_CRIRUN_726:7:1101:6245:2159_length=36 + 51460488
SRR1035454.174_CRIRUN_726:7:1101:6245:2159_length=36 + 51460631
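Answer 0 (score: 1)
It sounds like you just need the positions. Once you know all the positions in each file, you can perform a variety of operations on them.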
def positions(filename):
    # split on + and the second element is what we want
    return set(line.split('+')[1].strip() for line in open(filename)
               if '+' in line)
# get sets from both files
f1 = positions('f1.txt')
f2 = positions('f2.txt')
# with sets, subtraction shows you what is in one not the other
print("in 1 not 2", f1 - f2)
print("in 2 not 1", f2 - f1)
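positions is implemented with Python's compact comprehension syntax (usually called "list comprehensions", although this one is strictly a generator expression passed to set()). I don't know where the name comes from. You can split it into multiple parts to see the individual steps, as in the following, but once you are used to Python, the first implementation is clear.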
def positions(filename):
    # open the file for reading
    with open(filename) as fp:
        # set will hold positions
        pos = set()
        # read the file line by line
        for line in fp:
            # we only care about lines with pluses
            if '+' in line:
                # split on '+' into two parts
                parts = line.split('+')
                # position is the second part but we need to get rid of
                # extra spaces and newline
                position = parts[1].strip()
                # add to set. if position is already in set, you don't get
                # a second one, this one is dropped
                pos.add(position)
    return pos
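If you also want the dictionary from the question, here is a minimal sketch of one way to do it, assuming each line has exactly three whitespace-separated fields. Because a plain {methylation_state: position} dictionary can hold only one position per state, this version maps each state to a set of positions instead. The function name positions_by_state and the sample read names are made up for illustration.

```python
from collections import defaultdict


def positions_by_state(lines):
    """Group integer positions by methylation state ('+' or '-')."""
    grouped = defaultdict(set)
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        _read_name, state, position = fields
        grouped[state].add(int(position))
    return dict(grouped)


# Illustrative lines in the question's three-column format
# (hypothetical read names, not the real data).
sample1 = [
    "read_a + 59399861",
    "read_a + 59399728",
    "read_b + 31107092",
    "read_c - 100",
]
sample2 = [
    "read_x + 18922145",
    "read_y + 31107092",
]

d1 = positions_by_state(sample1)
d2 = positions_by_state(sample2)

# Set operations work per state, just as with the plain sets above.
print("'+' in 1 not 2:", sorted(d1["+"] - d2["+"]))
print("'+' in both:", sorted(d1["+"] & d2["+"]))
```

For real files you can pass the open file object directly, since the function only needs something it can iterate over line by line.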
Answer 1 (score: 0)
Here is some code that uses the pandas module:
from __future__ import print_function
import pandas as pd
df1 = pd.read_csv('a1.txt', names=['read_name','meth_state','position'], usecols=['position', 'meth_state'], delimiter=r'\s+')
df1 = df1[(df1.meth_state == '+')]
print('DF1 %s' % ('-' * 50))
print(df1)
df2 = pd.read_csv('a2.txt', names=['read_name','meth_state','position'], usecols=['position', 'meth_state'], delimiter=r'\s+')
df2 = df2[(df2.meth_state == '+')]
print('DF2 %s' % ('-' * 50))
print(df2)
m1 = pd.merge(df2, df1, how='left', on='position')
print('DF2 - DF1 %s' % ('-' * 50))
print(df2[m1['meth_state_y'].isnull()])
m2 = pd.merge(df1, df2, how='left', on='position')
print('DF1 - DF2 %s' % ('-' * 50))
print(df1[m2['meth_state_y'].isnull()])
Output:
DF1 --------------------------------------------------
meth_state position
0 + 59399861
1 + 59399728
2 + 59399735
3 + 59399752
4 + 31107092
DF2 --------------------------------------------------
meth_state position
0 + 18922145
1 + 51460469
2 + 51460488
3 + 51460631
4 + 31107092
DF2 - DF1 --------------------------------------------------
meth_state position
0 + 18922145
1 + 51460469
2 + 51460488
3 + 51460631
DF1 - DF2 --------------------------------------------------
meth_state position
0 + 59399861
1 + 59399728
2 + 59399735
3 + 59399752
I strongly recommend that you learn pandas; it may greatly simplify your future work.
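As a possible variation on the merge approach, pandas' merge accepts indicator=True, which adds a _merge column saying whether each row came from the left frame, the right frame, or both. The sketch below uses small inline data in place of the files; the read names, the helper load_plus_positions, and the choice of positions are assumptions for illustration only.

```python
import io

import pandas as pd

COLUMNS = ["read_name", "meth_state", "position"]

# Illustrative data in the question's format; read names are hypothetical.
text1 = "r1 + 59399861\nr1 + 59399728\nr2 + 31107092\n"
text2 = "r3 + 18922145\nr4 + 51460469\nr5 + 31107092\n"


def load_plus_positions(source):
    """Read a whitespace-delimited source and keep the unique '+' positions."""
    df = pd.read_csv(source, names=COLUMNS, sep=r"\s+")
    return df.loc[df["meth_state"] == "+", ["position"]].drop_duplicates()


df1 = load_plus_positions(io.StringIO(text1))
df2 = load_plus_positions(io.StringIO(text2))

# indicator=True adds a '_merge' column: 'left_only', 'right_only' or 'both'.
merged = df1.merge(df2, on="position", how="outer", indicator=True)

only_in_1 = merged.loc[merged["_merge"] == "left_only", "position"].tolist()
only_in_2 = merged.loc[merged["_merge"] == "right_only", "position"].tolist()
in_both = merged.loc[merged["_merge"] == "both", "position"].tolist()

print("in 1 not 2:", only_in_1)
print("in 2 not 1:", only_in_2)
print("in both:", in_both)
```

A single outer merge gives both differences and the overlap at once, so you do not need one left merge per direction.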