Question

I have a file in which each line contains a read name; a '+' or a '-'; and a position given as a number.

I started by opening the file with a Python script:

    #!/usr/bin/env python
    import sys

    file=open('filepath')

    dictionary={}

    for line in file:

        reads=line.split()

        read_name=reads[0]

        methylation_state=reads[1] #this is a plus or minus

        position=int(reads[2])

I am having trouble building a dictionary with {keys: values} as {methylation_state: position}.

If anyone could help me, I would be very grateful. I hope this is clear.

Sample data

input1.txt

SRR1035452.21010_CRIRUN_726:7:1101:4566:6721_length=36 + 59399861
SRR1035452.21010_CRIRUN_726:7:1101:4566:6721_length=36 + 59399728
SRR1035452.21010_CRIRUN_726:7:1101:4566:6721_length=36 + 59399735
SRR1035452.21010_CRIRUN_726:7:1101:4566:6721_length=36 + 59399752
SRR1035452.21044_CRIRUN_726:7:1101:5464:6620_length=36 + 31107092

input2.txt

SRR1035454.47_CRIRUN_726:7:1101:2618:2094_length=36 + 18922145
SRR1035454.174_CRIRUN_726:7:1101:6245:2159_length=36 + 51460469
SRR1035454.174_CRIRUN_726:7:1101:6245:2159_length=36 + 51460488
SRR1035454.174_CRIRUN_726:7:1101:6245:2159_length=36 + 51460631

Answer 1

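It sounds like you just need the positions. Once you know all the positions in each file, you can perform a variety of operations on them.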

def positions(filename):
    # split on + and the second element is what we want
    return set(line.split('+')[1].strip() for line in open(filename)
        if '+' in line)

# get sets from both files
f1 = positions('f1.txt')
f2 = positions('f2.txt')
# with sets, subtraction shows you what is in one not the other
print("in 1 not 2", f1 - f2)
print("in 2 not 1", f2 - f1)
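
Subtraction is only one of the operations sets offer. As a small sketch, using a handful of position strings typed in by hand rather than read from the files (so the literal values are only illustrative), intersection, union and symmetric difference work the same way:

```python
# Illustrative sets of position strings; in practice these would come
# from positions('f1.txt') and positions('f2.txt').
f1 = {'59399861', '59399728', '31107092'}
f2 = {'18922145', '51460469', '31107092'}

common = f1 & f2      # positions present in both files
either = f1 | f2      # positions present in at least one file
only_one = f1 ^ f2    # positions present in exactly one file

print("in both", sorted(common))
print("in exactly one", sorted(only_one))
```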

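positions is implemented with a compact generator expression, a close relative of Python's "list comprehensions". I don't know where that name comes from. You can break it into parts to see the individual steps, as follows, but once you are used to Python, the first implementation is clear enough.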

def positions(filename):
    # open the file for reading
    with open(filename) as fp:
        # set will hold positions
        pos = set()
        # read the file line by line
        for line in fp:
            # we only care about lines with pluses
            if '+' in line:
                # split on '+' into two parts: read name and position
                parts = line.split('+')
                # position is the second part but we need to get rid of 
                # extra spaces and newline
                position = parts[1].strip()
                # add to set. if position is already in set, you don't get
                # a second one, this one is dropped
                pos.add(position)
    return pos
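
The functions above only collect positions. If the dictionary from the question is still wanted, here is a minimal sketch, assuming whitespace-separated fields as in the sample and assuming the goal is to keep every position seen for each methylation state. `build_state_dict` is a hypothetical helper name, not something from the original post. A plain {state: position} dict would keep only the last position per state, so the sketch maps each state to a list instead.

```python
from collections import defaultdict

def build_state_dict(lines):
    """Map each methylation state ('+' or '-') to a list of positions."""
    states = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        read_name, methylation_state, position = fields
        states[methylation_state].append(int(position))
    return dict(states)

# Two lines copied from input1.txt in the question; a real script would
# pass an open file object instead of a list.
sample = [
    "SRR1035452.21010_CRIRUN_726:7:1101:4566:6721_length=36 + 59399861",
    "SRR1035452.21044_CRIRUN_726:7:1101:5464:6620_length=36 + 31107092",
]
print(build_state_dict(sample))
```

Because any iterable of lines is accepted, the same function can be called as `build_state_dict(open('filepath'))` inside a `with` block.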

Answer 2

Here is some code that uses the pandas module:

from __future__ import print_function
import pandas as pd

df1 = pd.read_csv('a1.txt', names=['read_name','meth_state','position'], usecols=['position', 'meth_state'], delimiter=r'\s+')
df1 = df1[(df1.meth_state == '+')]
print('DF1 %s' % ('-' * 50))
print(df1)

df2 = pd.read_csv('a2.txt', names=['read_name','meth_state','position'], usecols=['position', 'meth_state'], delimiter=r'\s+')
df2 = df2[(df2.meth_state == '+')]
print('DF2 %s' % ('-' * 50))
print(df2)

m1 = pd.merge(df2, df1, how='left', on='position')
print('DF2 - DF1 %s' % ('-' * 50))
print(df2[m1['meth_state_y'].isnull()])

m2 = pd.merge(df1, df2, how='left', on='position')
print('DF1 - DF2 %s' % ('-' * 50))
print(df1[m2['meth_state_y'].isnull()])

Output:

DF1 --------------------------------------------------
  meth_state  position
0          +  59399861
1          +  59399728
2          +  59399735
3          +  59399752
4          +  31107092
DF2 --------------------------------------------------
  meth_state  position
0          +  18922145
1          +  51460469
2          +  51460488
3          +  51460631
4          +  31107092
DF2 - DF1 --------------------------------------------------
  meth_state  position
0          +  18922145
1          +  51460469
2          +  51460488
3          +  51460631
DF1 - DF2 --------------------------------------------------
  meth_state  position
0          +  59399861
1          +  59399728
2          +  59399735
3          +  59399752

I strongly recommend learning pandas; it could greatly simplify your future work.
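
The merge-and-mask approach above relies on the merged frame and the original frame sharing the same row labels, which may not hold once '-' rows have been filtered out. A possible alternative, sketched here under the assumption that only the position column matters for the comparison, is `Series.isin`. The inline text stands in for a1.txt and a2.txt (rows adapted from the question's samples, with one shared position added for illustration), and `load_plus` is a hypothetical helper.

```python
import io
import pandas as pd

COLUMNS = ['read_name', 'meth_state', 'position']

def load_plus(source):
    """Read a whitespace-delimited file and keep only the '+' rows."""
    df = pd.read_csv(source, names=COLUMNS, sep=r'\s+')
    return df[df.meth_state == '+']

# Inline stand-ins for the two input files.
text1 = (
    "SRR1035452.21010_CRIRUN_726:7:1101:4566:6721_length=36 + 59399861\n"
    "SRR1035452.21044_CRIRUN_726:7:1101:5464:6620_length=36 + 31107092\n"
)
text2 = (
    "SRR1035454.47_CRIRUN_726:7:1101:2618:2094_length=36 + 18922145\n"
    "SRR1035452.21044_CRIRUN_726:7:1101:5464:6620_length=36 + 31107092\n"
)

df1 = load_plus(io.StringIO(text1))
df2 = load_plus(io.StringIO(text2))

# Keep the rows whose position does not appear in the other frame.
only_in_2 = df2[~df2['position'].isin(df1['position'])]
only_in_1 = df1[~df1['position'].isin(df2['position'])]

print(only_in_2[['meth_state', 'position']])
print(only_in_1[['meth_state', 'position']])
```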

Building a dictionary from components in a file
