Question

I am trying to write a Python program that reads a file in the following format:

ID  chrom   txStart txEnd   score   strand
ENSMUSG00000042429  chr1    1   100 0   -
ENSMUSG00000042429  chr1    110 500 0   -
ENSMUSG00000042500  chr2    12  40  0   -
ENSMUSG00000042500  chr2    200 10000   0   -
ENSMUSG00000042500  chr2    4   50  0   -
ENSMUSG00000042429  chr3    40  33  0   -
ENSMUSG00000025909  chr3    10000   200000  0   -
ENSMUSG00000025909  chr3    1   5   0   -
ENSMUSG00000025909  chr3    400 2000    0   -

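It should then write a file with the same structure, except that when an ID is repeated it merges the rows, taking the minimum txStart and the maximum txEnd.

For example, ENSMUSG00000042429 appears twice, so its txStart would be 1 and its txEnd 500 (the minimum and maximum, respectively). The expected output for the data above is: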

ID  chrom   txStart txEnd   score   strand
ENSMUSG00000042429  chr1    1   500 0   -
ENSMUSG00000042500  chr2    4   10000   0   -
ENSMUSG00000042429  chr3    40  33  0   -
ENSMUSG00000025909  chr3    1   200000  0   -

I can't figure out how to do this. I started by reading the file in Python with pandas, using the following command to set the first column as the index:

data = pd.read_table("Input.txt", sep="\t")

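I then thought about building a dictionary whose keys are the index values and whose values are the rest of each row. That would be: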

dictionary = {}
for item in data.index:
    k, v = data.ix[item], data.ix[item, c("chrom", "txStart", "txEnd", "score", "strand"]

This raised an error, and I can't figure out where to go from here... What is the best algorithm for getting the desired output?

Answer 1

Your idea of using a dictionary (keyed by record ID) seems sound. What follows is a rough outline.

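This approach assumes you can hold the entire file in memory. If you can't, you will need to be more careful: process one chunk at a time and make sure you pick up all the consecutive lines that share a common ID (assuming those lines really are consecutive). The strategy in outline: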

records = {}

# Open file and deal with the header line.
with open(...) as fh:
    header = next(fh)

    # Process the input data.
    for line in fh:

        # Parse the line and get the ID. You might need
        # more robust parsing logic, depending on the messiness
        # of the data.
        fields = line.split()
        rec_id = fields[0]

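        # Either add a new record, or modify an existing record
        # based on the logic you need.
        if rec_id in records:
            # Modify records[rec_id]: keep the smallest txStart
            # and the largest txEnd seen so far.
            rec = records[rec_id]
            rec[2] = str(min(int(rec[2]), int(fields[2])))
            rec[3] = str(max(int(rec[3]), int(fields[3])))
        else:
            records[rec_id] = fields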
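
If the file is too large for memory and rows sharing an ID really are consecutive, the chunked variant mentioned above might look like the following sketch. It assumes whitespace-separated fields in the column order shown in the question and no header line in the input; `merge_consecutive` is a hypothetical helper name, not something from a library.

```python
from itertools import groupby


def merge_consecutive(lines):
    """Yield one merged record per run of consecutive lines sharing an ID.

    Assumes whitespace-separated fields in the order
    ID, chrom, txStart, txEnd, score, strand, with no header line.
    """
    rows = (line.split() for line in lines if line.strip())
    for rec_id, group in groupby(rows, key=lambda fields: fields[0]):
        group = list(group)  # only one run of rows is held in memory
        first = group[0]
        start = min(int(fields[2]) for fields in group)
        end = max(int(fields[3]) for fields in group)
        # chrom, score and strand are taken from the first row of the run.
        yield [rec_id, first[1], str(start), str(end), first[4], first[5]]


sample = [
    "ENSMUSG00000042429  chr1    1   100 0   -",
    "ENSMUSG00000042429  chr1    110 500 0   -",
    "ENSMUSG00000042500  chr2    12  40  0   -",
]
merged = list(merge_consecutive(sample))
for record in merged:
    print("\t".join(record))
```

With a real file you would call `next(fh)` to skip the header and then pass the file handle itself as `lines`. Because only consecutive rows are grouped, an ID that reappears later in the file (as ENSMUSG00000042429 does on chr3 in the question's sample) comes out as a separate record.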

Answer 2

Yes, a dictionary will help. I think you can take the data from each line and either put it into a dict or, if an entry already exists, update it:

data = {}

def strmin(a, b):
    # Numeric minimum of two numeric strings, returned as a string.
    return str(min(int(a), int(b)))

def strmax(a, b):
    # Numeric maximum of two numeric strings, returned as a string.
    return str(max(int(a), int(b)))

with open('Input.txt') as fp:
    for line in fp:
        ID, chrom, txStart, txEnd, score, strand = line.split()
        if ID == "ID":
            print(line.strip())  # header
            continue
        if ID not in data:  # dict.has_key() no longer exists in Python 3
            data[ID] = [ID, chrom, txStart, txEnd, score, strand]
            continue
        i, c, ts, te, sc, st = data[ID]
        data[ID] = [i, c, strmin(txStart, ts), strmax(txEnd, te), sc, st]

# maybe you want to sort it here...
for ID in data:
    print('\t'.join(data[ID]))

This produces something slightly different from your expected result:

ID  chrom   txStart txEnd   score   strand
ENSMUSG00000042429      chr1    1       500     0       -
ENSMUSG00000042500      chr2    4       10000   0       -
ENSMUSG00000025909      chr3    1       200000  0       -

Perhaps you meant that (ID, chrom) should be unique? If so, just change the key to include chrom.
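
A minimal sketch of that change, assuming the same six whitespace-separated columns as in the question. The function name `merge_by_id_and_chrom` is hypothetical, and the sample lines are copied from the question; the sketch relies on dicts preserving insertion order (Python 3.7+).

```python
def merge_by_id_and_chrom(lines):
    """Merge rows sharing (ID, chrom), keeping min txStart and max txEnd."""
    data = {}
    for line in lines:
        fields = line.split()
        if not fields or fields[0] == "ID":  # skip blank lines and the header
            continue
        rec_id, chrom, tx_start, tx_end, score, strand = fields
        key = (rec_id, chrom)  # the key now includes chrom
        if key not in data:
            data[key] = [rec_id, chrom, int(tx_start), int(tx_end), score, strand]
        else:
            rec = data[key]
            rec[2] = min(rec[2], int(tx_start))
            rec[3] = max(rec[3], int(tx_end))
    return list(data.values())


sample = """\
ID  chrom   txStart txEnd   score   strand
ENSMUSG00000042429  chr1    1   100 0   -
ENSMUSG00000042429  chr1    110 500 0   -
ENSMUSG00000042500  chr2    12  40  0   -
ENSMUSG00000042500  chr2    200 10000   0   -
ENSMUSG00000042500  chr2    4   50  0   -
ENSMUSG00000042429  chr3    40  33  0   -
ENSMUSG00000025909  chr3    10000   200000  0   -
ENSMUSG00000025909  chr3    1   5   0   -
ENSMUSG00000025909  chr3    400 2000    0   -
""".splitlines()

result = merge_by_id_and_chrom(sample)
for rec in result:
    print("\t".join(str(value) for value in rec))
```

With the composite key, ENSMUSG00000042429 on chr3 stays a separate row instead of being folded into the chr1 entry.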

Answer 3

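Assuming you are building the dictionary as @FMc suggested, you can filter the txStart and txEnd values directly as you go.

If a key already exists, compare the stored value with the new one and replace it if the new one is smaller (for txStart) or larger (for txEnd). In the end, the single dictionary entry for each ID holds the minimum and the maximum respectively.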
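
The compare-and-replace step could be sketched as below. This is only an illustration: `update_range` is a hypothetical helper, and it assumes the values have already been converted to integers and that each ID maps to a `(txStart, txEnd)` tuple.

```python
def update_range(ranges, rec_id, tx_start, tx_end):
    """Record tx_start/tx_end for rec_id, keeping the extremes seen so far."""
    if rec_id in ranges:
        old_start, old_end = ranges[rec_id]
        # Replace txStart only if the new value is smaller,
        # and txEnd only if the new value is larger.
        if tx_start < old_start:
            old_start = tx_start
        if tx_end > old_end:
            old_end = tx_end
        ranges[rec_id] = (old_start, old_end)
    else:
        ranges[rec_id] = (tx_start, tx_end)


ranges = {}
for rec_id, tx_start, tx_end in [
    ("ENSMUSG00000025909", 10000, 200000),
    ("ENSMUSG00000025909", 1, 5),
    ("ENSMUSG00000025909", 400, 2000),
]:
    update_range(ranges, rec_id, tx_start, tx_end)

print(ranges)  # {'ENSMUSG00000025909': (1, 200000)}
```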

Algorithm for parsing a file in which redundant index values are resolved by taking the maximum and minimum values