Compare consecutive columns of a file and return the number of non-matching elements

Time: 2015-06-05 03:56:23

Tags: python file-handling

I have a text file that looks like this:

# sampleID  HGDP00511  HGDP00511   HGDP00512   HGDP00512   HGDP00513  HGDP00513
M rs4124251       0       0            A            G          0          A
M rs6650104       0       A            C            T          0          0
M rs12184279      0       0            G            A          T          0

I want to compare the consecutive columns and return the number of matching elements. I want to do this in Python. Earlier I did it with Bash and AWK (shell scripting), but that is very slow because I have a huge amount of data to process. I believe Python would be faster. However, I am very new to Python, and so far I have something like this:

for line in open("phased.txt"):
    columns = line.split("\t")

    for i in range(len(columns)-1):
        a = columns[i+3]
        b = columns[i+4]
        for j in range(len(a)):
            if a[j] != b[j]:
                print j

It obviously does not work. Because I am so new to Python, I don't really know what to change to make it work. (I know this code is completely wrong, and I guess I could use difflib or something similar. But I have never coded proficiently in Python before, so I am skeptical about proceeding.)

I want to compare each column (starting from the third) against every other column in the file and return the number of non-matching elements. I have 828 columns in total, so I need 828 * 828 outputs. (You can think of an n * n matrix in which the (i, j)-th element is the number of non-matching elements between columns i and j.) For the snippet above, my desired output would be: [output listing missing]

Any help with this would be greatly appreciated. Thanks.

2 answers:

Answer 0 (score: 0)

I strongly recommend using pandas rather than writing your own code.
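What follows is not the answerer's code, only a minimal sketch of one way pandas could be applied to the question. It assumes whitespace-separated fields, a single header line, and plain string inequality (so the placeholder `0` is compared like any other value). The function name `mismatch_matrix` and the inline `SAMPLE` text are illustrative.

```python
import io

import numpy as np
import pandas as pd

# The three sample rows from the question, inlined so the sketch is runnable.
SAMPLE = """\
# sampleID HGDP00511 HGDP00511 HGDP00512 HGDP00512 HGDP00513 HGDP00513
M rs4124251 0 0 A G 0 A
M rs6650104 0 A C T 0 0
M rs12184279 0 0 G A T 0
"""


def mismatch_matrix(source, first_col=2):
    """Return an n x n DataFrame of mismatch counts between data columns."""
    # Read every field as a string and skip the header line; keep_default_na
    # is off so that values such as "NA" are not turned into missing values.
    df = pd.read_csv(source, sep=r"\s+", header=None, skiprows=1,
                     dtype=str, keep_default_na=False)
    values = df.iloc[:, first_col:].to_numpy(dtype=object)
    n = values.shape[1]
    counts = np.zeros((n, n), dtype=int)
    for i in range(n):
        # Compare column i against all columns at once, then count per column.
        counts[i] = (values != values[:, [i]]).sum(axis=0)
    # Label rows and columns with 1-based file column numbers (3, 4, ...).
    labels = range(first_col + 1, first_col + 1 + n)
    return pd.DataFrame(counts, index=labels, columns=labels)


result = mismatch_matrix(io.StringIO(SAMPLE))
print(result)
```

For the real data, `mismatch_matrix("phased.txt")` would read the file directly. Looping over one column at a time keeps memory use proportional to the data size rather than to rows x columns x columns.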

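Answer 1 (score: 0)

Here is a way to solve this using only Python's built-in libraries. Let us know how it compares with Bash; 828 x 828 should be a walk in the park.

Element column counts:

For simplicity and illustration, I deliberately wrote the flipping of the sequence as a separate step. You can improve it by changing the logic, or by using class objects, function decorators, and so on.

Code (Python 2.7):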

shiftcol = 2  # shift columns as first two are to be ignored
with open('phased.txt') as f:
    data = [x.strip().split('\t')[shiftcol:] for x in f.readlines()][1:]

# Step 1: Flipping the data first
flip = []
for idx, rows in enumerate(data):
    for i in range(len(rows)):
        if len(flip) <= i:
            flip.append([])
        flip[i].append(rows[i])

# Step 2: counts store in temp dictionary
for idx, v in enumerate(flip):
    for e in v:
        tmp = {}
        for i, z in enumerate(flip):
            if i != idx and e != '0':
                # Dictionary to store results
                if i+1 not in tmp:  # note has_key will be deprecated
                    tmp[i+1] = {'match': 0, 'notma': 0}
                tmp[i+1]['match'] += z.count(e)
                tmp[i+1]['notma'] += len([x for x in z if x != e])

        # results compensate for column shift..
        for key, count in tmp.iteritems():
            print idx+shiftcol+1, key+shiftcol, ': ', count

Sample output

>>> 3 4 :  {'match': 0, 'notma': 3}
>>> 3 5 :  {'match': 0, 'notma': 3}
>>> 3 6 :  {'match': 2, 'notma': 1}
>>> 3 7 :  {'match': 2, 'notma': 1}
>>> 3 3 :  {'match': 1, 'notma': 2}
>>> 3 4 :  {'match': 1, 'notma': 2}
>>> 3 5 :  {'match': 1, 'notma': 2}
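
For comparison, here is a Python 3 sketch of the n x n mismatch count described in the question, using `zip(*rows)` for the flip in Step 1. It is not the answer's code. It compares values by plain string inequality (treating `0` like any other value, which is an assumption), splits on any whitespace rather than on tabs only, and the helper names `read_columns` and `count_mismatches` are made up for illustration.

```python
def read_columns(lines, shiftcol=2):
    """Skip the header line, drop the first `shiftcol` fields, transpose."""
    rows = [line.split()[shiftcol:] for line in lines if line.strip()][1:]
    # zip(*rows) turns rows into columns; it stops at the shortest row.
    return list(zip(*rows))


def count_mismatches(columns):
    """matrix[i][j] = number of rows where column i differs from column j."""
    n = len(columns)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            diff = sum(a != b for a, b in zip(columns[i], columns[j]))
            matrix[i][j] = matrix[j][i] = diff  # the count is symmetric
    return matrix


# The sample rows from the question, inlined so the sketch is runnable.
sample = [
    "# sampleID HGDP00511 HGDP00511 HGDP00512 HGDP00512 HGDP00513 HGDP00513",
    "M rs4124251 0 0 A G 0 A",
    "M rs6650104 0 A C T 0 0",
    "M rs12184279 0 0 G A T 0",
]

matrix = count_mismatches(read_columns(sample))
for row in matrix:
    print(row)
```

With the real file, the same functions accept an open file object, for example `with open('phased.txt') as f: matrix = count_mismatches(read_columns(f))`. Only the upper triangle is computed and then mirrored, which halves the number of column comparisons.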