Python: finding duplicates while preserving the note string

Time: 2015-03-04 20:06:26

Tags: python regex

The input looks like this:

assign (resid 3 and name H ) (resid 18 and name H ) 2.5 2.5 2.5 ! note string 1
assign (resid 16 and name H ) (resid 5 and name H ) 2.5 2.5 2.5 ! note string 2
assign (resid 42 and name H ) (resid 55 and name H ) 2.5 2.5 2.5 ! note string 3
assign (resid 44 and name H ) (resid 53 and name H ) 2.5 2.5 2.5 ! note string 4
assign (resid 53 and name H ) (resid 44 and name H ) 2.5 2.5 2.5 ! note string 5

If you look closely, lines 4 and 5 are duplicates; only (resid 44 and name H ) and (resid 53 and name H ) are swapped. My ideal output would be:

assign (resid 3 and name H ) (resid 18 and name H ) 2.5 2.5 2.5 ! note string 1
assign (resid 16 and name H ) (resid 5 and name H ) 2.5 2.5 2.5 ! note string 2
assign (resid 42 and name H ) (resid 55 and name H ) 2.5 2.5 2.5 ! note string 3
assign (resid 44 and name H ) (resid 53 and name H ) 2.5 2.5 2.5 ! DUPLICATE ! note string 4 ! note string 5

So I started with the typical way of reading a file in Python.

txt = open(filename)
lines = txt.readlines()  # read every line into a list

print ( lines[0] )

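I obviously need to capture the strings between the () and then do some kind of search. Grabbing those with a regex is child's play. My idea was to use match[0] and match[1] in a nested loop and search for them. My failed attempt was: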

for i in lines:
#   match = re.search("\\(.*?\\)", i)
    match = re.findall('\\(.*?\\)',i)
    for x in i:
        mm = re.search("match[0] match[1]", lines)
        print ( mm )
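
If I print them, match[0] and match[1] give me what I want. What is the best way to perform this search so that the note string is preserved and carried over? I imagine adding DUPLICATE to the note string would be trivial.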

I'm really only interested in a Python solution. I also need to use it in a 400-line program I have written.

Thanks

2 answers:

Answer 0: (score: 2)

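Someone more proficient with regex can probably point to a better way of getting the key, but storing a tuple as the key and checking whether its reverse already exists should work: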

lines = """assign (resid 3 and name H ) (resid 18 and name H ) 2.5 2.5 2.5 ! note string 1
assign (resid 16 and name H ) (resid 5 and name H ) 2.5 2.5 2.5 ! note string 2
assign (resid 42 and name H ) (resid 55 and name H ) 2.5 2.5 2.5 ! note string 3
assign (resid 44 and name H ) (resid 53 and name H ) 2.5 2.5 2.5 ! note string 4
assign (resid 53 and name H ) (resid 44 and name H ) 2.5 2.5 2.5 ! note string 5"""

import re
from pprint import pprint as pp

d = {}

r1 = re.compile(r"(?<=\))\s")
r2 = re.compile(r"\(.*\)")

for line in lines.splitlines():
    key = tuple(r1.split(r2.findall(line)[0]))
    # ("foo","bar") == ("bar","foo") , also check current key is not in d
    if tuple(reversed(key)) not in d and key not in d:
        d[key] = line

pp(list(d.values()))

Output:

['assign (resid 42 and name H ) (resid 55 and name H ) 2.5 2.5 2.5 ! note '
 'string 3',
 'assign (resid 16 and name H ) (resid 5 and name H ) 2.5 2.5 2.5 ! note '
 'string 2',
 'assign (resid 3 and name H ) (resid 18 and name H ) 2.5 2.5 2.5 ! note '
 'string 1',
 'assign (resid 44 and name H ) (resid 53 and name H ) 2.5 2.5 2.5 ! note '
 'string 4']
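
As a possible simplification of the key extraction, a non-greedy pattern returns both parenthesised groups from a single findall call, so the separate split step is not needed. This is a minimal sketch, assuming every line contains exactly two such groups; the helper name pair_key is made up for illustration:

```python
import re

# Non-greedy: stops at the first ")" so each "(...)" group is matched separately.
GROUP = re.compile(r"\(.*?\)")

def pair_key(line):
    """Return the parenthesised groups of a line as a tuple (hypothetical helper)."""
    return tuple(GROUP.findall(line))

line = "assign (resid 44 and name H ) (resid 53 and name H ) 2.5 2.5 2.5 ! note string 4"
key = pair_key(line)
print(key)
```

The resulting tuple can be used in place of the r1/r2 combination above; the reversed-key check stays the same.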

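If order matters, use collections.OrderedDict. I'm not sure exactly what you want added to the string, but this appends DUPLICATE followed by the note (for example DUPLICATE ! note string 5) to the existing key's value: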

from collections import OrderedDict
from pprint import pprint as pp
import re

d = OrderedDict()

r1 = re.compile(r"(?<=\))\s")
r2 = re.compile(r"\(.*\)")
for line in lines.splitlines():
    key = tuple(r1.split(r2.findall(line)[0]))
    # (resid 44 and name H ) (resid 53 and name H ) ->  (resid 53 and name H ) (resid 44 and name H )
    rev_k = tuple(reversed(key))
    if rev_k in d:
        d[rev_k] += " DUPLICATE " + " ".join(line.rsplit(None,4)[1:])
    elif key in d:
        d[key] += " DUPLICATE " + " ".join(line.rsplit(None,4)[1:])
    else:
        d[key] = line

pp(list(d.values()))

Output:

['assign (resid 3 and name H ) (resid 18 and name H ) 2.5 2.5 2.5 ! note '
 'string 1',
 'assign (resid 16 and name H ) (resid 5 and name H ) 2.5 2.5 2.5 ! note '
 'string 2',
 'assign (resid 42 and name H ) (resid 55 and name H ) 2.5 2.5 2.5 ! note '
 'string 3',
 'assign (resid 44 and name H ) (resid 53 and name H ) 2.5 2.5 2.5 ! note '
 'string 4 DUPLICATE ! note string 5']
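
On Python 3.7 and later a plain dict also keeps insertion order, so OrderedDict is mainly needed on older interpreters. A small illustration with made-up keys:

```python
# Plain dicts preserve insertion order on Python 3.7+ (guaranteed by the language).
d = {}
for k in ["first", "second", "third"]:
    d[k] = k.upper()

order = list(d)  # iterating a dict yields its keys in insertion order
print(order)
```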

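Depending on what you want to do, you could keep a list per key and append the original line plus DUPLICATE ! string ... each time, so the original string seen before any dup is the first element and the rest are the DUPLICATE ! string ... entries: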

lines = """assign (resid 3 and name H ) (resid 18 and name H ) 2.5 2.5 2.5 ! note string 1
assign (resid 16 and name H ) (resid 5 and name H ) 2.5 2.5 2.5 ! note string 2
assign (resid 42 and name H ) (resid 55 and name H ) 2.5 2.5 2.5 ! note string 3
assign (resid 44 and name H ) (resid 53 and name H ) 2.5 2.5 2.5 ! note string 4
assign (resid 53 and name H ) (resid 44 and name H ) 2.5 2.5 2.5 ! note string 5
assign (resid 53 and name H ) (resid 44 and name H ) 2.5 2.5 2.5 ! note string 6"""

import re
from collections import defaultdict
from pprint import pprint as pp

d = defaultdict(list)
r1 = re.compile(r"(?<=\))\s")
r2 = re.compile(r"\(.*\)")

for line in lines.splitlines():
    key = tuple(r1.split(r2.findall(line)[0]))
    rev_k = tuple(reversed(key))
    # trailing "! note string N" part of the line
    note = " ".join(line.rsplit(None, 4)[1:])
    if rev_k in d:
        d[rev_k].append(line + " DUPLICATE " + note)
    elif key in d:
        # append to the list; += with a str would extend it character by character
        d[key].append(line + " DUPLICATE " + note)
    else:
        d[key].append(line)

pp(list(d.values()))
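
To get closer to the layout asked for in the question (one line per pair, with ! DUPLICATE followed by every note), the notes could be collected per key and joined at the end. This is only a sketch: it assumes each line contains exactly one "!" separating the data from the note and Python 3.7+ dict ordering, and merge_duplicates is a made-up name.

```python
import re

GROUP = re.compile(r"\(.*?\)")

def merge_duplicates(lines):
    """Merge lines whose two groups match in either order (hypothetical helper)."""
    merged = {}  # order-independent key -> [data part of first line, list of notes]
    for line in lines:
        key = frozenset(GROUP.findall(line))
        data, _, note = line.partition("!")
        entry = merged.setdefault(key, [data.rstrip(), []])
        entry[1].append("! " + note.strip())
    out = []
    for data, notes in merged.values():
        # Only pairs seen more than once get the DUPLICATE marker.
        prefix = "! DUPLICATE " if len(notes) > 1 else ""
        out.append(data + " " + prefix + " ".join(notes))
    return out

sample = [
    "assign (resid 42 and name H ) (resid 55 and name H ) 2.5 2.5 2.5 ! note string 3",
    "assign (resid 44 and name H ) (resid 53 and name H ) 2.5 2.5 2.5 ! note string 4",
    "assign (resid 53 and name H ) (resid 44 and name H ) 2.5 2.5 2.5 ! note string 5",
]
result = merge_duplicates(sample)
for row in result:
    print(row)
```

The first occurrence supplies the data part of the merged line, so the swapped copy only contributes its note.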

Answer 1: (score: 0)

Build a simple dictionary (or OrderedDict) with the sorted values as the key and the whole line (or the note) as the value.

Let's assume this is the part you want to be unique:

>>> re.findall(r"\(.*?\)", lns[3])
['(resid 44 and name H )', '(resid 53 and name H )']

So you can prepare an order-independent key:

>>> tmp1 = set(re.findall(r"\(.*?\)", lns[3])) # Line 4
>>> tmp2 = set(re.findall(r"\(.*?\)", lns[4])) # Line 5
>>> tmp1
{'(resid 44 and name H )', '(resid 53 and name H )'}
>>> tmp2
{'(resid 44 and name H )', '(resid 53 and name H )'}
>>> tmp1 == tmp2
True

A set is not hashable, so you have to convert it to e.g. a tuple before it can be used as a key for a dictionary:

  

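A dictionary's keys are almost arbitrary values. Values that are not hashable, that is, values containing lists, dictionaries or other mutable types (that are compared by value rather than by object identity) may not be used as keys.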

key = tuple(sorted(set(re.findall(r"\(.*?\)", lns[3]))))  # sorted() fixes the element order
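
The hashability point can be checked directly. A small sketch with made-up values: a set is rejected as a dictionary key, while a frozenset is accepted and, unlike a tuple, compares equal regardless of element order:

```python
a = {"(resid 44 and name H )", "(resid 53 and name H )"}
b = {"(resid 53 and name H )", "(resid 44 and name H )"}

try:
    {a: "line 4"}          # a mutable set cannot be hashed
    set_ok = True
except TypeError:
    set_ok = False

seen = {frozenset(a): "line 4"}   # frozenset is immutable and hashable
found = frozenset(b) in seen      # same elements, so the lookup succeeds
print(set_ok, found)
```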

Now you just need to store the line and possibly the number of times the key was seen:

import re

result = {}

with open(filename, 'r') as file:
    for line in file:
        # sorted() makes the key independent of set iteration order
        key = tuple(sorted(set(re.findall(r"\(.*?\)", line))))

        if key in result:
            result[key][1] += 1  # index 1 holds the count
        else:
            result[key] = [line.strip(), 1]

for line, count in result.values():
    print('Seen line', line, count, 'times')

Or store every line under its key:

import collections

result = collections.defaultdict(list)

# ...
        key = tuple(sorted(set(re.findall(r"\(.*?\)", line))))

        result[key].append(line.strip())

# And nice printing
for key, lines in result.items():
    print('Seen', key, 'on following lines:')
    for l in lines:
        print('\t', l)
    print()