The input looks like this:
assign (resid 3 and name H ) (resid 18 and name H ) 2.5 2.5 2.5 ! note string 1
assign (resid 16 and name H ) (resid 5 and name H ) 2.5 2.5 2.5 ! note string 2
assign (resid 42 and name H ) (resid 55 and name H ) 2.5 2.5 2.5 ! note string 3
assign (resid 44 and name H ) (resid 53 and name H ) 2.5 2.5 2.5 ! note string 4
assign (resid 53 and name H ) (resid 44 and name H ) 2.5 2.5 2.5 ! note string 5
If you look closely, lines 4 and 5 are duplicates of each other; only (resid 44 and name H ) and (resid 53 and name H ) are swapped. My ideal output would be:
assign (resid 3 and name H ) (resid 18 and name H ) 2.5 2.5 2.5 ! note string 1
assign (resid 16 and name H ) (resid 5 and name H ) 2.5 2.5 2.5 ! note string 2
assign (resid 42 and name H ) (resid 55 and name H ) 2.5 2.5 2.5 ! note string 3
assign (resid 44 and name H ) (resid 53 and name H ) 2.5 2.5 2.5 ! DUPLICATE ! note string 4 ! note string 5
So I started with the typical way of reading a file in Python:
with open(filename) as txt:
    lines = txt.readlines()
print(lines[0])
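I obviously need to capture the strings between ( and ) and then do some kind of search. Grabbing them with a regular expression is the easy part. My idea was to use match[0] and match[1] in a nested loop and search for them. My failed attempt: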
for i in lines:
    # match = re.search("\\(.*?\\)", i)
    match = re.findall('\\(.*?\\)',i)
    for x in i:
        mm = re.search("match[0] match[1]", lines)
        print ( mm )
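If I print them, match[0] and match[1] give me what I want. What is the best way to do this search so that the note strings are kept and carried over? I imagine adding DUPLICATE to the note string will be trivial.
I'm really only interested in a Python solution. I also need it to fit into a 400-line program I have written.
Thanks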
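Answer 0 (score: 2)
Someone more proficient with regular expressions can probably point to a better way of getting the key, but storing a tuple as the key and reversing it to check whether it already exists should work: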
lines = """assign (resid 3 and name H ) (resid 18 and name H ) 2.5 2.5 2.5 ! note string 1
assign (resid 16 and name H ) (resid 5 and name H ) 2.5 2.5 2.5 ! note string 2
assign (resid 42 and name H ) (resid 55 and name H ) 2.5 2.5 2.5 ! note string 3
assign (resid 44 and name H ) (resid 53 and name H ) 2.5 2.5 2.5 ! note string 4
assign (resid 53 and name H ) (resid 44 and name H ) 2.5 2.5 2.5 ! note string 5"""
import re
from pprint import pprint as pp

d = {}
r1 = re.compile(r"(?<=\))\s")  # whitespace that directly follows a ")"
r2 = re.compile(r"\(.*\)")     # greedy: from the first "(" to the last ")"
for line in lines.splitlines():
    key = tuple(r1.split(r2.findall(line)[0]))
    # ("foo", "bar") and ("bar", "foo") count as the same key; also check the current key is not in d
    if tuple(reversed(key)) not in d and key not in d:
        d[key] = line
pp(list(d.values()))
Output:
['assign (resid 42 and name H ) (resid 55 and name H ) 2.5 2.5 2.5 ! note '
'string 3',
'assign (resid 16 and name H ) (resid 5 and name H ) 2.5 2.5 2.5 ! note '
'string 2',
'assign (resid 3 and name H ) (resid 18 and name H ) 2.5 2.5 2.5 ! note '
'string 1',
'assign (resid 44 and name H ) (resid 53 and name H ) 2.5 2.5 2.5 ! note '
'string 4']
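If order matters, use collections.OrderedDict. I'm not sure exactly what you want appended to the string, but this adds DUPLICATE ! string 5 and so on to the value stored under the existing key: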
from collections import OrderedDict
from pprint import pprint as pp
import re

d = OrderedDict()
r1 = re.compile(r"(?<=\))\s")
r2 = re.compile(r"\(.*\)")
for line in lines.splitlines():
    key = tuple(r1.split(r2.findall(line)[0]))
    # (resid 44 and name H ) (resid 53 and name H ) -> (resid 53 and name H ) (resid 44 and name H )
    rev_k = tuple(reversed(key))
    if rev_k in d:
        d[rev_k] += " DUPLICATE " + " ".join(line.rsplit(None,4)[1:])
    elif key in d:
        d[key] += " DUPLICATE " + " ".join(line.rsplit(None,4)[1:])
    else:
        d[key] = line
pp(list(d.values()))
Output:
['assign (resid 3 and name H ) (resid 18 and name H ) 2.5 2.5 2.5 ! note '
'string 1',
'assign (resid 16 and name H ) (resid 5 and name H ) 2.5 2.5 2.5 ! note '
'string 2',
'assign (resid 42 and name H ) (resid 55 and name H ) 2.5 2.5 2.5 ! note '
'string 3',
'assign (resid 44 and name H ) (resid 53 and name H ) 2.5 2.5 2.5 ! note '
'string 4 DUPLICATE ! string 5']
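The question asks for a slightly different layout, with a single DUPLICATE marker followed by every note. A minimal sketch of one way to get that is shown here; the helper name merge_duplicates is hypothetical, it assumes each line holds at most one "!" that starts the note, and it relies on dicts keeping insertion order (Python 3.7+):

```python
import re

# Matches each parenthesised selection, e.g. "(resid 44 and name H )".
SELECTION = re.compile(r"\(.*?\)")

def merge_duplicates(lines):
    """Merge lines whose selections are equal in either order (hypothetical helper)."""
    merged = {}  # frozenset of selections -> [first line seen, text before "!", notes]
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # A frozenset ignores order, so (A, B) and (B, A) produce the same key.
        key = frozenset(SELECTION.findall(line))
        body, _, note = line.partition("!")
        if key in merged:
            merged[key][2].append(note.strip())
        else:
            merged[key] = [line, body.rstrip(), [note.strip()]]
    out = []
    for first_line, body, notes in merged.values():
        if len(notes) > 1:
            kept = [n for n in notes if n]  # skip lines that had no note
            out.append(body + " ! DUPLICATE ! " + " ! ".join(kept))
        else:
            out.append(first_line)  # unique lines pass through unchanged
    return out
```

Calling merge_duplicates(lines.splitlines()) on the five sample lines returns four lines, the last one ending in "! DUPLICATE ! note string 4 ! note string 5".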
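Depending on what you want to do, you can instead append the original line and each DUPLICATE ! string ... entry to a list, so the original string from before we saw a dup will be the first element and the rest will be the DUPLICATE ! string ... entries: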
lines = """assign (resid 3 and name H ) (resid 18 and name H ) 2.5 2.5 2.5 ! note string 1
assign (resid 16 and name H ) (resid 5 and name H ) 2.5 2.5 2.5 ! note string 2
assign (resid 42 and name H ) (resid 55 and name H ) 2.5 2.5 2.5 ! note string 3
assign (resid 44 and name H ) (resid 53 and name H ) 2.5 2.5 2.5 ! note string 4
assign (resid 53 and name H ) (resid 44 and name H ) 2.5 2.5 2.5 ! note string 5
assign (resid 53 and name H ) (resid 44 and name H ) 2.5 2.5 2.5 ! note string 6"""
import re
from collections import defaultdict
from pprint import pprint as pp

d = defaultdict(list)
r1 = re.compile(r"(?<=\))\s")
r2 = re.compile(r"\(.*\)")
for line in lines.splitlines():
    key = tuple(r1.split(r2.findall(line)[0]))
    rev_k = tuple(reversed(key))
    # First element of each list is the original line; later ones are the duplicate notes.
    if rev_k in d:
        d[rev_k].append("DUPLICATE " + " ".join(line.rsplit(None,4)[1:]))
    elif key in d:
        d[key].append("DUPLICATE " + " ".join(line.rsplit(None,4)[1:]))
    else:
        d[key].append(line)
pp(list(d.values()))
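If only the repeated pairs are of interest, the same grouping idea can be used to report just those. This is a rough sketch rather than part of the code above: the function name duplicate_groups is made up for illustration, and it sorts the selections to build the key instead of checking the reversed tuple.

```python
import re
from collections import defaultdict

# Matches each parenthesised selection, e.g. "(resid 44 and name H )".
SELECTION = re.compile(r"\(.*?\)")

def duplicate_groups(lines):
    """Return the groups of lines that share the same pair of selections."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if line:
            # Sorting makes the key independent of the order of the selections.
            key = tuple(sorted(SELECTION.findall(line)))
            groups[key].append(line)
    # Keep only the keys that were seen more than once.
    return [group for group in groups.values() if len(group) > 1]
```

With the six sample lines above, duplicate_groups(lines.splitlines()) gives a single group holding the lines with notes 4, 5 and 6.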
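Answer 1 (score: 0)
Build a simple dictionary (or OrderedDict) with the sorted values as the key and the whole line (or the note) as the value.
Let's assume this is the part you want to be unique:
>>> re.findall(r"\(.*?\)", lns[3])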
['(resid 44 and name H )', '(resid 53 and name H )']
So you can prepare order-independent keys:
>>> tmp1 = set(re.findall(r"\(.*?\)", lns[3])) # Line 4
>>> tmp2 = set(re.findall(r"\(.*?\)", lns[4])) # Line 5
>>> tmp1
{'(resid 44 and name H )', '(resid 53 and name H )'}
>>> tmp2
{'(resid 44 and name H )', '(resid 53 and name H )'}
>>> tmp1 == tmp2
True
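But a set is not hashable, so you have to convert it to, for example, a tuple so that it can be used as a dictionary key:
A dictionary's keys are almost arbitrary values. Values that are not hashable, that is, values containing lists, dictionaries or other mutable types (that are compared by value rather than by object identity) may not be used as keys.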
key = tuple(sorted(set(re.findall(r"\(.*?\)", lns[3]))))  # sorted, so the tuple order is deterministic
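To make the hashability point concrete, here is a small sketch. The variable names are illustrative only. It also shows two order-independent keys, a sorted tuple and a frozenset; turning a set straight into a tuple depends on the set's iteration order, which Python does not promise to be identical for two equal sets, so these two forms are the safer choice.

```python
import re

line4 = "assign (resid 44 and name H ) (resid 53 and name H ) 2.5 2.5 2.5 ! note string 4"
line5 = "assign (resid 53 and name H ) (resid 44 and name H ) 2.5 2.5 2.5 ! note string 5"

def selections(line):
    """Return the parenthesised selections found in a line."""
    return re.findall(r"\(.*?\)", line)

# A plain set is mutable and therefore unhashable: using it as a key raises TypeError.
try:
    {set(selections(line4)): line4}
    set_is_hashable = True
except TypeError:
    set_is_hashable = False

# Option 1: a sorted tuple, identical for both orderings.
key_sorted_4 = tuple(sorted(selections(line4)))
key_sorted_5 = tuple(sorted(selections(line5)))

# Option 2: a frozenset, which is hashable and ignores order.
seen = {frozenset(selections(line4)): line4}

print(set_is_hashable)                       # False
print(key_sorted_4 == key_sorted_5)          # True
print(frozenset(selections(line5)) in seen)  # True
```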
Then you only need to store the line and possibly a count of how many times the key was seen:
import re

result = {}
with open(filename, 'r') as file:
    for line in file:
        key = tuple(sorted(set(re.findall(r"\(.*?\)", line))))
        if key in result:
            result[key][1] += 1  # index 1 holds the counter
        else:
            result[key] = [line.strip(), 1]
for line, count in result.values():
    print('Seen line', line, count, 'times')
Or store every line under its key:
result = collections.defaultdict(list)
# ... (inside the same file loop as before)
        key = tuple(sorted(set(re.findall(r"\(.*?\)", line))))
        result[key].append(line.strip())
# And nice printing
for key, lines in result.items():
    print('Seen', key, 'on following lines:')
    for l in lines:
        print('\t', l)
    print()