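I am trying to trim down the BibTeX file exported from my reference manager, because it leaves in extra fields that break when I feed the file to LaTeX.
A typical entry I want to clean up is: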
@Article{Kholmurodov:2001p113,
author = {K Kholmurodov and I Puzynin and W Smith and K Yasuoka and T Ebisuzaki},
journal = {Computer Physics Communications},
title = {MD simulation of cluster-surface impacts for metallic phases: soft landing, droplet spreading and implantation},
abstract = {Lots of text here. Even more text.},
affiliation = {RIKEN, Inst Phys {\&} Chem Res, Computat Sci Div, Adv Comp Ctr, Wako, Saitama 3510198, Japan},
number = {1},
pages = {1--16},
volume = {141},
year = {2001},
month = {Dec},
language = {English},
keywords = {Ethane, molecular dynamics, Clusters, Dl_Poly Code, solid surface, metal, Hydrocarbon Thin-Films, Adsorption, impact, Impact Processes, solid surface, Molecular Dynamics Simulation, Large Systems, DL_POLY, Beam Deposition, Package, Collision-Induced Desorption, Diamond Films, Vapor-Deposition, Transition-Metals, Molecular-Dynamics Simulation},
date-added = {2008-06-27 08:58:25 -0500},
date-modified = {2009-03-24 15:40:27 -0500},
pmid = {000172275000001},
local-url = {file://localhost/User/user/Papers/2001/Kholmurodov/Kholmurodov-MD%20simulation%20of%20cluster-surface%20impacts-2001.pdf},
uri = {papers://B08E511A-2FA9-45A0-8612-FA821DF82090/Paper/p113},
read = {Yes},
rating = {0}
}
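I want to remove fields such as month, abstract, and keywords. Some of them span a single line, some span several lines.
This is what I tried in Python: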
fOpen = open(f,'r')
start_text = fOpen.read()
fOpen.close()
# regex
out_text = re.sub(r'^(month).*,\n','',start_text)
out_text = re.sub(r'^(annote)((.|\n)*?)\},\n','',out_text)
out_text = re.sub(r'^(note)((.|\n)*?)\},\n','',out_text)
out_text = re.sub(r'^(abstract)((.|\n)*?)\},\n','',out_text)
fNew = open(f,'w')
fNew.write(out_text)
fNew.close()
I tested these regular expressions in TextMate before trying them in Python, and they seemed to work there.
Any suggestions?
Thanks.
Answer 0 (score: 2)
How about this regular expression (to be used with the multiline and dotall flags):
^(?:month|annote|note|abstract)\s*=\s*\{(?:(?!\},$).)*\},[\r\n]+
Explanation:
^                                # start-of-line
(?:                              # non-capturing group 1
  month|annote|note|abstract     #   one of these terms
)                                # end non-capturing group 1
\s*=\s*                          # whitespace, an equals sign, whitespace
\{                               # a literal curly brace
(?:                              # non-capturing group 2
  (?!                            #   negative look-ahead (if not followed by...)
    \},$                         #     a curly brace, a comma and the end-of-line
  )                              #   end negative look-ahead
  .                              #   ...then match next character, whatever it is
)*                               # end non-capturing group 2, repeat
\},                              # a literal curly brace and a comma
[\r\n]+                          # at least one end-of-line character
This single expression removes all affected lines in one step.
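As a minimal sketch, this is how the expression could be applied from Python. In Python, `^` and `$` only match at line boundaries when `re.MULTILINE` is set, and `.` only matches newlines when `re.DOTALL` is set, so both flags are passed here. The names `FIELDS`, `FIELD_RE`, `strip_fields`, and the shortened `sample` entry are illustrative and not part of the original post.

```python
import re

# Field names to drop; extend this tuple as needed (illustrative choice).
FIELDS = ("month", "annote", "note", "abstract")

# Same pattern as above, compiled with the multiline and dotall flags.
FIELD_RE = re.compile(
    r"^(?:" + "|".join(FIELDS) + r")\s*=\s*\{(?:(?!\},$).)*\},[\r\n]+",
    re.MULTILINE | re.DOTALL,
)

def strip_fields(bib_text):
    """Return bib_text with every matching field removed."""
    return FIELD_RE.sub("", bib_text)

# Shortened, made-up entry with a single-line and a multi-line field.
sample = (
    "@Article{Kholmurodov:2001p113,\n"
    "title = {MD simulation},\n"
    "abstract = {Lots of text here.\nEven more text.},\n"
    "year = {2001},\n"
    "month = {Dec},\n"
    "rating = {0}\n"
    "}\n"
)

cleaned = strip_fields(sample)
print(cleaned)
```

Because the pattern requires `},` at the end of the field, a field that comes last in its entry (with no trailing comma) or that is indented is left in place by this sketch.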
Edit/caveat: note that this will fail whenever something like the following occurs:
affiliation = {RIKEN, Inst Phys {\&}, Computat Sci Div, Adv Comp Ctr, Wako, Saitama 3510198, Japan},
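Regular expressions cannot handle nested structures. There is no pure regex solution here that is correct in every case; the best you can get is a good approximation.
The question is whether you are 100% sure that the situation above will never occur (and I don't think you can be), or whether you are willing to take the risk. If you are not completely sure it won't be a problem, use or write a parser.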
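For illustration, here is a rough sketch of what a small hand-written parser might look like. It is not a full BibTeX parser: it assumes every field starts on its own line, values are delimited by braces rather than quotes, and the closing brace of an entry sits on a line of its own. The names `remove_fields`, `brace_delta`, and `FIELD_START` are made up for this example. Instead of looking for `},` it tracks the brace depth, so nested braces inside a value do not end the field early.

```python
import re

# A field starts with a name followed by "=" (hypothetical helper pattern).
FIELD_START = re.compile(r"^\s*([A-Za-z][\w-]*)\s*=")

def brace_delta(line):
    """Net change in brace depth on one line, ignoring backslash-escaped braces."""
    unescaped = re.sub(r"\\[{}]", "", line)
    return unescaped.count("{") - unescaped.count("}")

def remove_fields(bib_text, unwanted):
    """Drop every field whose name is in `unwanted`, including multi-line ones."""
    unwanted = {name.lower() for name in unwanted}
    kept = []
    depth = 0          # 0 = outside an entry, 1 = directly inside an entry
    skipping = False   # True while inside a field that is being dropped
    for line in bib_text.splitlines(keepends=True):
        if not skipping and depth == 1:
            m = FIELD_START.match(line)
            if m and m.group(1).lower() in unwanted:
                skipping = True
        depth += brace_delta(line)
        if skipping:
            # The field is complete once the depth is back at entry level.
            if depth <= 1:
                skipping = False
            continue
        kept.append(line)
    return "".join(kept)

# Made-up entry: the abstract has a nested "}," at the end of its first line.
sample = (
    "@Article{key,\n"
    "author = {A {B} C},\n"
    "abstract = {Line one {nested},\n"
    "more text},\n"
    "month = {Dec},\n"
    "year = {2001}\n"
    "}\n"
)

result = remove_fields(sample, ["abstract", "month"])
print(result)
```

In this sample the abstract contains a nested `},` at the end of a line, which is exactly where the regular expression would stop too early; the depth counter keeps skipping until the field really closes.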