I have the following problem. Say I have pairs of strings in a dictionary:
left              right
british           7
cuneate nucleus   Medulla oblongata
Motoneurons       anterior
I have some test lines in a file, like these:
<s id="69-7">British Meanwhile is the studio 7 album by british pop band 10cc 7.</s>
<s id="5239778-2">Medulla oblongata,the name refers collectively to the cuneate nucleus and gracile nucleus, which are present at the junction between the spinal cord and the medulla oblongata.</s>
<s id="21120-99">Terior horn cells, motoneurons located in the spinal.</s>
I would like to get output like this:
<s id="69-7"><w2>British</w2> Meanwhile is the studio <w2>7</w2> album by <w1>british</w1> pop band 10cc <w2>7</w2>.</s>
<s id="5239778-2"><w2>Medulla oblongata</w2>,the name refers collectively to the <w1>cuneate nucleus</w1> and gracile nucleus, which are present at the junction between the spinal cord and the <w2>medulla oblongata</w2>.</s>
I tried the following code:
import re
def textReturn(left, right):
    text = ""
    filetext = open("text.xml", "r").read()
    linelist = re.split(u'[\n|\r\n]+', filetext)
    for i in linelist:
        left = left.strip()
        right = right.strip()
        if left in i and right in i:
            i1 = re.sub('(?i)(\s+)(%s)(\s+)' % left, '\\1<w1>\\2</w1>\\3', i)
            i2 = re.sub('(?i)(\s+)(%s)(\s+)' % right, '\\1<w2>\\2</w2>\\3', i1)
            text = text + i2 + "\n"
    return text
But it gives me:
'<s id="69-7">British meanwhile is the studio <w2>7</w2> album by <w1>British</w1> pop band 10cc 7.</s>'.
<s id="5239778-2">Medulla oblongata,the name refers collectively to the <w1>cuneate nucleus</w1> and gracile nucleus, which are present at the junction between the spinal cord and the medulla oblongata.</s>
<s id="21120-99">Terior horn cells, <w1>motoneurons</w2> located in the spinal.</s>
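That is, it fails to tag a string when it occurs at the beginning or end of the line.
Also, I only want to return the lines that match both the left and the right string, not the other lines.
Any solution would be welcome. Thanks a lot!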
Answer 0 (score: 3)
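It does not tag at the beginning and end because your pattern requires one or more whitespace characters before and after the keyword. Instead of \s+, use \b (word boundary).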
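To make the difference concrete, here is a minimal sketch comparing the two patterns on a shortened version of the first test line. The variable names and the use of re.escape are my own additions for illustration, not part of the original code.

```python
import re

line = "British pop band 10cc 7"
keyword = "british"

# \s+ on both sides demands whitespace around the keyword, so an occurrence
# at the very start (or end) of the string is never matched.
spaced = re.sub(r"(?i)(\s+)(%s)(\s+)" % re.escape(keyword),
                r"\1<w1>\2</w1>\3", line)

# \b is a zero-width word boundary: it also matches at the string edges and
# next to punctuation, and it consumes no characters.
bounded = re.sub(r"(?i)\b(%s)\b" % re.escape(keyword),
                 r"<w1>\1</w1>", line)

print(spaced)   # unchanged, because "British" has no whitespace before it
print(bounded)  # the leading "British" is wrapped in <w1>
```

Because \b consumes nothing, the replacement no longer needs the extra groups that put the surrounding whitespace back.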
Addendum
Actual code:
import re
# (left, right) keyword pairs; pair i is applied to line i below.
pairs = [('british', '7'), ('cuneate nucleus', 'Medulla oblongata'), ('Motoneurons', 'anterior')]
filetext = """<s id="69-7">British Meanwhile is the studio 7 album by british pop band 10cc 7.</s>
<s id="5239778-2">Medulla oblongata,the name refers collectively to the cuneate nucleus and gracile nucleus, which are present at the junction between the spinal cord and the medulla oblongata.</s>
<s id="21120-99">Terior horn cells, motoneurons located in the spinal.</s>
"""
linelist = filetext.splitlines()
# Split each line into opening tag, content, and closing tag.
s_tag = re.compile(r"(<s[^>]+>)(.*?)(</s>)")
for (left, right), line in zip(pairs, linelist):
    line_parts = s_tag.search(line)
    start, content, end = line_parts.groups()
    # \b matches at word boundaries, including the start and end of the content.
    left_match = r"(?i)\b(%s)\b" % re.escape(left)
    right_match = r"(?i)\b(%s)\b" % re.escape(right)
    # Only print lines that contain both keywords.
    if re.search(left_match, content) and re.search(right_match, content):
        line1 = re.sub(left_match, r'<w1>\1</w1>', content)
        line2 = re.sub(right_match, r'<w2>\1</w2>', line1)
        print(start + line2 + end)
This is the basis for a short-term solution, but in the long run you should try an XML parser approach.
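Until then, the short-term regex approach can be wrapped in a function that tries every pair against every line and returns only the lines where both keywords occur, which is what the question asks for. This is only a sketch: tag_line and tag_lines are hypothetical helper names, and it assumes each line holds exactly one <s> element and that keywords begin and end with word characters (otherwise \b will not behave as intended).

```python
import re

S_TAG = re.compile(r"(<s[^>]*>)(.*?)(</s>)")

def tag_line(content, left, right):
    """Wrap left in <w1> and right in <w2>; return None if either is absent."""
    left_re = re.compile(r"\b(%s)\b" % re.escape(left), re.IGNORECASE)
    right_re = re.compile(r"\b(%s)\b" % re.escape(right), re.IGNORECASE)
    if not (left_re.search(content) and right_re.search(content)):
        return None
    tagged = left_re.sub(r"<w1>\1</w1>", content)
    return right_re.sub(r"<w2>\1</w2>", tagged)

def tag_lines(lines, pairs):
    """Return only the lines in which some (left, right) pair fully matches."""
    out = []
    for line in lines:
        m = S_TAG.search(line)
        if m is None:
            continue  # not an <s>...</s> line
        start, content, end = m.groups()
        for left, right in pairs:
            tagged = tag_line(content, left, right)
            if tagged is not None:
                out.append(start + tagged + end)
                break  # stop at the first pair that matches this line
    return out

pairs = [("british", "7"),
         ("cuneate nucleus", "Medulla oblongata"),
         ("Motoneurons", "anterior")]
lines = [
    '<s id="69-7">British Meanwhile is the studio 7 album by british pop band 10cc 7.</s>',
    '<s id="5239778-2">Medulla oblongata,the name refers collectively to the cuneate nucleus and gracile nucleus, which are present at the junction between the spinal cord and the medulla oblongata.</s>',
    '<s id="21120-99">Terior horn cells, motoneurons located in the spinal.</s>',
]
result = tag_lines(lines, pairs)
for tagged_line in result:
    print(tagged_line)
```

The third test line is dropped because "anterior" never appears in it as a whole word, so only two lines are returned.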
Answer 1 (score: 2)
If your input file is going to be an XML file, why not use an XML parser? See here: 19.5. xml.parsers.expat — Fast XML parsing using Expat
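As a rough illustration of that route, the sketch below uses xml.parsers.expat callbacks to pull the id and text out of each <s> element. The function name extract_sentences and the <doc> wrapper element are assumptions of mine: expat requires a single root element, and the sample lines in the question do not show one.

```python
import xml.parsers.expat

def extract_sentences(xml_text):
    """Collect (id, text) for every <s> element using expat callbacks."""
    sentences = []
    current = {"id": None, "chunks": None}  # chunks is None outside <s>

    def start_element(name, attrs):
        if name == "s":
            current["id"] = attrs.get("id")
            current["chunks"] = []

    def char_data(data):
        # expat may deliver the text of one element in several pieces.
        if current["chunks"] is not None:
            current["chunks"].append(data)

    def end_element(name):
        if name == "s" and current["chunks"] is not None:
            sentences.append((current["id"], "".join(current["chunks"])))
            current["chunks"] = None

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.CharacterDataHandler = char_data
    parser.EndElementHandler = end_element
    # Hypothetical <doc> root so that several <s> lines form one document.
    parser.Parse("<doc>" + xml_text + "</doc>", True)
    return sentences

sample = ('<s id="69-7">British pop band 10cc 7.</s>\n'
          '<s id="21120-99">Terior horn cells.</s>')
sentences = extract_sentences(sample)
for sent_id, text in sentences:
    print(sent_id, text)
```

The extracted text can then be tagged with the word-boundary regex. Since the parser decodes entities such as &amp;amp;, the text would need to be escaped again before being written back out as XML.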