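I want to read in an XML file, find all sentences that contain both the tag <emotion> and the tag <LOCATION>, and then print each of those sentences on its own line. Here is a code sample: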
import re
text = "Cello is a <emotion> wonderful </emotion> parakeet who lives in <LOCATION> Omaha </LOCATION>. He is the <emotion> best </emotion> singer <pronoun> I </pronoun> have ever heard."
out = open('out.txt', 'w')
for match in re.findall(r'(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\bwonderful(?=\s|\.|$))(?=(?:(?!\.(?:\s|$)).)*?\bomaha(?=\s|\.|$)).*?\.(?=\s|$))', text, flags=re.I):
    line = ''.join(str(x) for x in match)
    out.write(line + '\n')
out.close()
The regex here grabs all sentences that contain "wonderful" and "Omaha", and returns:
Cello is a <emotion> wonderful </emotion> parakeet who lives in <LOCATION> Omaha </LOCATION>.
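That is perfect, but what I really want is to print all sentences that contain both <emotion> and <LOCATION>. For some reason, however, when I replace "wonderful" in the regex above with "emotion", the regex returns no output. So the following code produces no result: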
import re
text = "Cello is a <emotion> wonderful </emotion> parakeet who lives in <LOCATION> Omaha </LOCATION>. He is the <emotion> best </emotion> singer I have ever heard."
out = open('out.txt', 'w')
for match in re.findall(r'(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\bemotion(?=\s|\.|$))(?=(?:(?!\.(?:\s|$)).)*?\bomaha(?=\s|\.|$)).*?\.(?=\s|$))', text, flags=re.I):
    line = ''.join(str(x) for x in match)
    out.write(line + '\n')
out.close()
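My question is: how can I modify my regex so that it grabs only those sentences that contain both <emotion> and <LOCATION>? I would be most grateful for any help others can offer on this question.
(For what it's worth, I'm parsing my text in BeautifulSoup, but wanted to give regular expressions one last shot.)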
Answer 0 (score: 1)
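Your problem appears to be that your regex requires whitespace (\s), a period, or the end of the string to follow the matched word, as seen here: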
emotion(?=\s|\.|$)
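Because the word is part of a tag, it is followed by > rather than whitespace, so the lookahead fails and no match is found. To fix this, add > after emotion, for example: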
for match in re.findall(r'(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\bemotion>(?=\s|\.|$))(?=(?:(?!\.(?:\s|$)).)*?\bomaha(?=\s|\.|$)).*?\.(?=\s|$))', text, flags=re.I):
    line = ''.join(str(x) for x in match)
I tested this and it appears to solve your problem. Be sure to treat "LOCATION" the same way:
for match in re.findall(r'(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\bemotion>(?=\s|\.|$))(?=(?:(?!\.(?:\s|$)).)*?\bLOCATION>(?=\s|\.|$)).*?\.(?=\s|$))', text, flags=re.I):
    line = ''.join(str(x) for x in match)
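As a quick check of the explanation above, here is a minimal, self-contained sketch that runs both the original lookahead and the corrected pattern against the sample text from the question. The variable names (`old_hit`, `pattern`, `matches`) are only illustrative, and the pattern is split across several string literals purely for readability.

```python
import re

text = ("Cello is a <emotion> wonderful </emotion> parakeet who lives in "
        "<LOCATION> Omaha </LOCATION>. He is the <emotion> best </emotion> "
        "singer I have ever heard.")

# The original lookahead never succeeds here, because every "emotion"
# in the text is followed by ">" rather than whitespace, "." or the end.
old_hit = re.search(r'\bemotion(?=\s|\.|$)', text, flags=re.I)
print(old_hit)

# Same pattern as in the answer, with ">" appended to both tag names.
pattern = re.compile(
    r'(?:(?<=\.)\s+|^)'
    r'((?=(?:(?!\.(?:\s|$)).)*?\bemotion>(?=\s|\.|$))'
    r'(?=(?:(?!\.(?:\s|$)).)*?\bLOCATION>(?=\s|\.|$))'
    r'.*?\.(?=\s|$))',
    flags=re.I,
)

# With a single capture group, findall returns a list of strings.
matches = pattern.findall(text)
for sentence in matches:
    print(sentence)
```

Because the pattern has exactly one capture group, each item returned by `findall` is already a string, so the `''.join(...)` step is harmless but not strictly needed.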
Answer 1 (score: 0)
If I understand correctly, what you want to do is remove the <emotion> </emotion> <LOCATION></LOCATION> tags? If so, you can do it like this:
import re
text = "Cello is a <emotion> wonderful </emotion> parakeet who lives in <LOCATION> Omaha </LOCATION>. He is the <emotion> best </emotion> singer I have ever heard."
out = open('out.txt', 'w')
def remove_xml_tags(xml):
    content = re.compile(r'<.*?>')
    return content.sub('', xml)
data = remove_xml_tags(text)
out.write(data + '\n')
out.close()
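Removing the tags this way leaves doubled spaces and a stray space before the period where each tag used to be. The following is a rough sketch of one way to tidy that up, assuming the leftover whitespace is unwanted; `strip_tags` is a hypothetical helper name, not part of the answer's code.

```python
import re

TAG = re.compile(r'<[^>]*>')

def strip_tags(s):
    """Remove tags, then clean up the whitespace they leave behind."""
    no_tags = TAG.sub('', s)
    # Collapse runs of whitespace into single spaces.
    collapsed = re.sub(r'\s+', ' ', no_tags).strip()
    # Drop any space left directly before punctuation.
    return re.sub(r'\s+([.,;!?])', r'\1', collapsed)

sample = ("Cello is a <emotion> wonderful </emotion> parakeet who lives in "
          "<LOCATION> Omaha </LOCATION>.")
cleaned = strip_tags(sample)
print(cleaned)
```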
Answer 2 (score: 0)
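I just found that it's possible to bypass regex altogether. To find (and print) all sentences that contain two identified classes of markup, you can use a simple for loop. In case it helps others who find themselves where I was, I'll post my code: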
# read in your file
f = open('sampleinput.txt', 'r')
# use read method to convert the read data object into string
readfile = f.read()
#########################
# now use the replace() method to clean data
#########################
# replace all \n with " "
nolinebreaks = readfile.replace('\n', ' ')
# replace all commas with ""
nocommas = nolinebreaks.replace(',', '')
# replace all ? with .
noquestions = nocommas.replace('?', '.')
# replace all ! with .
noexclamations = noquestions.replace('!', '.')
# replace all ; with .
nosemicolons = noexclamations.replace(';', '.')
######################
# now use replace() to get rid of periods that don't end sentences
######################
# replace all Mr. with Mr
nomisters = nosemicolons.replace('Mr.', 'Mr')
#replace 'Mrs.' with 'Mrs' etc.
cleantext = nomisters
#now, having cleaned the input, find all sentences that contain your two target words. To find markup, just replace "Toby" and "pipe" with <markupclassone> and <markupclasstwo>
periodsplit = cleantext.split('.')
for x in periodsplit:
    if 'Toby' in x and 'pipe' in x:
        print(x)
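The same split-and-filter idea can be wrapped in a function and pointed at the tags from the question. This is only a sketch under some assumptions: `sentences_with_all` is a hypothetical name, the input is a string rather than a file, the tag check is case-sensitive, and the abbreviation cleanup ("Mr." and so on) from the code above is left out for brevity.

```python
def sentences_with_all(text, markers):
    """Return the sentences (split on '.') that contain every marker."""
    # Treat ?, ! and ; as sentence ends, as in the code above.
    for ch in '?!;':
        text = text.replace(ch, '.')
    text = text.replace('\n', ' ')
    sentences = [s.strip() for s in text.split('.')]
    # Re-attach the period that split() removed.
    return [s + '.' for s in sentences
            if s and all(m in s for m in markers)]

sample = ("Cello is a <emotion> wonderful </emotion> parakeet who lives in "
          "<LOCATION> Omaha </LOCATION>. He is the <emotion> best </emotion> "
          "singer I have ever heard.")
result = sentences_with_all(sample, ['<emotion>', '<LOCATION>'])
for s in result:
    print(s)
```

Only the first sentence of the sample contains both tags, so it is the only one printed.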