Question

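I want to find and extract all words from a large file that are surrounded by a specific context. Every line in the file looks like this; only the word between > and <\w> differs: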

<="UO" lemma="|" lex="|" sense="|" prefix="|" suffix="|" compwf="|" complemgram="|" ref="05" dephead="04" deprel="ET">and<\w>

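I want the output to be just 'and'. So basically I want to extract every string (words, punctuation and numbers) that appears in the context >xxx<\w>. I have tried many different alternatives with grep and regex, but I get either all the words or the pattern > and <\w> itself. For the whole file, I would like the output to look like this: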

and 
we
appreciate
this
very 
much
.

and so on...

Answer 1

You can use a pattern like this. It matches anything between > and <\w>.

import re
pat = re.compile(r'>(.*?)<\\w>')
pat.findall(input_string)
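
Since the question mentions a large file, here is a minimal sketch of applying the same pattern one line at a time, so the whole file never has to be held in memory as a single `input_string`. The names `extract_words` and `extract_file`, the `path` argument and the UTF-8 encoding are assumptions made for illustration, and the sample lines are abbreviated versions of the line in the question.

```python
import re

# Same pattern as above: capture whatever sits between '>' and the literal '<\w>'.
TOKEN = re.compile(r'>(.*?)<\\w>')

def extract_words(lines):
    """Yield every captured token from an iterable of lines."""
    for line in lines:
        yield from TOKEN.findall(line)

def extract_file(path):
    # 'path' is a placeholder; iterating the file object reads one line at a time.
    # UTF-8 is assumed here, adjust it to the real encoding of the file.
    with open(path, encoding="utf-8") as handle:
        yield from extract_words(handle)

# Abbreviated sample lines (attributes shortened for readability).
sample = [
    r'<="UO" lemma="|" ref="05" dephead="04" deprel="ET">and<\w>',
    r'<="UO" lemma="|" ref="05" dephead="04" deprel="ET">.<\w>',
]
print("\n".join(extract_words(sample)))
```

Printing each token on its own line gives the one-word-per-line layout asked for in the question.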

Answer 2

OK. Given an input file with the following content (I hope I have understood your use case):

<="UO" lemma="|" lex="|" sense="|" prefix="|" suffix="|" compwf="|" complemgram="|" ref="05" dephead="04" deprel="ET">and<\w>
<="UO" lemma="|" lex="|" sense="|" prefix="|" suffix="|" compwf="|" complemgram="|" ref="05" dephead="04" deprel="ET">we<\w>
<="UO" lemma="|" lex="|" sense="|" prefix="|" suffix="|" compwf="|" complemgram="|" ref="05" dephead="04" deprel="ET">appreciate<\w>
<="UO" lemma="|" lex="|" sense="|" prefix="|" suffix="|" compwf="|" complemgram="|" ref="05" dephead="04" deprel="ET">this<\w>
<="UO" lemma="|" lex="|" sense="|" prefix="|" suffix="|" compwf="|" complemgram="|" ref="05" dephead="04" deprel="ET">very<\w>
<="UO" lemma="|" lex="|" sense="|" prefix="|" suffix="|" compwf="|" complemgram="|" ref="05" dephead="04" deprel="ET">much<\w>
<="UO" lemma="|" lex="|" sense="|" prefix="|" suffix="|" compwf="|" complemgram="|" ref="05" dephead="04" deprel="ET">.<\w>

The following Python regular expression will work for you:

>>> import re
>>> pat = re.compile(r'(?<=">)(.*)(?=<\\w>)')
>>> pat.findall(input_string)
['and', 'we', 'appreciate', 'this', 'very', 'much', '.']
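
One caveat worth checking against your data: `(.*)` is greedy, so the pattern above relies on there being exactly one token per line. The sketch below compares it with a non-greedy variant on a made-up line that holds two tokens; the line itself and the variable names are hypothetical and not taken from the question.

```python
import re

greedy = re.compile(r'(?<=">)(.*)(?=<\\w>)')   # pattern from this answer
lazy = re.compile(r'(?<=">)(.*?)(?=<\\w>)')    # same, with a non-greedy quantifier

# Hypothetical input: two tokens on a single line.
line = r'<a="x">and<\w><b="y">we<\w>'

greedy_result = greedy.findall(line)  # one match running up to the last <\w>
lazy_result = lazy.findall(line)      # one match per token

print(greedy_result)
print(lazy_result)
```

If every line of the file really contains a single token, both variants return the same result, so the non-greedy form is simply the safer default.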
