I want to use RegexpParser to chunk all consecutive, overlapping nouns from text. For example, I have the following tagged text:
[('APPLE', 'NN'), ('BANANA', 'NN'), ('GRAPE', 'NN'), ('PEAR', 'NN')]
and I want to extract:
['APPLE BANANA', 'BANANA GRAPE', 'GRAPE PEAR']
I tried the following grammar, using a lookahead so the matched consecutive nouns are not consumed, but it doesn't work:
"CONSEC_NOUNS: {(?=(<NN>{2}))}"
Is there any way to do this?
Edit: code
import nltk
extract = []
grammar = "CONSEC_NOUNS: {(?=(<NN>{2}))}"
cp = nltk.RegexpParser(grammar)
result = cp.parse([('APPLE', 'NN'), ('BANANA', 'NN'), ('GRAPE', 'NN'), ('PEAR', 'NN')])
for elem in result:
    if type(elem) == nltk.tree.Tree:
        extract.append(' '.join([pair[0] for pair in elem.leaves()]))
print(extract)
# prints [], but I want to get ['APPLE BANANA', 'BANANA GRAPE', 'GRAPE PEAR']
Answer 0 (score: 0)

You can pull the word pairs out of the string representation of your tagged list with this regex:

(?<=\()'([^']*)'(?=.*?\('([^']*)')
import re

# Run the regex over the string representation of the tagged list:
# group 1 is the word right after an opening parenthesis,
# group 2 is the next such word found by the lookahead.
r = re.compile(r"(?<=\()'([^']*)'(?=.*?\('([^']*)')")
s = "[('APPLE', 'NN'), ('BANANA', 'NN'), ('GRAPE', 'NN'), ('PEAR', 'NN')]"
for m in re.finditer(r, s):
    print(m.group(1) + ' ' + m.group(2))
# APPLE BANANA
# BANANA GRAPE
# GRAPE PEAR
How the regex works:

(?<=\()              Positive lookbehind that asserts the match is preceded by a literal (
'                    Match ' literally
([^']*)              Capture any characters other than ' into capture group 1
'                    Match ' literally
(?=.*?\('([^']*)')   Positive lookahead that asserts what follows can be matched:
    .*?              match any character any number of times, but as few as possible
    \('              match (' literally
    ([^']*)          capture any characters other than ' into capture group 2
    '                match ' literally
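Because the second word is only captured inside the lookahead, it is never consumed, so the next match can start on it again; that is what makes the pairs overlap. A minimal sketch of the same effect, with a simplified made-up pattern rather than the one above:

import re

# The lookahead captures the following character without consuming it,
# so finditer can still start the next match on that same character.
print([m.groups() for m in re.finditer(r"(\w)(?=(\w))", "ABCD")])
# [('A', 'B'), ('B', 'C'), ('C', 'D')]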
Answer 1 (score: 0)
RegexpParser only produces non-overlapping chunks; a short sketch of that behaviour is included below. I got the following solution using NLTK's bigrams.
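Here is a minimal sketch of the non-overlapping behaviour, using the question's sample tokens and a plain two-noun pattern (the demo_ names are just illustrative):

import nltk

# A plain {<NN>{2}} grammar consumes the tags left to right,
# so every noun ends up in at most one chunk and overlaps never occur.
demo_cp = nltk.RegexpParser("CONSEC_NOUNS: {<NN>{2}}")
demo_tree = demo_cp.parse([('APPLE', 'NN'), ('BANANA', 'NN'), ('GRAPE', 'NN'), ('PEAR', 'NN')])
print([' '.join(w for w, t in st.leaves())
       for st in demo_tree.subtrees() if st.label() == 'CONSEC_NOUNS'])
# ['APPLE BANANA', 'GRAPE PEAR']  -- 'BANANA GRAPE' is never produced

That is why the bigram step in the solution below is needed.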
First, I modified the grammar to match any 2 or more consecutive nouns. Then I created bigrams from the result.
import nltk
grammar = "CONSEC_NOUNS: {<NN>{2,}}" # match 2 or more nouns
cp = nltk.RegexpParser(grammar)
result = cp.parse([('APPLE', 'NN'), ('BANANA', 'NN'), ('GRAPE', 'NN'), ('PEAR', 'NN'), ('GO', 'VB'),
                   ('ORANGE', 'NN'), ('STRAWBERRY', 'NN'), ('MELON', 'NN')])
# keep only the leaves of the CONSEC_NOUNS chunks
leaves = [chunk.leaves() for chunk in result if ((type(chunk) == nltk.tree.Tree) and chunk.label() == 'CONSEC_NOUNS')]
# turn each chunk's words into overlapping bigrams, then flatten and join
noun_bigram_groups = [list(nltk.bigrams([w for w, t in leaf])) for leaf in leaves]
extract = [' '.join(nouns) for group in noun_bigram_groups for nouns in group]
print(extract)
# ['APPLE BANANA', 'BANANA GRAPE', 'GRAPE PEAR', 'ORANGE STRAWBERRY', 'STRAWBERRY MELON']
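For reference, nltk.bigrams just walks a sequence and yields each adjacent pair, which is what turns every long noun chunk into the overlapping two-word strings above:

import nltk

# bigrams() yields consecutive pairs from any sequence.
print(list(nltk.bigrams(['ORANGE', 'STRAWBERRY', 'MELON'])))
# [('ORANGE', 'STRAWBERRY'), ('STRAWBERRY', 'MELON')]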