Question

I have the following two POS-tagged strings:

Sent1: "something like how writer pro or phraseology works would be really cool."

[('something', 'NN'), ('like', 'IN'), ('how', 'WRB'), ('writer', 'NN'), ('pro', 'NN'), ('or', 'CC'), ('phraseology', 'NN'), ('works', 'NNS'), ('would', 'MD'), ('be', 'VB'), ('really', 'RB'), ('cool', 'JJ'), ('.', '.')]

Sent2: "more options like the syntax editor would be nice"

[('more', 'JJR'), ('options', 'NNS'), ('like', 'IN'), ('the', 'DT'), ('syntax', 'NN'), ('editor', 'NN'), ('would', 'MD'), ('be', 'VB'), ('nice', 'JJ')]

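I am looking for a way to detect (return True) whether these strings contain the sequence "would" + "be" + adjective, regardless of where the adjective is, as long as it comes after "would" "be". In the second string the adjective "nice" immediately follows "would be", but that is not the case in the first string.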

The trivial case (no other word before the adjective; "would be nice") was solved in an earlier question of mine: detecting POS tag pattern along with specified words

I am now looking for a more general solution, in which optional words may occur before the adjective. I am new to NLTK and Python.

Answer 1

First install nltk_cli by following the instructions: https://github.com/alvations/nltk_cli

Then, here is a secret function in nltk_cli that you may find useful:

alvas@ubi:~/git/nltk_cli$ cat infile.txt 
something like how writer pro or phraseology works would be really cool .
more options like the syntax editor would be nice
alvas@ubi:~/git/nltk_cli$ python senna.py --chunk2 VP+ADJP infile.txt 
would be    really cool
would be    nice

To illustrate other possible usages:

alvas@ubi:~/git/nltk_cli$ python senna.py --chunk2 VP+VP infile.txt 
!!! NO CHUNK of VP+VP in this sentence !!!
!!! NO CHUNK of VP+VP in this sentence !!!
alvas@ubi:~/git/nltk_cli$ python senna.py --chunk2 NP+VP infile.txt 
how writer pro or phraseology works would be
the syntax editor   would be
alvas@ubi:~/git/nltk_cli$ python senna.py --chunk2 VP+NP infile.txt 
!!! NO CHUNK of VP+NP in this sentence !!!
!!! NO CHUNK of VP+NP in this sentence !!!

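Then, if you want to check whether the phrase occurs in a sentence and output True / False, simply read and iterate through the output of nltk_cli and test each line with an if-else condition.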
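A minimal sketch of that check, assuming the `--chunk2` output has already been captured as text, prints one line per input sentence, and marks a sentence without a match with a line starting with `!!! NO CHUNK`, as in the transcript above. The name `chunk_found` is made up for illustration and is not part of nltk_cli.

```python
def chunk_found(output_text):
    """Return one boolean per non-empty output line: True if a chunk was printed."""
    results = []
    for line in output_text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # nltk_cli prints a "!!! NO CHUNK ..." line when nothing matched
        results.append(not line.startswith("!!! NO CHUNK"))
    return results

# Output copied from the VP+ADJP run shown above
captured = """would be    really cool
would be    nice"""
print(chunk_found(captured))  # [True, True]
```

Note that this only tells you that some VP+ADJP chunk was found; to be sure the verb phrase really is "would be", you would also have to compare the first column of each line.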

Answer 2

Does this help?

s1=[('something', 'NN'), ('like', 'IN'), ('how', 'WRB'), ('writer', 'NN'), ('pro', 'NN'), ('or', 'CC'), ('phraseology', 'NN'), ('works', 'NNS'), ('would', 'MD'), ('be', 'VB'), ('really', 'RB'), ('cool', 'JJ'), ('.', '.')]

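flag = False  # becomes True once "would be" has been seen
for i, j in zip(s1[:-1], s1[1:]):
    if i[0] + " " + j[0] == "would be":
        flag = True
    # after "would be", look for an adjective in the current pair
    if flag and (i[-1] == "JJ" or j[-1] == "JJ"):
        print("would be adjective found in the tagged string")
        break  # report the match only once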

Answer 3

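It looks like you just need to search the consecutive tags for "would" followed by "be", and then search for the first instance of the tag "JJ" after that. Something like this: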

import nltk

def has_would_be_adj(S):
    # tokenize (this splits off punctuation such as the final "."), then make pos tags
    pos = nltk.pos_tag(nltk.word_tokenize(S))
    # Search consecutive tags for "would", "be"
    j = None  # index of found "would"
    for i, (x, y) in enumerate(zip(pos[:-1], pos[1:])):
        if x[0] == "would" and y[0] == "be":
            j = i
            break
    if j is None or len(pos) < j + 2:
        return False
    a = None  # index of found adjective
    for i, (word, tag) in enumerate(pos[j + 2:]):
        if tag == "JJ":
            a = i + j + 2  # position of the adjective within pos
            break
    if a is None:
        return False
    print("Found adjective {} at {}".format(pos[a], a))
    return True

S = "something like how writer pro or phraseology works would be really cool."
print(has_would_be_adj(S))

I'm sure this could be written more compactly and cleanly, but it does what it says on the box :)
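One possible more compact variant, as a sketch only: it works directly on an already tagged list of (word, tag) pairs like the ones in the question, so it does not call the tagger. The name `would_be_then_adj` is hypothetical, only the tag `JJ` counts as an adjective, and unlike the function above it lowercases the words before comparing.

```python
def would_be_then_adj(tagged):
    """True if "would", "be" occur consecutively and a JJ tag follows later."""
    words = [word.lower() for word, _ in tagged]
    for i in range(len(words) - 1):
        if words[i] == "would" and words[i + 1] == "be":
            # any adjective after the pair counts, however many words lie between
            return any(tag == "JJ" for _, tag in tagged[i + 2:])
    return False

sent1 = [('something', 'NN'), ('like', 'IN'), ('how', 'WRB'), ('writer', 'NN'),
         ('pro', 'NN'), ('or', 'CC'), ('phraseology', 'NN'), ('works', 'NNS'),
         ('would', 'MD'), ('be', 'VB'), ('really', 'RB'), ('cool', 'JJ'), ('.', '.')]
sent2 = [('more', 'JJR'), ('options', 'NNS'), ('like', 'IN'), ('the', 'DT'),
         ('syntax', 'NN'), ('editor', 'NN'), ('would', 'MD'), ('be', 'VB'), ('nice', 'JJ')]

print(would_be_then_adj(sent1), would_be_then_adj(sent2))  # True True
```

Stopping at the first "would be" is enough here: if no adjective follows the first pair, none can follow a later one either.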

Answer 4

from itertools import tee, dropwhile
import nltk

def check_sentence(S):
    def pairwise(iterable):
        "s -> (s0,s1), (s1,s2), (s2, s3), ..."
        a, b = tee(iterable)
        next(b, None)
        return zip(a, b)

    def consecutive_would_be(word_group):
        # True while the pair is NOT ("would", "be"), so dropwhile keeps skipping
        first, second = word_group
        (would_word, _) = first
        (be_word, _) = second
        return would_word.lower() != "would" or be_word.lower() != "be"

    tagged = nltk.pos_tag(nltk.word_tokenize(S))
    # iterate over the pairs starting at ("would", "be") and look for an adjective
    for word_groups in dropwhile(consecutive_would_be, pairwise(tagged)):
        first, second = word_groups
        (_, pos1) = first
        (_, pos2) = second
        if pos1 == "JJ" or pos2 == "JJ":
            return True
    return False

You can then use the function as follows:

S = "more options like the syntax editor would be nice."  
check_sentence(S)

Answer 5

Check this StackOverflow link

import nltk
from nltk.tokenize import word_tokenize
def would_be(tagged):
    # True only for the trivial case: "would", "be" directly followed by a JJ tag
    return any(['would', 'be', 'JJ'] == [tagged[i][0], tagged[i+1][0], tagged[i+2][1]] for i in range(len(tagged) - 2))

S = "more options like the syntax editor would be nice."  
pos = nltk.pos_tag(word_tokenize(S)) 
would_be(pos)

Also check this code:

from nltk.tokenize import word_tokenize
import nltk
def checkTag(S):
    pos = nltk.pos_tag(word_tokenize(S))
    # find the first consecutive "would", "be" pair
    for ind in range(len(pos) - 1):
        if pos[ind][0] == 'would' and pos[ind + 1][0] == 'be':
            # True only if an adjective occurs somewhere after "would be"
            return any(tag == 'JJ' for _, tag in pos[ind + 2:])
    return False

S = "something like how writer pro or phraseology works would be really cool."
print(checkTag(S))
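If you prefer to express the whole pattern in one place, a regular expression over the tagged sentence is another option. This is only a sketch: the function name `matches_would_be_adj` and the "word/TAG" encoding are my own choices for illustration, and it takes an already tagged list so that it runs without the tagger.

```python
import re

def matches_would_be_adj(tagged):
    # Encode the sentence as "word/TAG word/TAG ..." so one regex can scan it
    encoded = " ".join("{}/{}".format(word.lower(), tag) for word, tag in tagged)
    # "would" then "be", then any number of tokens, then a token tagged exactly JJ
    pattern = r"(?:^| )would/\S+ be/\S+(?: \S+)*? \S+/JJ(?= |$)"
    return re.search(pattern, encoded) is not None

tagged = [('more', 'JJR'), ('options', 'NNS'), ('like', 'IN'), ('the', 'DT'),
          ('syntax', 'NN'), ('editor', 'NN'), ('would', 'MD'), ('be', 'VB'), ('nice', 'JJ')]
print(matches_would_be_adj(tagged))  # True
```

The `(?: \S+)*?` part is what allows optional words before the adjective; replacing it with something like `(?: \S+/RB)*` would accept only adverbs in between.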

Matching a sequence of POS tags and words
