Named entity recognition in Python

Time: 2015-12-16 22:57:11

Tags: python nlp entity named

What I want to do: extract every occurrence of n consecutive words that each start with an uppercase letter.

Input: ("Does John Doe eat pizza in New York?", 2)
Output: [("Does", "John"), ("John", "Doe"), ("New", "York")]

Input: ("Does John Doe eat pizza in New York?", 3)
Output: [("Does", "John","Doe")]

This is what I have come up with so far:

# create text file
fw = open("ngram.txt", "w")
fw.write ("Does John Doe eat pizza in New York?")
fw.close()

def UpperCaseNGrams (file,n):
    fr = open (file, "r")
    text = fr.read().split()

    ngramlist = [text[word:word+n] for word in range(len(text)-(n-1)) if word[0].isupper() if word+n[0].isupper()]  
    return ngramlist

print (UpperCaseNGrams("ngram.txt",2))

I get the following error:
TypeError: 'int' object is not subscriptable

What do I need to change to make it work?

1 answer:

Answer 0: (score: 1)

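In word+n[0].isupper(), word and n are both of type int, so neither can be indexed with []; integers are not subscriptable. The same applies to word[0] in the first condition.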
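A minimal reproduction of the error may make this clearer. The values of word and n below are illustrative stand-ins for the loop variables in the question, not taken from it. Note also that indexing binds more tightly than addition, so word+n[0] is parsed as word + (n[0]).

```python
word = 0  # loop index as produced by range(): an int (illustrative value)
n = 2     # n-gram size: also an int (illustrative value)

try:
    word[0]  # an int cannot be indexed with []
except TypeError as exc:
    message = str(exc)
    print(message)

# Indexing the list with the int works, and the resulting string can be indexed.
text = "Does John Doe".split()
first_letter = text[word][0]
print(first_letter)
```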

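I think your intention was to check whether the n-th word past the current one starts with a capital letter; that, however, would be written as text[word+n][0] (and the last word of the n-gram itself is text[word+n-1]). In any case, I don't think your approach works for values of n other than 2: if n is 3, for example, you need to check that every word from the current one up to the end of the n-gram is capitalized, not just the first and the last.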

The simplest fix is to use all() to check that every word in each sublist starts with an uppercase letter:

ngramlist = [text[word:word+n] for word in range(len(text)-(n-1))
                 if all(s[0].isupper() for s in text[word:word+n])]
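As a sketch, the comprehension above can be wrapped in a self-contained function. The name upper_case_ngrams_all is hypothetical, and the function takes a list of words rather than a file name, which is an assumption made to keep the example short.

```python
def upper_case_ngrams_all(words, n):
    """Return each window of n consecutive words in which every word starts uppercase."""
    return [
        tuple(words[i:i + n])
        for i in range(len(words) - (n - 1))
        if all(w[0].isupper() for w in words[i:i + n])
    ]


text = "Does John Doe eat pizza in New York?".split()
print(upper_case_ngrams_all(text, 2))
print(upper_case_ngrams_all(text, 3))
```

Each window of n words is re-checked in full, so a word is examined up to n times. That is fine for short inputs.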

If you want something a bit faster, you can do something like this to group runs of capitalized words together:

from itertools import groupby

text = 'Does John Doe eat pizza in New York?'.split()
caps_words = [list(v) for g,v in groupby(text, key=lambda x: x[0].isupper()) if g]
print(caps_words)

This outputs

[['Does', 'John', 'Doe'], ['New', 'York?']]

Now you need to extract the sublists of length n from each run:

ngrams = []
n = 2
for run in caps_words:
    ngrams.extend(run[i:i+n] for i in range(len(run)-(n-1)))

With n = 2, this yields the following ngrams:

[['Does', 'John'], ['John', 'Doe'], ['New', 'York?']]

And for n = 3:

[['Does', 'John', 'Doe']]

Putting all of this together (and converting the ngram accumulator into a list comprehension) yields a function like this:

from itertools import groupby

def upper_case_ngrams(words, n):
    caps_words = [list(v) for g,v in groupby(words, key=lambda x: x[0].isupper()) if g]
    return [tuple(run[i:i+n]) for run in caps_words
                for i in range(len(run)-(n-1))]

text = 'Does John Doe eat pizza in New York?'.split()
for n in range(1, 5):
    print(upper_case_ngrams(text, n))

Output

[('Does',), ('John',), ('Doe',), ('New',), ('York?',)]
[('Does', 'John'), ('John', 'Doe'), ('New', 'York?')]
[('Does', 'John', 'Doe')]
[]
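The output above keeps the trailing question mark in 'York?', whereas the expected output in the question shows "York". One possible way to handle this, sketched here as an assumption rather than as part of the original answer, is to strip leading and trailing punctuation from each token before grouping. The name upper_case_ngrams_clean is hypothetical.

```python
import string
from itertools import groupby


def upper_case_ngrams_clean(text, n):
    # Strip leading/trailing ASCII punctuation from every whitespace-separated token.
    words = [w.strip(string.punctuation) for w in text.split()]
    # Drop tokens that were punctuation only (now empty), so w[0] below is always safe.
    words = [w for w in words if w]
    # Group consecutive words by whether they start with an uppercase letter.
    runs = [list(v) for is_cap, v in groupby(words, key=lambda w: w[0].isupper()) if is_cap]
    # Slide a window of length n over each run of capitalized words.
    return [tuple(run[i:i + n]) for run in runs for i in range(len(run) - (n - 1))]


result = upper_case_ngrams_clean("Does John Doe eat pizza in New York?", 2)
print(result)
```

Because punctuation-only tokens are dropped, capitalized words separated only by such a token (for example a standalone dash) end up in the same run. Whether that is desirable depends on the use case.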