What I want to do: extract every occurrence of n consecutive words that each start with an uppercase letter.
Input: ("Does John Doe eat pizza in New York?", 2)
Output: [("Does", "John"), ("John", "Doe"), ("New", "York")]
Input: ("Does John Doe eat pizza in New York?", 3)
Output: [("Does", "John","Doe")]
This is what I have come up with so far:
# create text file
fw = open("ngram.txt", "w")
fw.write ("Does John Doe eat pizza in New York?")
fw.close()

def UpperCaseNGrams (file,n):
    fr = open (file, "r")
    text = fr.read().split()
    ngramlist = [text[word:word+n] for word in range(len(text)-(n-1)) if word[0].isupper() if word+n[0].isupper()]
    return ngramlist

print (UpperCaseNGrams("ngram.txt",2))
I get the following error:
TypeError: 'int' object is not subscriptable
What do I need to change to make it work?
Answer 0 (score: 1)
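In word+n[0].isupper(), word and n are both of type int, so they cannot be indexed with []; in other words, integers are not subscriptable. The same applies to word[0] in the first condition, since word is an index produced by range().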
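A minimal reproduction of the error, reusing the names from the question (word is a loop index of the kind range() produces, n is the n-gram size). It is only a sketch to show where the TypeError comes from and how indexing the list first avoids it:

```python
text = "Does John Doe eat pizza in New York?".split()
n = 2
word = 0  # an index as produced by range(), i.e. an int, not a string

try:
    word[0]  # subscripting an int raises TypeError
except TypeError as exc:
    message = str(exc)
    print(message)

# Indexing the list first yields a string, which can be subscripted.
first_letter = text[word][0]
print(first_letter.isupper())
```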
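I think your intent was to check whether the nth word past the current one starts with a capital letter; however, that would be done with text[word+n][0] (strictly, the last word of the slice text[word:word+n] is text[word+n-1]). In any case, I don't think your approach works for values of n other than 2. For example, if n is 3, you need to check that every word from the current word through the end of the n-gram is capitalized, not just the first and the last.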
The simplest fix is to use all() to check that every word in each sublist starts with an uppercase letter:
ngramlist = [text[word:word+n] for word in range(len(text)-(n-1))
             if all(s[0].isupper() for s in text[word:word+n])]
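For completeness, here is that comprehension dropped into a self-contained function. This is a sketch, not the asker's exact code: the name upper_case_ngrams_all is hypothetical, and it takes the text as a string instead of reading a file so that it runs on its own.

```python
def upper_case_ngrams_all(text, n):
    """Return every run of n consecutive words that all start with an uppercase letter."""
    # str.split() with no arguments never yields empty strings, so w[0] is safe.
    words = text.split()
    return [
        tuple(words[i:i + n])
        for i in range(len(words) - (n - 1))
        if all(w[0].isupper() for w in words[i:i + n])
    ]


sentence = "Does John Doe eat pizza in New York?"
print(upper_case_ngrams_all(sentence, 2))
print(upper_case_ngrams_all(sentence, 3))
```

Note that the trailing question mark stays attached to the last word, because the text is split on whitespace only.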
If you want something a bit faster, you can group consecutive capitalized words together like this:
from itertools import groupby
text = 'Does John Doe eat pizza in New York?'.split()
caps_words = [list(v) for g,v in groupby(text, key=lambda x: x[0].isupper()) if g]
print(caps_words)
This outputs
[['Does', 'John', 'Doe'], ['New', 'York?']]
Now you need to extract the sublists of length n from each run:
ngrams = []
n = 2
for run in caps_words:
    ngrams.extend(run[i:i+n] for i in range(len(run)-(n-1)))
which produces ngrams:
[['Does', 'John'], ['John', 'Doe'], ['New', 'York?']]
and with n = 3:
[['Does', 'John', 'Doe']]
Putting all of this together (and turning the ngram accumulator into a list comprehension) gives a function like this:
from itertools import groupby

def upper_case_ngrams(words, n):
    caps_words = [list(v) for g,v in groupby(words, key=lambda x: x[0].isupper()) if g]
    return [tuple(run[i:i+n]) for run in caps_words
            for i in range(len(run)-(n-1))]

text = 'Does John Doe eat pizza in New York?'.split()
for n in range(1, 5):
    print(upper_case_ngrams(text, n))
Output
[('Does',), ('John',), ('Doe',), ('New',), ('York?',)]
[('Does', 'John'), ('John', 'Doe'), ('New', 'York?')]
[('Does', 'John', 'Doe')]
[]
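The results above keep the trailing "?" on 'York?', whereas the expected output in the question shows "York". If that matters, one possible tweak is to strip leading and trailing punctuation from each token before grouping. This is only a sketch under that assumption: upper_case_ngrams_clean is a hypothetical name, and string.punctuation covers ASCII punctuation only.

```python
import string
from itertools import groupby


def upper_case_ngrams_clean(text, n):
    """Like the groupby version, but with surrounding ASCII punctuation stripped."""
    # Strip punctuation from both ends of each token; drop tokens that end up empty.
    words = [w.strip(string.punctuation) for w in text.split()]
    words = [w for w in words if w]
    # Keep only the runs of consecutive words that start with an uppercase letter.
    runs = [list(v) for is_cap, v in groupby(words, key=lambda w: w[0].isupper()) if is_cap]
    # Slide a window of length n over each run.
    return [tuple(run[i:i + n]) for run in runs for i in range(len(run) - (n - 1))]


print(upper_case_ngrams_clean("Does John Doe eat pizza in New York?", 2))
```

One caveat: because punctuation is discarded, capitalized words on either side of a sentence boundary still end up in the same run, just as they do in the version above.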