Question

I want to count the occurrences of 'america' and 'citizen' in the 'inaugural' corpus, in the files whose names start with 1789 and 1793.

cfd = nltk.ConditionalFreqDist(
           (target, fileid[:4])
               for fileid in inaugural.fileids()
               for w in inaugural.words(fileid)
               for target in ['america', 'citizen']
               if w.lower().startswith(target))

year = ['1789', '1793']
word = ['america', 'citizen']
cfd.tabulate(conditions=year, samples=word)

It doesn't count the words correctly. What is the problem? Note: I want to show 'america' and 'citizen' as columns and the years as rows. My output:

    america citizen 
1789    0    0 
1793    0    0

Answer 1

Here is the approach: you can use the string count function;

print (mystring.count("specificword"))

Demo;

mystring = "hey hey hi hello hey hello hi"
print (mystring.count("hey"))

>>> 
3
>>>

The rest is up to you. Displaying the counts as a table is mostly a matter of formatting them with the print function. Another demo;

mystring = "hey hey hi hello hey hello hi"

a = mystring.count("hey")
b = mystring.count("hi")
c = mystring.count("hello")

obj = """hey: {}
hi: {}
hello {}"""

print (obj.format(a,b,c))

Output;

>>> 
hey: 3
hi: 2
hello 2
>>>
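
One caveat with str.count is that it counts substrings rather than whole words, so "hi" would also match inside "this". A minimal sketch of a whole-word count follows, assuming a regex word boundary is an acceptable definition of a word. The helper name count_word and the sample string are made up for illustration.

```python
import re

def count_word(text, word):
    """Count whole-word, case-insensitive occurrences of `word` in `text`."""
    # \b anchors the match at word boundaries, so "hi" does not match "this"
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

mystring = "this hi is hidden hi"
substring_hits = mystring.count("hi")   # also matches inside "this" and "hidden"
word_hits = count_word(mystring, "hi")  # standalone "hi" tokens only
print(substring_hits, word_hits)
```

Whether substring or whole-word matching is the right choice depends on the data; the question itself matches by prefix, which sits somewhere in between.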

Answer 2

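You can use nltk.word_tokenize to create a list of words, and then use collections.Counter to build a dictionary whose keys are the words and whose values are their frequencies:

from collections import Counter
import nltk

# `file` is the path to a plain-text file
with open(file) as f:
        # Lowercase the text, split it into word tokens, and count each token
        C = Counter(nltk.word_tokenize(f.read().lower()))
        B = ['america', 'citizen']
        for i in B:
            print(C[i])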
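
Note that a Counter looks words up by exact match, whereas the question matches by prefix (so 'citizens' counts toward 'citizen'). A rough sketch of prefix counting on top of a Counter is shown here, assuming a plain whitespace split is good enough as a tokenizer. The count_prefixes helper and the sample sentence are invented for illustration.

```python
from collections import Counter

def count_prefixes(text, prefixes):
    """Return {prefix: number of tokens in `text` that start with it}."""
    # Whitespace split is a rough stand-in for a real tokenizer (assumption)
    tokens = Counter(text.lower().split())
    return {p: sum(n for tok, n in tokens.items() if tok.startswith(p))
            for p in prefixes}

sample = "Citizens of America and every American citizen"
result = count_prefixes(sample, ["america", "citizen"])
print(result)
```

With real text, punctuation attached to words would need handling, which is one reason to prefer a proper tokenizer over split().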

Answer 3

Your conditions and samples are in the wrong order: the ConditionalFreqDist constructor expects (condition, sample) pairs, but you are giving it (sample, condition). Try:

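import nltk
from nltk.corpus import inaugural

# Each pair must be (condition, sample): here (year, target word)
cfd = nltk.ConditionalFreqDist(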
           (fileid[:4], target)
               for fileid in inaugural.fileids()
               for w in inaugural.words(fileid)
               for target in ['america', 'citizen']
               if w.lower().startswith(target))

A = ['1789', '1793']
B = ['america', 'citizen']
cfd.tabulate(conditions=A, samples=B)

Output

     america citizen 
1789    2    5 
1793    1    1
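
To see why the pair order matters, here is a small stand-in built from the standard library rather than NLTK. It groups (condition, sample) pairs with the first element as the row key, which is the behaviour described above. The pairs and the group_pairs helper are invented for illustration and are not real corpus counts.

```python
from collections import Counter, defaultdict

def group_pairs(pairs):
    """Group (condition, sample) pairs into {condition: Counter of samples}."""
    table = defaultdict(Counter)
    for condition, sample in pairs:
        table[condition][sample] += 1
    return table

# Made-up pairs, not real corpus counts
right = group_pairs([("1789", "america"), ("1789", "citizen"), ("1793", "citizen")])
wrong = group_pairs([("america", "1789"), ("citizen", "1789"), ("citizen", "1793")])

print(right["1789"]["citizen"])  # year is the condition, so the lookup finds it
print(wrong["1789"]["citizen"])  # year was stored as a sample, so this is 0
```

With the pairs reversed, asking for condition '1789' finds no such row, which is consistent with the all-zero table in the question.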

More generally, you would want to use a stemmer, which gives something like:

from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')
cfd = nltk.ConditionalFreqDist(
       (fileid[:4], stemmer.stem(word))
           for fileid in inaugural.fileids()
           for word in inaugural.words(fileid))

A = ['2009', '2005']
B = [stemmer.stem(i) for i in ['freedom', 'war']]
cfd.tabulate(conditions=A, samples=B)

which produces the output

     freedom  war 
2009    3    2
2005   27    0

Counting the number of occurrences of specified words
