Question

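I want to build TF vectors for a word sense disambiguation project. I have several files, one per ambiguous word; each line of a file contains a Persian sentence, a tab, and then an English word (which indicates the sense of the ambiguous word). I extracted the 1000 most frequent words of the file, and I have to build the TF vectors from these words. (The TF formula is: TF_ij = log(1 + n_ij).) So the output should be a file with 1000 numeric columns plus one column for the English word (1001 columns).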
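As a small worked example of the formula above, the sketch below applies log(1 + n) to a few counts. The function name tf_weight is hypothetical, and the natural logarithm is assumed because the code in this question calls math.log with a single argument.

```python
import math

def tf_weight(count):
    """Return log(1 + count) using the natural logarithm."""
    return math.log(1 + count)

# A word that never occurs gets weight 0, because log(1 + 0) = 0.
for n in (0, 1, 3):
    print(n, round(tf_weight(n), 4))
```

Because the weight of a zero count is exactly 0, a word that does not occur needs no special case in the formula.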

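To do this, I first removed the stop words and punctuation. Then I wrote a function to extract the 1000 most frequent words, and then a function that counts the frequency of each word in the file, calculates TF_ij = log(1 + n_ij) (n_ij is the frequency of each word in the file), and stores the result in a dictionary ("TFvalues" in the code). The problem is the TF function in the code.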

1) For example, if my file has 10 lines, the output should also have 10 lines. But the code returns more than 10 lines. How can I correct this?

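2) To build the TF vector, for every word in each line of the file (I put the lines in the witoutStops list), I need to check whether the word is among the most frequent words and also in the TFvalues dictionary, and if so write its value from the TFvalues dictionary in place of the word. The TF function I wrote does not return the right answer. How can I change it?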

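3) As I said before, the output file should have 1000 numeric columns and one column of English words. Each line of my data file holds a sentence of roughly 5 to 9 words. For each most-frequent word that occurs in the line, the code should write its TF value from the TFvalues dictionary. The question is: for the rest of the 1000 most frequent words, which do not occur in the line, should I write "0"? Is that how TF vectors are built? Hint: I want to use Weka for classification, so every row must have the same number of values. All of them should have 1001 columns.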
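The three points above can be illustrated with a minimal sketch. It rests on two assumptions that the question does not confirm: n_ij is taken as the count of word j within line i (if a file-wide count is intended, the Counter would be built once over all lines instead), and the toy tokens and labels are invented for illustration. The names top_words, tf_row and samples are hypothetical. Each input line yields exactly one row, the vocabulary is walked once in a fixed order, and a vocabulary word that is absent from the line gets 0 because log(1 + 0) = 0.

```python
import math
from collections import Counter

def top_words(token_lines, k=1000):
    """Return the k most frequent words over all tokenized lines."""
    counter = Counter()
    for tokens in token_lines:
        counter.update(tokens)
    return [word for word, _ in counter.most_common(k)]

def tf_row(tokens, vocabulary):
    """Return one fixed-length row: log(1 + count) for every vocabulary word."""
    counts = Counter(tokens)  # a missing word has count 0
    return [math.log(1 + counts[word]) for word in vocabulary]

# Hypothetical toy data: (tokens of the sentence, English sense label).
samples = [
    (["gol", "sorkh", "gol"], "flower"),
    (["gol", "bazi"], "goal"),
]

vocabulary = top_words([tokens for tokens, _ in samples], k=3)
rows = [tf_row(tokens, vocabulary) + [label] for tokens, label in samples]

for row in rows:
    print(row)
```

With a vocabulary of 1000 words, every row built this way has 1000 numbers plus the label, so all rows have the same length regardless of how many words the sentence contains.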

Example of a binary ARFF file: https://www.dropbox.com/s/3ltepz1rk3ia9md/golTest.arff?dl=0

Example data file: https://www.dropbox.com/s/f4z8neslw3ht9e8/golTest.txt?dl=0

from hazm import *
from collections import Counter
import collections
import math

punctuation = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~،؟«؛«'

file1 = "stopwords.txt"
file2 = "golTest.txt"


witoutStops = []
corpuslines = []

def RemStopWords(file1, file2): 
    .
    .
    .  # omit stop words and punctuation

    #print (witoutStops)

def mostFreqWords():
    RemStopWords (file1, file2)
    with open ("TFFile.txt", "w", encoding="utf-8") as f:
        counter = Counter()
        for line in witoutStops:
            line = line.strip().split("\t")
            words = line[0].split()
            counter.update(words)
        top1000 = [word[0] for word in counter.most_common(1000)]
        return top1000

wordcount = {}
TFvalues = {}
def calculateValues():
    RemStopWords (file1, file2)
    for line in witoutStops:
        line = line.strip().split("\t")
        words = line[0].split()
        for word in words:
            if word not in wordcount:
                wordcount[word] = 1    # count the frequency of each word
            else:                      # in the file
                wordcount[word] += 1
            value = wordcount.get(word, 1) # calculate TF = log(1 + nij)
            result = math.log(1+value)
            TFvalues[word] = result

def TF():
    RemStopWords(file1, file2)
    mostfreq = mostFreqWords()
    calculateValues()
    for line in witoutStops:
        f =open ("abi.arff", "a", encoding = "utf-8") 
        line = line.split("\t")
        words = line[0].split()
        for word in words:
            for i in mostfreq:
                for k, v in TFvalues.items():
                    if word == k:
                        if any([i == word for word in words]):
                            value2 = TFvalues.get(word, 1)
                            f.write(str(value2,))
                        else:
                            f.write("0,")
            f.write(line[1])
TF()
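A few things in the code above may explain the extra output, though the elided body of RemStopWords makes this uncertain. RemStopWords is called in TF, in mostFreqWords and in calculateValues, so if it appends to the global witoutStops list, the lines would be stored more than once. In addition, TF writes inside three nested loops, so each line produces many more than 1000 values, and str(value2,) writes the number without a separating comma. The sketch below shows one possible way to write a Weka ARFF file with one data row per input line. The function name write_arff, the attribute names w0, w1, ... and the relation name are hypothetical, and it assumes the labels are plain tokens without spaces or commas.

```python
import io

def write_arff(out, relation, n_features, rows, labels):
    """Write a header with n_features numeric attributes plus a nominal class,
    then one comma-separated data row per (row, label) pair."""
    class_values = sorted(set(labels))
    out.write(f"@relation {relation}\n\n")
    for i in range(n_features):
        out.write(f"@attribute w{i} numeric\n")
    out.write("@attribute sense {" + ",".join(class_values) + "}\n\n")
    out.write("@data\n")
    for values, label in zip(rows, labels):
        out.write(",".join(f"{v:.4f}" for v in values) + "," + label + "\n")

# Hypothetical TF rows for two lines and a three-word vocabulary.
tf_rows = [[1.0986, 0.6931, 0.0], [0.6931, 0.0, 0.6931]]
senses = ["flower", "goal"]

buffer = io.StringIO()
write_arff(buffer, "tf_vectors", 3, tf_rows, senses)
arff_text = buffer.getvalue()
print(arff_text)
```

Writing to a real file only requires passing a handle opened once with open(path, "w", encoding="utf-8") instead of the StringIO buffer; opening the file in append mode inside the loop, as in the code above, also keeps output from earlier runs.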

How do I build a TF vector?

0 answers: