Question

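Dataset: two large text files, one for training and one for testing, in which all words are already tokenized. A sample of the data: The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced "no evidence" that any irregularities took place.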

Question: In Python, how do I replace every word in the test data that does not appear in the training data with the token "<unk>"?

So far, I have built a dictionary with the following code to count how often each word occurs in the file:

# Open the text file and assign it to a variable named "readfile"
readfile= open('C:/Users/amtol/Desktop/NLP/Homework_1/brown-train.txt','r')

writefile=open('C:/Users/amtol/Desktop/NLP/Homework_1/brown-trainReplaced.txt','w')

# Create an empty dictionary 
d = dict()

# Loop through each line of the file
for line in readfile:

    # Split the line into words 
    words = line.split(" ") 

    # Iterate over each word in line 
    for word in words: 
        # Check if the word is already in dictionary 
        if word in d:

        # Increment count of word by 1 
            d[word] = d[word] + 1
        else: 
            # Add the word to dictionary with count 1 
            d[word] = 1

# Replace all words occurring only once in the training data with the token <unk>.

for key in list(d.keys()): 
    line= d[key] 
    if (line==1):
        line="<unk>"
        writefile.write(str(d))
    else:
        writefile.write(str(d))

# Close the output file that we created and wrote the new data to
writefile.close()

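To be honest, the code above does not work with writefile.write(str(d)). I want to write the result to a new text file. With print(key, ":", line) it works and shows the frequency of each word, but only in the console, and no new file is created. If you know why, please let me know as well.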

Answer 1

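First, your task is to replace the words in the test file that do not appear in the train file. Your code never mentions the test file. You have to:

1. Read the train file and collect the words that are in it. This part is mostly fine, but you need to .strip() each line, or the last word of every line will end with a newline. Also, if you don't need the counts (and you don't; you only want to know whether a word is there), a set makes more sense than a dict. Sets are nice because you don't have to care whether an element is already present; you just toss it in. If you absolutely do need the counts, collections.Counter is easier than rolling your own.
2. Read the test file and write to the replacement file as you replace the words in each line. Something like this: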

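with open("test", "rt") as reader:
    with open("replacement", "wt") as writer:
        for line in reader:
            writer.write(replaced_line(line.strip()) + "\n")

Second, your last block of code doesn't make sense :P Rather than checking whether the words from the test file have been seen and replacing the unseen ones, you iterate over the words you saw in the train file and write <unk> if a word was seen only once. That does something, but nothing close to what it is supposed to do.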
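As a rough sketch of the first step (collecting the training words), here are both the set and the collections.Counter variants. The function names `collect_vocabulary` and `count_words` and the in-memory sample are made up for illustration, and `split()` with no argument is a choice of this sketch that assumes whitespace-separated tokens:

```python
from collections import Counter


def collect_vocabulary(lines):
    """Return the set of distinct tokens found in an iterable of lines."""
    seen = set()
    for line in lines:
        # strip() drops the trailing newline; split() with no argument
        # splits on any run of whitespace.
        seen.update(line.strip().split())
    return seen


def count_words(lines):
    """Return a Counter of token frequencies, for when counts are needed."""
    counts = Counter()
    for line in lines:
        counts.update(line.strip().split())
    return counts


# Hypothetical sample standing in for the lines of the training file.
sample = ["the jury said\n", "the jury\n"]
print(sorted(collect_vocabulary(sample)))
print(count_words(sample)["the"])
```

An open file object can be passed in place of `sample`, since iterating over a file yields its lines.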

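What you want to do instead is split each line you read from the test file and iterate over its words: if a word is in the seen set (literally, word in seen), keep it; otherwise replace it with <unk>; then add the result to the output sentence. You can do this in a loop, but here is a comprehension that does it: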
```
new_line = ' '.join(word if word in seen else '<unk>'
                    for word in line.split(' '))
```
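Putting the pieces together, a minimal end-to-end sketch might look like the following. The name `replaced_line` follows the call used earlier, but its body and extra parameters are assumptions; `replace_unseen` and the file names in the usage note are hypothetical. This sketch uses `split()` rather than `split(' ')`, so empty lines stay empty instead of producing an empty-string token.

```python
def replaced_line(line, seen, unk="<unk>"):
    """Return `line` with every token not in `seen` replaced by `unk`."""
    return " ".join(word if word in seen else unk
                    for word in line.split())


def replace_unseen(train_path, test_path, out_path):
    """Write a copy of the test file with unseen words replaced."""
    # Step 1: collect every word that occurs in the training file.
    seen = set()
    with open(train_path, "rt") as reader:
        for line in reader:
            seen.update(line.strip().split())

    # Step 2: rewrite the test file line by line.
    with open(test_path, "rt") as reader, open(out_path, "wt") as writer:
        for line in reader:
            writer.write(replaced_line(line.strip(), seen) + "\n")
```

It could be called as replace_unseen("train.txt", "test.txt", "test-replaced.txt"), with the placeholder names swapped for the real paths. Both files are read one line at a time, so only the set of training words has to fit in memory.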
