Question

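I would like to use a corpus I created myself in Visual Studio Code for MacOSX. I have read about a hundred forum threads, but I can't figure out what I'm doing wrong, as I am quite new to programming.

This question seems to be the closest thing I can find to what I need to do; however, I don't know how to do the following: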

"On a Mac, for example, it would be in ~/nltk_data/corpora. And it looks like you also have to append your new corpus to the __init__.py within .../site-packages/nltk/corpus/."

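When answering, please keep in mind that I am using Homebrew, and please don't permanently disable any other path, in case I need to use the stock NLTK corpus data sets in the same code.

If needed, I can post my coding attempt using "PlaintextCorpusReader" together with the traceback it produces, although I would rather not have to use PlaintextCorpusReader at all. For seamless use, I would prefer to simply copy and paste the .txt files into the appropriate location and use them as in the code appended below.

Thanks.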


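Edit:

Thank you for your response.

I took your advice and moved the folder out of NLTK's corpora.

I have been experimenting with the location of my folder, and I have been getting different tracebacks.

If you say the best way is to use PlaintextCorpusReader, then so be it; however, perhaps for my application I would want to use CategorizedPlaintextCorpusReader?

sys.argv is definitely not what I meant, so I can read up on that later.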

First, here is what happens with my code when I do not use PlaintextCorpusReader and the folder "short_reviews", containing the pos.txt and neg.txt files, is outside the NLP folder:

Traceback (most recent call last):
  File "/Users/jordanXXX/Documents/NLP/bettertrainingdata", line 42, in <module>
    short_pos = open("short_reviews/pos.txt", "r").read
IOError: [Errno 2] No such file or directory: 'short_reviews/pos.txt'

However, when I move the folder "short_reviews" containing the text files into the NLP folder and run the same code, still without PlaintextCorpusReader, the result changes. This is the code:

import nltk
import random
from nltk.corpus import movie_reviews
from nltk.classify.scikitlearn import SklearnClassifier
import pickle

from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC

from nltk.classify import ClassifierI
from statistics import mode

from nltk import word_tokenize

class VoteClassifier(ClassifierI):
    def __init__(self, *classifiers):
        self._classifiers = classifiers

    def classify(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)
        return mode(votes)

    def confidence(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)

        choice_votes = votes.count(mode(votes))
        conf = choice_votes / len(votes)
        return conf

# def main():
#     file = open("short_reviews/pos.txt", "r")
#     short_pos = file.readlines()
#     file.close

short_pos = open("short_reviews/pos.txt", "r").read
short_neg = open("short_reviews/neg.txt", "r").read

documents = []

for r in short_pos.split('\n'):
    documents.append( (r, "pos") )

for r in short_neg.split('\n'):
    documents.append((r, "neg"))

all_words = []

short_pos_words = word.tokenize(short_pos)
short_neg_words = word.tokenize(short_neg)

for w in short_pos_words:
    all_words.append(w. lower())

for w in short_neg_words:
    all_words.append(w. lower())

all_words = nltk.FreqDist(all_words)

When I move the folder "short_reviews" containing the text files into the NLP folder and use PlaintextCorpusReader, the following traceback occurs:

Traceback (most recent call last):
  File "/Users/jordanXXX/Documents/NLP/bettertrainingdata", line 47, in <module>
    for r in short_pos.split('\n'):
AttributeError: 'builtin_function_or_method' object has no attribute 'split'
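
Read together, the two tracebacks suggest two separate problems in the open(...) lines rather than anything specific to NLTK. The IOError appears when "short_reviews/pos.txt" is resolved against a working directory that does not contain that folder, and the AttributeError appears because .read without parentheses yields the method itself instead of the file's text. Below is a minimal sketch of both fixes, assuming Python 3; the helper name read_review_file and the demo folder contents are made up for illustration and are not part of the original script.

```python
import tempfile
from pathlib import Path


def read_review_file(base_dir, name):
    """Return the full text of short_reviews/<name> located under base_dir."""
    # Building the path from an explicit base directory avoids depending on
    # whatever the current working directory happens to be.
    path = Path(base_dir) / "short_reviews" / name
    with open(path, "r", encoding="utf-8") as handle:
        # read() with parentheses calls the method and returns a str;
        # a bare `.read` would only hand back the method object.
        return handle.read()


# Demo data in a temporary folder (hypothetical contents, for illustration only).
base_dir = Path(tempfile.mkdtemp())
(base_dir / "short_reviews").mkdir()
(base_dir / "short_reviews" / "pos.txt").write_text(
    "great fun\nwell acted\n", encoding="utf-8"
)

short_pos = read_review_file(base_dir, "pos.txt")

# short_pos is now a string, so split() works; empty trailing lines are skipped.
documents = [(line, "pos") for line in short_pos.split("\n") if line]
print(documents)
```

In the actual script, base_dir could be Path(__file__).resolve().parent, so that the folder is looked up next to the script regardless of where it is launched from. Further down in the posted code, word.tokenize would presumably fail next, since the name that was imported is word_tokenize.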

Answer 1

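The answer you refer to contains some very poor (or rather, inapplicable) advice. There is no reason to place your own corpus in nltk_data, or to hack nltk.corpus.__init__.py so that it loads like a native corpus. In fact, do not do these things.

You should use PlaintextCorpusReader. I don't understand your reluctance to do so, but if your files are plain text, it is the right tool. Supposing you have a folder NLP/bettertrainingdata, you can build a reader that loads all .txt files in that folder like this:

myreader = nltk.corpus.reader.PlaintextCorpusReader(r"NLP/bettertrainingdata", r".*\.txt")

If you add new files to the folder, the reader will find and use them. If what you want is to be able to use your script with other folders, then just do so: you don't need a different reader, you need to learn about sys.argv. If you are after a categorized corpus with pos.txt and neg.txt, then you need a CategorizedPlaintextCorpusReader (see its reference documentation). If it is something else yet that you want, please edit your question to explain what you are trying to do.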
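
As a rough illustration of the readers mentioned above, here is a sketch that builds both a plain and a categorized reader over one folder and takes the folder name from an argument list. It assumes NLTK is installed; the function names corpus_folder and build_readers, the cat_pattern choice (category taken from the file name), and the demo file contents are assumptions made for this example, not something prescribed by the answer.

```python
import tempfile
from pathlib import Path

from nltk.corpus.reader import CategorizedPlaintextCorpusReader, PlaintextCorpusReader


def corpus_folder(argv, default):
    """Return the folder named on the command line, or `default` if none was given."""
    return Path(argv[1]) if len(argv) > 1 else Path(default)


def build_readers(folder):
    """Create a plain and a categorized reader over every .txt file in `folder`."""
    root = str(folder)
    plain = PlaintextCorpusReader(root, r".*\.txt")
    # cat_pattern derives the category from the file name: pos.txt -> "pos".
    categorized = CategorizedPlaintextCorpusReader(
        root, r".*\.txt", cat_pattern=r"(\w+)\.txt"
    )
    return plain, categorized


# Demo corpus in a temporary folder (hypothetical contents, for illustration only).
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "pos.txt").write_text("a fine film\nworth watching\n", encoding="utf-8")
(demo_dir / "neg.txt").write_text("a dull film\n", encoding="utf-8")

# In a real script, pass sys.argv instead of this hand-written list.
folder = corpus_folder(["script.py", str(demo_dir)], default="short_reviews")
plain, categorized = build_readers(folder)

print(plain.fileids())
print(categorized.categories())
print(list(categorized.words(categories="pos")))
```

Because the readers only point at a folder, the stock NLTK corpora such as movie_reviews remain usable in the same script, and nothing under nltk_data or site-packages needs to change.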

How to create a sentiment analysis corpus in NLTK?

Edit:
