Question

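I am very new to NLP text classification and am trying to understand the basics. spaCy seems like the best fit for my work and experience. I have read through all the documentation and ran the example code from https://spacy.io/usage/training#example-textcat with the default plac arguments, using my own output folder. Then I wrote a test file: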

import spacy

output_dir="train_output_orig"

test_text = [
    "This movie sucked",
    "It's a great one",
    "I've watched a lot of films of this kind. A lot of them were more attractive for me",
    "This is a great movie",
    "This movie is terrible",
    "I love this movie",
    "This is a bad film",
    "So fucking dung!",
    "Very involving work with developed characters"
    ]
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)
for text in test_text:
    print(text, nlp2(text).cats)

and got these results:

Loading from train_output_orig
This movie sucked {'POSITIVE': 0.6549780368804932}
It's a great one {'POSITIVE': 0.7863456606864929}
I've watched a lot of films of this kind. A lot of them were more attractive for me {'POSITIVE': 0.7664909958839417}
This is a great movie {'POSITIVE': 0.7897435426712036}
This movie is terrible {'POSITIVE': 0.4777064323425293}
I love this movie {'POSITIVE': 0.7530838847160339}
This is a bad film {'POSITIVE': 0.46895521879196167}
So fucking dung! {'POSITIVE': 0.6296740174293518}
Very involving work with developed characters {'POSITIVE': 0.8538092970848083}

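Is this normal for a spaCy model, or am I doing something wrong? I mean, the border between the "positive" and "negative" labels is very narrow. Even the unambiguous "This is a bad film" gets a 0.46 positive rating. "I love this movie" gets only 0.75, while "Very involving work with developed characters" gets 0.85. Meanwhile, the phrase "This movie sucked", which is suggested in the original spaCy usage docs, gets a positive score of 0.65!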

Thanks in advance for your answers.

Answer 1

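Text classification returns a score for every label in the model. The closer a label's score is to 0, the less the model believes that label applies; the closer it is to 1, the more confident the model is that the label applies.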
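To make that concrete, here is a minimal sketch of turning a score dict like `doc.cats` into a yes/no decision. The helper name `predict_label` and the 0.5 cutoff are my own assumptions for illustration, not something the spaCy example script defines; the scores are copied from the question's output.

```python
def predict_label(cats, label="POSITIVE", threshold=0.5):
    """Turn a score dict shaped like doc.cats into a decision for one label.

    `threshold` is an assumed cutoff, not a spaCy default.
    """
    score = cats[label]
    return score >= threshold, score


# Scores copied from the output posted in the question.
examples = {
    "This is a great movie": {"POSITIVE": 0.7897435426712036},
    "This is a bad film": {"POSITIVE": 0.46895521879196167},
    "This movie sucked": {"POSITIVE": 0.6549780368804932},
}

for text, cats in examples.items():
    is_positive, score = predict_label(cats)
    print(f"{text!r}: positive={is_positive} (score={score:.2f})")
```

With a 0.5 cutoff, "This is a bad film" lands on the negative side, but "This movie sucked" is still classified as positive, which is exactly the problem described in the question.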

If your negative texts do not score much differently from your positive ones, then I would guess you need more training data.
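One rough way to put a number on "not much different" is to compare the average POSITIVE score of the clearly negative sentences with that of the clearly positive ones. This is only a sketch: the grouping into negative and positive below is my own hand labelling of six sentences from the question (the ambiguous ones are left out), and `mean_score` is a hypothetical helper.

```python
from statistics import mean

# POSITIVE scores copied from the question; the grouping is hand-labelled
# here for illustration and is an assumption, not part of the original post.
negative_texts = {
    "This movie sucked": 0.6549780368804932,
    "This movie is terrible": 0.4777064323425293,
    "This is a bad film": 0.46895521879196167,
}
positive_texts = {
    "It's a great one": 0.7863456606864929,
    "This is a great movie": 0.7897435426712036,
    "I love this movie": 0.7530838847160339,
}


def mean_score(scores):
    """Average POSITIVE score over a {text: score} mapping."""
    return mean(scores.values())


gap = mean_score(positive_texts) - mean_score(negative_texts)
print(f"mean positive: {mean_score(positive_texts):.3f}")
print(f"mean negative: {mean_score(negative_texts):.3f}")
print(f"gap: {gap:.3f}")
```

On these six sentences the gap comes out at roughly 0.24, so the two groups are separated, just not by much.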

Answer 2

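The reviews in the training data are typically much longer than the examples above. If you try a few examples from the test portion of the dataset, you will see scores more like {'POSITIVE': 0.9939502477645874, 'NEGATIVE': 0.006049795541912317}.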

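Also, a model trained with that example script should have both POSITIVE and NEGATIVE labels in cats, so if you are only getting POSITIVE, something may have gone wrong? (That said, the cats for these short examples look basically like what I would expect.)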
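A quick sanity check for the missing label could look like the sketch below. The helpers `missing_labels` and `margin` are hypothetical names of my own; the two score dicts are copied from the question and from this answer. With the loaded pipeline you could also inspect `nlp2.get_pipe("textcat").labels` to see which labels the component was trained with, assuming the component is named "textcat" as in the example script.

```python
EXPECTED_LABELS = {"POSITIVE", "NEGATIVE"}


def missing_labels(cats, expected=EXPECTED_LABELS):
    """Return the expected labels that are absent from a doc.cats-style dict."""
    return sorted(set(expected) - set(cats))


def margin(cats):
    """Distance between the two scores; larger means a more decisive prediction."""
    return abs(cats["POSITIVE"] - cats["NEGATIVE"])


# Copied from the question's output (only one label present).
from_question = {"POSITIVE": 0.6549780368804932}
# Copied from the score quoted in this answer (both labels present).
from_answer = {"POSITIVE": 0.9939502477645874, "NEGATIVE": 0.006049795541912317}

print(missing_labels(from_question))  # ['NEGATIVE']
print(missing_labels(from_answer))    # []
print(f"margin: {margin(from_answer):.3f}")
```

If NEGATIVE is missing from every prediction, the model in the output folder was probably trained with a different label setup than the current example script.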

spaCy text classification scores
