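I'm very new to NLP text classification and am trying to understand the basics. spaCy seems like the best fit for my work and experience. I read through all the documentation and ran the example code from https://spacy.io/usage/training#example-textcat with the default plac arguments, changing only the output folder. Then I wrote a test file: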
import spacy
output_dir="train_output_orig"
test_text = [
    "This movie sucked",
    "It's a great one",
    "I've watched a lot of films of this kind. A lot of them were more attractive for me",
    "This is a great movie",
    "This movie is terrible",
    "I love this movie",
    "This is a bad film",
    "So fucking dung!",
    "Very involving work with developed characters",
]
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)
for text in test_text:
    # doc.cats maps each textcat label to its score
    print(text, nlp2(text).cats)
and got these results:
Loading from train_output_orig
This movie sucked {'POSITIVE': 0.6549780368804932}
It's a great one {'POSITIVE': 0.7863456606864929}
I've watched a lot of films of this kind. A lot of them were more attractive for me {'POSITIVE': 0.7664909958839417}
This is a great movie {'POSITIVE': 0.7897435426712036}
This movie is terrible {'POSITIVE': 0.4777064323425293}
I love this movie {'POSITIVE': 0.7530838847160339}
This is a bad film {'POSITIVE': 0.46895521879196167}
So fucking dung! {'POSITIVE': 0.6296740174293518}
Very involving work with developed characters {'POSITIVE': 0.8538092970848083}
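Is this normal for a spaCy model, or am I doing something wrong? I mean that the margin between the "positive" and "negative" labels is very narrow. Even the unambiguous "This is a bad film" gets a POSITIVE score of 0.46. "I love this movie" gets only 0.75, while "Very involving work with developed characters" gets 0.85. Meanwhile "This movie sucked", the phrase suggested in the original spaCy usage docs, gets a POSITIVE score of 0.65!
Thanks in advance for your answers.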
Answer 0 (score: 0)
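Text classification returns a score for every label in the model. The closer a label's score is to 1, the more confident the model is that the label applies; the closer it is to 0, the less confident the model is that the label applies.
If the scores for your negative texts don't differ much from the scores for your positive ones, then I would guess you need more training data.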
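To make the "how close to 0 or 1" reading concrete, here is a minimal sketch that takes a cats dict and reports how far the POSITIVE score sits from a cut-off. The helper name interpret_score and the 0.5 threshold are my own assumptions for illustration, not something spaCy prescribes; the scores are copied from the output in the question.

```python
def interpret_score(cats, label="POSITIVE", threshold=0.5):
    """Summarise one label's score from a cats dict.

    The 0.5 threshold is an assumed cut-off for illustration only.
    """
    score = cats[label]
    return {
        "score": score,
        "predicted": score >= threshold,    # does the label apply under this cut-off?
        "margin": abs(score - threshold),   # distance from the cut-off
    }


# Scores taken from the question's output.
examples = {
    "This is a bad film": {"POSITIVE": 0.46895521879196167},
    "I love this movie": {"POSITIVE": 0.7530838847160339},
    "This movie sucked": {"POSITIVE": 0.6549780368804932},
}

for text, cats in examples.items():
    result = interpret_score(cats)
    print(f"{text!r}: predicted={result['predicted']}, margin={result['margin']:.3f}")
```

A small margin, as with "This is a bad film", means the score sits near the cut-off, which is the narrow gap the question describes.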
Answer 1 (score: 0)
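The reviews in the training data are typically much longer than the examples above. If you try some of the examples from the test portion of the dataset, you'll see scores more like {'POSITIVE': 0.9939502477645874, 'NEGATIVE': 0.006049795541912317}.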
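Also, a model trained with that example script should have both POSITIVE and NEGATIVE labels in cats, so if you're only getting POSITIVE, something may have gone wrong? (The cats values for these short examples basically look like what I'd expect, though.)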
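A quick way to check for the missing label is to compare the keys of cats against the labels you expect. This is only a sketch: missing_labels is a hypothetical helper of mine, and the two dicts are copied from the question's output and from the example score above.

```python
def missing_labels(cats, expected=("POSITIVE", "NEGATIVE")):
    """Return the expected labels that do not appear in a cats dict."""
    return [label for label in expected if label not in cats]


# cats as printed in the question (only POSITIVE present)
question_cats = {"POSITIVE": 0.6549780368804932}
# cats as I would expect them from a model trained with the example script
answer_cats = {"POSITIVE": 0.9939502477645874, "NEGATIVE": 0.006049795541912317}

print(missing_labels(question_cats))
print(missing_labels(answer_cats))
```

With the loaded model from the question you would call it as missing_labels(nlp2(text).cats). As far as I know, nlp2.get_pipe("textcat").labels also lists the labels the component was trained with, so that is worth checking too.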