Question

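I am using Reuters data to practice with Weka. I use the StringToWordVector filter to transform my string data (shown below) so that I can analyze the articles and see which words predict the article type. The original dataset marked the article type as TRUE/FALSE, but I converted it to 0/1. However, the filter refuses to work on the "review" string attribute for this one ARFF file.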

I applied only the following StringToWordVector filter, with only the review attribute checked:

weka.filters.unsupervised.attribute.StringToWordVector -R first-last -W 1000 -prune-rate -1.0 -N 0 -stemmer weka.core.stemmers.NullStemmer -M 1 -tokenizer "weka.core.tokenizers.WordTokenizer -delimiters \" \\r\\n\\t.,;:\\\'\\\"()?!\""

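I get this error: "Problem filtering instances: Attribute names are not unique. Cause: sentiment", even though only the review attribute is checked for the filter.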

Here is the header of my dataset, followed by a few instances that show the format:

@relation text_files
@attribute review string
@attribute sentiment {0, 1}
@data
"cocoa the the cocoa the early the levels its the the this the ended the mln against at the that cocoa the to crop cocoa to crop around mln sales at mln the to this cocoa export the their cocoa prices to to per to offer sales at to dlrs per to to crop sales to at dlrs at dlrs at dlrs per sales at at at at to dlrs at at dlrs the currency sales at to dlrs dlrs dlrs the currency sales at at dlrs at at dlrs at at sales at mln against the crop mln against the the to to the cocoa commission reuter", 0
"prices reserve the agriculture department reported the reserve price loan call price price wheat corn 1986 loan call price price reserves grain wheat per reuter", 0
"grain crop their products to to wheat export the export wheat oil oil reuter", 0
"inc the stock corp its dlrs oil to dlrs production its the company to its to profit to reuter", 0
"products stock split products inc its stock split its common shares shareholders the company its to to shareholders at the the stock mln to mln reuter", 0

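Does anyone have an idea why this is happening? I thought there might be a conflict because 0 and 1 can also occur naturally as part of words in the text. I also wondered whether an extra space is needed before the opening quote of each string, after the previous instance.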

Answer 1

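The problem is that the filter converts every term in the strings into an attribute. There must be a term "review" or "sentiment" somewhere in your data section, so the attribute name ends up duplicated.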

So rename these two attributes to something like "myreview" and "mysentiment", or to anything else that is unlikely to occur in the data. It should then work.
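To illustrate the collision described above, here is a minimal Python sketch. It is not Weka's implementation: `output_attribute_names` and `duplicate_names` are hypothetical helpers, the two documents are made up, and the sketch assumes the filter keeps the non-string attributes and appends one attribute per distinct word.

```python
from collections import Counter


def output_attribute_names(kept_attributes, documents, prefix=""):
    """Hypothetical model of the filter output: the kept attribute
    names followed by one (optionally prefixed) name per distinct word."""
    words = []
    for doc in documents:
        for word in doc.split():
            if word not in words:
                words.append(word)
    return list(kept_attributes) + [prefix + w for w in words]


def duplicate_names(names):
    """Return the names that occur more than once, sorted."""
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)


# Made-up documents; the second one contains the word "sentiment".
docs = ["cocoa prices rise", "market sentiment on cocoa"]

# The class attribute is called "sentiment": the word clashes with it.
clash = duplicate_names(output_attribute_names(["sentiment"], docs))

# After renaming the class attribute, every name is unique.
no_clash = duplicate_names(output_attribute_names(["mysentiment"], docs))

print(clash)
print(no_clash)
```

In this sketch the first call reports the duplicated name and the second reports none, which is the effect the renaming is meant to have.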

Answer 2

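I ran into the same problem because the word "domain" appeared in my data, which confused the filter. My solution was to remove every occurrence of "domain" from the data and keep "domain" only in the @attribute declaration.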

Answer 3

The simplest way to avoid these attribute name clashes is to use a prefix for the generated attributes.

The prefix can be supplied through the -P command-line option, the attributeNamePrefix option in the GenericObjectEditor, or the setAttributeNamePrefix method in Java code.

See the Javadoc of the StringToWordVector filter.
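As a rough illustration of why a prefix helps, the following Python sketch compares generated names against an existing attribute name with and without a prefix. It does not call Weka: `word_attribute_names` is a hypothetical helper, the word list is made up, and the prefix `w_` is an arbitrary choice.

```python
def word_attribute_names(words, prefix=""):
    """Hypothetical helper: one attribute name per word, with an optional prefix."""
    return [prefix + word for word in words]


# Existing (non-string) attribute in the dataset header.
existing = {"sentiment"}

# Made-up vocabulary extracted from the text; it contains "sentiment".
words = ["cocoa", "sentiment", "prices"]

# Without a prefix, the generated name "sentiment" clashes with the header.
clash_plain = existing & set(word_attribute_names(words))

# With a prefix, the generated name becomes "w_sentiment" and no longer clashes.
clash_prefixed = existing & set(word_attribute_names(words, prefix="w_"))

print(clash_plain)
print(clash_prefixed)
```

Unlike renaming the header attributes, a prefix keeps working when new words show up in later data, as long as no existing attribute name starts with that prefix.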

Weka - StringToWordVector filter not working

3 answers: