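I scraped some TripAdvisor content (id, quote, rating, full review) into a CSV file and tried to filter the corpus down to only the documents with a 5* rating, but it does not seem to work.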
> x <- read.csv ("test.csv", header = TRUE, stringsAsFactors = FALSE)
> (corp <- VCorpus(DataframeSource (x),
+ readerControl = list(language = "eng")))
I get the following:
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 50
Now, when I filter, it reports 0 documents with a 5* rating, which cannot be right.
> idx <- meta(corp, "rating") == '5'
> corp [idx]
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 0
Am I missing something in how I create the corpus?
Output of str(x), as requested:
'data.frame': 682 obs. of 6 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ id : chr "rn360260358" "rn359340351" "rn356397660" "rn355961772" ...
$ quote : chr "Nice but not unique " "Beautiful scenery of German forest with a lake" "Beautiful Lake and Amazing Mountain Views" "Beautiful!" ...
$ rating : chr "3" "5" "5" "5" ...
$ date : chr "Reviewed 5 March 2016" "Reviewed 29 February 2016" "Reviewed 27 February 2016" ...
$ reviewnospace: chr "We visited the lake with our daughters in March. All s...
Answer 0: (score: 0)
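Your import method does not pass any metadata at all. DataframeSource(x) passes every variable of x as document text, so the corpus has no "rating" tag to filter on and the comparison matches nothing.
Moreover, whichever source you use, there is no simple, automatic way to attach a whole set of metadata in tm. Instead, we can use VectorSource(x$reviewnospace) (assuming that is the column holding the text) and assign the metadata in a second step. Your indexing then works as expected.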
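library(tm)
# Use VectorSource so that only the review text becomes document content
corp <- VCorpus(VectorSource(x$reviewnospace), readerControl = list(language = "eng"))
# Assign the rating column as document-level (indexed) metadata
meta(corp, tag = "rating") <- x$rating
# Keep only the documents whose rating is "5"
idx <- meta(corp, "rating") == '5'
corp[idx]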
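If you want to check this approach without the original CSV, here is a minimal self-contained sketch. The three-row data frame is invented purely for illustration and only mimics the column names from the question. The loop that attaches every non-text column as metadata is one possible way to handle several columns, not something tm does for you.

```r
library(tm)

# Hypothetical stand-in for test.csv; these values are made up for illustration
x <- data.frame(
  id = c("a1", "a2", "a3"),
  rating = c("3", "5", "5"),
  reviewnospace = c("Nice but crowded.", "Beautiful lake.", "Amazing views."),
  stringsAsFactors = FALSE
)

# Only the review text becomes document content
corp <- VCorpus(VectorSource(x$reviewnospace),
                readerControl = list(language = "eng"))

# Attach each remaining column as indexed (document-level) metadata
for (col in setdiff(names(x), "reviewnospace")) {
  meta(corp, tag = col) <- x[[col]]
}

# Pull the rating column out as a plain vector before comparing
idx <- meta(corp)$rating == "5"
five_star <- corp[idx]
length(five_star)
```

Note that rating is read as character here, as in the question, so the comparison is against the string "5". Also, as far as I know, more recent tm releases changed DataframeSource to expect doc_id and text columns and to treat the remaining columns as metadata, so it is worth checking ?DataframeSource for the version you have installed.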