Question

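I have a similar problem. I downloaded a large file of tweets from the web, saved it as data.txt, and loaded it into R using RStudio (Import Dataset). But I get an error and cannot proceed.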

Here is what I did, step by step, and the errors I got.

# required packages
library(twitteR)
library(plyr)
library(ggplot2)
library(wordcloud)
library(RColorBrewer)
library(tm)
library(XML)
library(SnowballC) 


 data<- read.csv("~/data/datasStream.txt", header=FALSE , sep = "," )

I have 3425 observations and 97 variables.

## I load it into a corpus

corpus = Corpus(VectorSource(data)) ## 97 elements 17.4 MB

## I cleaned the data using

corpus = tm_map (corpus, tolower)

corpus = tm_map (corpus, stripWhitespace)

corpus = tm_map (corpus, stemDocument)

corpus = tm_map (corpus, PlainTextDocument)

# remove unnecessary spaces
corpus = gsub("[ \t]{2,}", "", corpus)
corpus = gsub("^\\s+|\\s+$", "", corpus)


# remove NAs in corpus
corpus = corpus[!is.na(corpus)]

dtm = DocumentTermMatrix(corpus)

dtm

<<DocumentTermMatrix (documents: 97, terms: 151132)>>
Non-/sparse entries: 201231/14458573
Sparsity           : 99%
Maximal term length: 1775
Weighting          : term frequency (tf)


adtm <- removeSparseTerms(dtm, 0.75)

adtm
<<DocumentTermMatrix (documents: 97, terms: 270)>>
Non-/sparse entries: 11962/14228
Sparsity           : 54%
Maximal term length: 33
Weighting          : term frequency (tf)


df1 =  as.data.frame (m=as.matrix (adtm)) 

Error in as.data.frame.default(dtm) : cannot coerce class "c("DocumentTermMatrix", "simple_triplet_matrix")" to a data.frame

How can I fix this? I want to run k-means clustering and build a word cloud from the data.

Here is a sample of the data:

{"created_at":"Wed Feb 27 14:24:12 +0000 2013","id":306771719996186625,"id_str":"306771719996186625","text":"@Joeypearce we have another beauty coming to see the car I'm too clean :-/ I'll see you when I finish work! X","source":"\u003ca href=\"http://twitter.com/download/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c/a\u003e","truncated":false,"in_reply_to_status_id":306763650054627328,"in_reply_to_status_id_str":"306763650054627328","in_reply_to_user_id":127665137,"in_reply_to_user_id_str":"127665137","in_reply_to_screen_name":"Joeypearce","user":{"id":274997668,"id_str":"274997668","name":"Ell Beaton\u00a9","screen_name":"Ell_Beaton","location":"","url":null,"description":"Go Glen, Or Go Home.","protected":false,"followers_count":147,"friends_count":85,"listed_count":0,"created_at":"Thu Mar 31 12:44:39 +0000 2011","favourites_count":132,"utc_offset":0,"time_zone":"London","geo_enabled":true,"verified":false,"statuses_count":1087,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"1A1B1F","profile_background_image_url":"http://a0.twimg.com/profile_background_images/768018009/7a0b3fe303f234e8d6a5429bb9ede9a9.jpeg","profile_background_image_url_https":"https://si0.twimg.com/profile_background_images/768018009/7a0b3fe303f234e8d6a5429bb9ede9a9.jpeg","profile_background_tile":true,"profile_image_url":"http://a0.twimg.com/profile_images/3304123896/606a7413bce208a1a38b1eb41fd017c9_normal.jpeg","profile_image_url_https":"https://si0.twimg.com/profile_images/3304123896/606a7413bce208a1a38b1eb41fd017c9_normal.jpeg","profile_banner_url":"https://si0.twimg.com/profile_banners/274997668/1361751912","profile_link_color":"F50E0E","profile_sidebar_border_color":"000000","profile_sidebar_fill_color":"252429","profile_text_color":"666666","profile_use_background_image":true,"default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":{"type":"Point","coordinates":[52.43718380,-2.14324244]},"coordinates":{"type":"Point","coordinates":[-2.14324244,52.43718380]},"place":{"id":"ddeec3dc241e5b6a","url":"http://api.twitter.com/1/geo/id/ddeec3dc241e5b6a.json","place_type":"city","name":"Dudley","full_name":"Dudley, Dudley","country_code":"GB","country":"United Kingdom","bounding_box":{"type":"Polygon","coordinates":[[[-2.191947,52.426012],[-2.191947,52.558221],[-2.011849,52.558221],[-2.011849,52.426012]]]},"attributes":{}},"contributors":null,"retweet_count":0,"entities":{"hashtags":[],"urls":[],"user_mentions":[{"screen_name":"Joeypearce","name":"Joey Pearce","id":127665137,"id_str":"127665137","indices":[0,11]}]},"favorited":false,"retweeted":false,"filter_level":"medium"}

Answer 1

This is one of the pains of text mining in R. Both dtm and the subsequent adtm carry two classes.

class(dtm)
[1] "DocumentTermMatrix"    "simple_triplet_matrix"

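The transpose, a term-document matrix, behaves the same way. You can work around this by first converting dtm or adtm (or tdm, if you create one) to a matrix. I tested this on 1,000 tweets and the coercion worked.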

adtm.m<-as.matrix(adtm)
adtm.df<-as.data.frame(adtm.m)

Or you can nest the functions:

adtm.df<-as.data.frame(as.matrix(adtm))

It is a bit clunky, but it gets the job done. You can check the new class here:

class(adtm.df)
[1] "data.frame"
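
Once the counts are in an ordinary data frame, the k-means and word-cloud steps from the question can follow. Below is a minimal sketch, not a definitive recipe. It assumes a small hand-made count matrix `toy.m` (hypothetical, standing in for `as.matrix(adtm)`) and k = 2 clusters, a value picked only for illustration.

```r
# toy.m is a hypothetical stand-in for as.matrix(adtm):
# rows are documents, columns are terms, cells are term counts.
toy.m <- matrix(
  c(3, 0, 1, 0,
    2, 1, 0, 0,
    0, 4, 0, 2,
    0, 3, 1, 3),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("car", "work", "clean", "home"))
)
toy.df <- as.data.frame(toy.m)

# k-means on the document rows; centers = 2 is an assumption for this toy data.
set.seed(42)
km <- kmeans(toy.df, centers = 2, nstart = 10)
km$cluster

# Total count per term, the usual input for a word cloud.
term.freq <- sort(colSums(toy.df), decreasing = TRUE)
term.freq

# With the wordcloud package loaded (as in the question), the plot would be:
# wordcloud(words = names(term.freq), freq = term.freq, min.freq = 1)
```

With real tweets you would replace `toy.m` with `as.matrix(adtm)` and choose the number of clusters yourself.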

Answer 2

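This happens because R's coercion code, quite rightly IMHO, refuses to try to convert an arbitrary class to a data frame. There are two likely reasons. The usual one is that the class in question may be "ragged", i.e. any attempt to turn it into a data.frame would produce rows or columns of unequal length. The second is that no coercion method is defined for the object in question, which is an oversight by whoever wrote the package. As far as I know, that is a pretty rare situation.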

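You may need to extract the records inside the object manually (e.g. with a loop or some other construct) and work out how to build a matrix-like object from them.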
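
To illustrate the point, here is a rough sketch in base R. The class `toy_triplet` and its fields are hypothetical, invented for this example; it only loosely imitates a sparse triplet layout and is not the real `simple_triplet_matrix` implementation. Without a coercion method the call fails; after rebuilding a dense matrix by hand, it succeeds.

```r
# A hypothetical sparse object: only the non-zero cells are stored.
triplet <- structure(
  list(i = c(1L, 2L, 2L),           # row index of each non-zero cell
       j = c(1L, 1L, 2L),           # column index of each non-zero cell
       v = c(3, 1, 4),              # the non-zero values
       nrow = 2L, ncol = 2L,
       dimnames = list(c("doc1", "doc2"), c("apple", "banana"))),
  class = "toy_triplet"
)

# No as.data.frame method exists for this class, so the default method errors.
msg <- tryCatch(as.data.frame(triplet), error = function(e) conditionMessage(e))
msg

# Manual extraction: fill a dense matrix cell by cell, then coerce that.
as.data.frame.toy_triplet <- function(x, ...) {
  m <- matrix(0, nrow = x$nrow, ncol = x$ncol, dimnames = x$dimnames)
  m[cbind(x$i, x$j)] <- x$v
  as.data.frame(m, ...)
}

df <- as.data.frame(triplet)
df
```

For the objects in the question, `as.matrix()` already does this extraction for you, so a hand-written loop should only be needed for classes that offer no such conversion.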

