I have just started doing text analysis in R. When I read in some sample text data, I get the following result.
sms_raw <- read.csv("sms_spam.csv", stringsAsFactors = FALSE)
> str(sms_raw)
'data.frame': 5559 obs. of 2 variables:
$ type : chr "ham" "ham" "ham" "spam,\"complimentary 4 STAR Ibiza
Holiday or £10,000 cash needs your URGENT collection. 09066364349 NOW from
Landline not to l"| __truncated__ ...
$ text.........: chr "Hope you are having a good week. Just checking
in;;;;;;;;;" "K..give back my thanks.;;;;;;;;;" "Am also doing in cbe only.
But have to pay.;;;;;;;;;" "" ...
It looks to me as if the variables were not separated correctly. Inspecting the data further with the head function, I get the following:
head(sms_raw)
type
1
ham
2
ham
3
ham
4 spam,"complimentary 4 STAR Ibiza Holiday or £10,000 cash needs your
URGENT collection. 09066364349 NOW from Landline not to lose out!
Box434SK38WP150PPM18+";;;;;;;;;
5
spam
6
ham
text.........
1
Hope you are having a good week. Just checking in;;;;;;;;;
2
K..give back my thanks.;;;;;;;;;
3
Am also doing in cbe only. But have to pay.;;;;;;;;;
Does anyone have a suggestion for how to fix this?
Answer 0 (score: 0)
Try data.table::fread("sms_spam.csv", stringsAsFactors = FALSE, sep = ";")
input_file <- readLines("/path/of/sms_spam.csv")
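A minimal sketch of one way to clean the file before parsing. It assumes (the question does not confirm this) that each line is comma-separated and simply padded with a run of trailing semicolons, as the str() and head() output suggests. The temporary file and its contents are hypothetical stand-ins for sms_spam.csv, and raw_lines, clean_lines and sms_clean are illustrative names.

```r
# Hypothetical miniature of the file: comma-separated fields, with every
# line padded by trailing semicolons (an assumption based on the output above).
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "type,text;;;;;;;;;",
  "ham,Hope you are having a good week. Just checking in;;;;;;;;;",
  "ham,K..give back my thanks.;;;;;;;;;",
  "spam,\"complimentary 4 STAR Ibiza Holiday or 10,000 cash\";;;;;;;;;"
), tmp)

# Read the file as plain lines so nothing is parsed yet.
raw_lines <- readLines(tmp)

# Remove the run of semicolons at the end of each line.
clean_lines <- sub(";+$", "", raw_lines)

# Parse the cleaned lines as ordinary CSV; the quoted field keeps its comma.
sms_clean <- read.csv(text = clean_lines, stringsAsFactors = FALSE)
str(sms_clean)
```

If the real file matches that assumption, replacing tmp with the path to sms_spam.csv should give a data frame with separate type and text columns. Rows that are quoted differently in the real file may still need separate handling.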