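I have 50,000 lines of conversation between two people. A typical line looks like this:
2014/07/06, 10:40:42 PM: Franckess ©: I'll leave my student card with him
I would like to split each line so that I get a data frame with the following variables: Date, Time, Person, and Message.
I want to do this in R (no Excel), and then move on to sentiment analysis.
Can anyone help me?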
Answer 0 (score: 4)
Would something like this work?
> string <- "2014/07/06, 10:40:42 PM: Franckess ©: I'll leave my student card with him"
> s <- strsplit(string, "(, )|(: )|( [[:print:]]: )")[[1]]
> names(s) <- c("Date", "Time", "Person", "Message")
> data.frame(as.list(s))
# Date Time Person Message
# 1 2014/07/06 10:40:42 PM Franckess I'll leave my student card with him
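The strsplit regular expression "(, )|(: )|( \\xA9: )" is made of three alternatives:
(, ) matches a comma followed by a space, |
or (: ) matches a colon followed by a space, |
or ( \\xA9: ) matches a space, then the copyright symbol, then a colon and another space.
(The call above uses [[:print:]], any single printable character, in place of \\xA9 to match the symbol.)
To build a data frame from multiple strings, use functions such as do.call and rbind to bind the split pieces together.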
> dc <- do.call(rbind, strsplit(string, "(, )|(: )|( \\xA9: )"))
> colnames(dc) <- c("Date", "Time", "Person", "Message")
> as.data.frame(dc)
# Date Time Person Message
# 1 2014/07/06 10:40:42 PM Franckess I'll leave my student card with him
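The do.call/rbind idea is easier to see with more than one input line. The following is a minimal sketch, not the answerer's exact code: the names chat, parts, and chat_df are illustrative, and the split pattern is a simplified variant that treats " ©" as an optional prefix of the ": " separator. It assumes no message contains ", " or ": "; such a line would split into more than four pieces.

```r
# Two sample lines in the format from the question ("\u00A9" is the copyright sign).
chat <- c(
  "2014/07/06, 10:40:42 PM: Franckess \u00A9: I'll leave my student card with him",
  "2014/07/06, 10:38:34 PM: Viv M.: I can just fetch it from him"
)

# Split on ", " or ": ", swallowing an optional " \u00A9" right before the colon.
parts <- strsplit(chat, ", |( \u00A9)?: ")

# Guard: every line must yield exactly four pieces,
# otherwise rbind() would silently recycle values.
stopifnot(all(lengths(parts) == 4))

# Stack the character vectors row by row, then convert to a data frame.
chat_df <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(chat_df) <- c("Date", "Time", "Person", "Message")
chat_df
```

The stopifnot() guard is a design choice for a 50,000-line input: it stops early on lines that do not fit the expected shape instead of producing misaligned rows.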
Answer 1 (score: 1)
You could also do this:
library(stringr)
str1 <- c("2014/07/06, 10:40:42 PM: Franckess ©: I'll leave my student card with him", "2014/07/06, 10:38:34 PM: Viv M.: I can just fetch it from him")
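# Replace every ":" that is followed by a space with ",".
# Older stringr versions wrapped the pattern in perl(); current versions use regex().
str2 <- str_replace_all(str1, regex(":(?= )"), ",")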
setNames(as.data.frame(do.call(rbind, str_split(str2, ", ")), stringsAsFactors=F), c("Date", "Time", "Person", "Message")) # split based on `, `
# Date Time Person Message
#1 2014/07/06 10:40:42 PM Franckess © I'll leave my student card with him
#2 2014/07/06 10:38:34 PM Viv M. I can just fetch it from him
You can also use:
read.csv(text=str2, sep=",",header=F,stringsAsFactors=F)
# V1 V2 V3 V4
#1 2014/07/06 10:40:42 PM Franckess © I'll leave my student card with him
#2 2014/07/06 10:38:34 PM Viv M. I can just fetch it from him
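The read.csv route can also be written without stringr. The sketch below is one possible base R variant under stated assumptions: gsub() with perl = TRUE stands in for the stringr replacement, col.names supplies the column names directly, and strip.white = TRUE is added because fields can otherwise keep the space that followed each comma. The names chat, chat_csv, and chat_df are illustrative. As with the approach above, a message that itself contains a comma would spill into extra columns.

```r
chat <- c(
  "2014/07/06, 10:40:42 PM: Franckess \u00A9: I'll leave my student card with him",
  "2014/07/06, 10:38:34 PM: Viv M.: I can just fetch it from him"
)

# Turn every ":" that is followed by a space into ",".
# The lookahead (?= ) needs perl = TRUE in base R.
chat_csv <- gsub(":(?= )", ",", chat, perl = TRUE)

# Parse the comma-separated text; each vector element is read as one line.
chat_df <- read.csv(
  text = chat_csv,
  header = FALSE,
  col.names = c("Date", "Time", "Person", "Message"),
  strip.white = TRUE,        # drop the space left after each comma
  stringsAsFactors = FALSE
)
chat_df
```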
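Answer 2 (score: 0)
When I parse structured strings, I like to use a regular expression with capture groups and then extract the captured parts with the regcapturedmatch.R helper function. For example: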
msg<-c("2014/07/06, 10:40:42 PM: Franckess ©: I'll leave my student card with him", "2014/07/06, 10:38:34 PM: Viv M.: I can just fetch it from him")
m<-regexpr("([\\d/]+), ([\\d: AMP]+): (.*): (.*)$",msg, perl=T)
dd<-data.frame(do.call(rbind, regcapturedmatches(msg, m)))
names(dd)<-c("Date","Time","Person","Message")
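If the helper script is not at hand, the same capture-group idea can be sketched with base R's regexec() and regmatches(). This is a substitute technique, not the helper used above; the pattern is rewritten in POSIX style ([0-9] instead of \\d) and the names chat, pattern, hits, fields, and chat_df are illustrative. Because the third group is greedy, Person runs up to the last ": " in the line, so this sketch assumes the message itself contains no ": ".

```r
chat <- c(
  "2014/07/06, 10:40:42 PM: Franckess \u00A9: I'll leave my student card with him",
  "2014/07/06, 10:38:34 PM: Viv M.: I can just fetch it from him"
)

# Four capture groups: date, time, person, message.
pattern <- "^([0-9/]+), ([0-9: APM]+): (.*): (.*)$"

# regexec() records the positions of the full match and of each group;
# regmatches() turns them into one character vector per input line.
hits <- regmatches(chat, regexec(pattern, chat))

# Drop the first element (the full match) and keep the four groups.
fields <- do.call(rbind, lapply(hits, function(x) x[-1]))

chat_df <- as.data.frame(fields, stringsAsFactors = FALSE)
names(chat_df) <- c("Date", "Time", "Person", "Message")
chat_df
```

Lines that do not match the pattern come back as empty vectors and are dropped by rbind(), so comparing nrow(chat_df) with length(chat) is a quick check for malformed input.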