Splitting a very complex string with R

Time: 2014-07-07 13:44:24

Tags: r string-split

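I have 50,000 lines of conversation between two people. A typical line looks like this:

2014/07/06, 10:40:42 PM: Franckess ©: I'll leave my student card with him

I want to split each line so that I get a data frame with the following variables: Date, Time, and Message. I would like to do this in R (no Excel), and then go on to do sentiment analysis.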

Can anyone help me?

3 Answers:

Answer 0 (score: 4)

Would something like this work?

> string <- 
     "2014/07/06, 10:40:42 PM: Franckess ©: I'll leave my student card with him"
> s <- strsplit(string, "(, )|(: )|( [[:print:]]: )")[[1]]
> names(s) <- c("Date", "Time", "Person", "Message")
> data.frame(as.list(s))
#         Date        Time    Person                             Message
# 1 2014/07/06 10:40:42 PM Franckess I'll leave my student card with him

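In the strsplit regular expression "(, )|(: )|( \\xA9: )", we have

  • (, ) looks for a comma followed by a space
  • | or
  • (: ) a colon followed by a space
  • | or
  • ( \\xA9: ) a space, then the copyright symbol, then a colon and another space

The first snippet uses [[:print:]] in place of \\xA9, which matches any single printable character in that position.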

To build a data frame from multiple strings, you need something like do.call with rbind to bring all the split pieces together.

> dc <- do.call(rbind, strsplit(string, "(, )|(: )|( \\xA9: )"))
> colnames(dc) <- c("Date", "Time", "Person", "Message")
> as.data.frame(dc)
#         Date        Time    Person                             Message
# 1 2014/07/06 10:40:42 PM Franckess I'll leave my student card with him
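One caveat when running this over all 50,000 lines: the pattern splits at every ", " or ": ", including any that occur inside the message text. The following is a minimal sketch of a sanity check, not part of the original answer. The sample lines are hypothetical, and the © is left out to keep the example independent of the session's encoding.

```r
# Hypothetical sample lines; the second message itself contains ", "
lines <- c(
  "2014/07/06, 10:40:42 PM: Franckess: I'll leave my student card with him",
  "2014/07/06, 10:38:34 PM: Viv M.: Ok, I can just fetch it from him"
)

# Split every line on ", " or ": "
parts <- strsplit(lines, "(, )|(: )")

# Count the pieces per line; a well-formed line yields exactly 4
n_parts <- vapply(parts, length, integer(1))
ok <- n_parts == 4

# Bind only the well-formed lines into a data frame
df <- as.data.frame(do.call(rbind, parts[ok]), stringsAsFactors = FALSE)
names(df) <- c("Date", "Time", "Person", "Message")

# Lines that need a stricter pattern or manual handling
lines[!ok]
```

Without such a check, rbind would recycle the shorter rows to the length of the longest one and the columns would no longer line up.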

Answer 1 (score: 1)

You could also do this:

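  library(stringr)
  str1 <- c("2014/07/06, 10:40:42 PM: Franckess ©: I'll leave my student card with him", "2014/07/06, 10:38:34 PM: Viv M.: I can just fetch it from him")

  # Replace every ":" that is followed by a space with ",".
  # Older stringr versions needed the pattern wrapped in perl(), which has
  # since been deprecated in favor of regex(); lookahead works in a plain pattern.
  str2 <- str_replace_all(str1, ":(?= )", ",")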

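Explanation

:(?= ) matches a : that is followed by a space (the lookahead does not consume the space), and each match is replaced with a comma, so ": " becomes ", ".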

  # Split on ", " and bind the pieces into a data frame
  setNames(
    as.data.frame(do.call(rbind, str_split(str2, ", ")), stringsAsFactors = FALSE),
    c("Date", "Time", "Person", "Message")
  )
  #         Date        Time      Person                             Message
  # 1 2014/07/06 10:40:42 PM Franckess © I'll leave my student card with him
  # 2 2014/07/06 10:38:34 PM      Viv M.        I can just fetch it from him
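For readers who prefer not to load stringr, the same lookahead substitution can be sketched in base R with gsub and perl = TRUE. This is an illustrative equivalent, not the answerer's code; the variable names x, y and pieces are arbitrary.

```r
x <- "2014/07/06, 10:38:34 PM: Viv M.: I can just fetch it from him"

# Replace each ":" that is followed by a space; the colons inside the
# time (10:38:34) are left alone because no space follows them
y <- gsub(":(?= )", ",", x, perl = TRUE)

# Split on the now-uniform ", " separator
pieces <- strsplit(y, ", ", fixed = TRUE)[[1]]
pieces
```

The same limitation applies as with the stringr version: a message that itself contains ": " or ", " will be split into more than four pieces.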

Update

You can also use:

  read.csv(text=str2, sep=",",header=F,stringsAsFactors=F)
 #          V1           V2           V3                                   V4
 #1 2014/07/06  10:40:42 PM  Franckess ©  I'll leave my student card with him
 #2 2014/07/06  10:38:34 PM       Viv M.         I can just fetch it from him
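Note that the read.csv output above keeps the space that follows each comma as a leading space in V2 to V4. A possible refinement, sketched here under the assumption that the lines have already been converted to the ", "-separated form, is to pass strip.white = TRUE and name the columns directly. Setting quote = "" is an extra precaution so that quotation marks inside messages are read literally. The sample strings omit the © for simplicity.

```r
# Hypothetical input, already in the comma-separated form
str2 <- c(
  "2014/07/06, 10:40:42 PM, Franckess, I'll leave my student card with him",
  "2014/07/06, 10:38:34 PM, Viv M., I can just fetch it from him"
)

df <- read.csv(
  text = str2, sep = ",", header = FALSE,
  stringsAsFactors = FALSE,
  strip.white = TRUE,  # drop the space left after each comma
  quote = "",          # treat quote characters in messages literally
  col.names = c("Date", "Time", "Person", "Message")
)
df
```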

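Answer 2 (score: 0)

When I parse structured strings, I like to use a regular expression with capture groups and then extract the captured sub-matches with the regcapturedmatch.R helper function. Note that regcapturedmatches() is not part of base R, so the helper has to be sourced before the code below will run. For example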

msg<-c("2014/07/06, 10:40:42 PM: Franckess ©: I'll leave my student card with him", "2014/07/06, 10:38:34 PM: Viv M.: I can just fetch it from him")

m<-regexpr("([\\d/]+), ([\\d: AMP]+): (.*): (.*)$",msg, perl=T)

dd<-data.frame(do.call(rbind, regcapturedmatches(msg, m)))

names(dd)<-c("Date","Time","Person","Message")
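If sourcing an external helper is not an option, a similar capture-group approach can be sketched with base R's regexec and regmatches. This is a swapped-in alternative rather than the answerer's method, and the pattern is an assumption about the line format: it is stricter about the time and takes the person to be everything up to the next colon, so a name such as "Viv M." is kept whole. The sample lines omit the ©.

```r
msg <- c(
  "2014/07/06, 10:40:42 PM: Franckess: I'll leave my student card with him",
  "2014/07/06, 10:38:34 PM: Viv M.: I can just fetch it from him"
)

# Four capture groups: date, time, person, message
pattern <- "^([0-9/]+), ([0-9]{1,2}:[0-9]{2}:[0-9]{2} [AP]M): ([^:]+): (.*)$"

# Each list element holds the full match followed by the four groups
m <- regmatches(msg, regexec(pattern, msg))

# Drop the full-match column and build the data frame
mat <- do.call(rbind, m)
dd <- as.data.frame(mat[, -1, drop = FALSE], stringsAsFactors = FALSE)
names(dd) <- c("Date", "Time", "Person", "Message")

# Convert the date column to a proper Date
dd$Date <- as.Date(dd$Date, format = "%Y/%m/%d")
dd
```

Because the message is captured by the final (.*), any further ": " or ", " inside the message stays in the Message column. Lines that do not match the pattern come back as empty vectors from regmatches and should be checked before binding.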