Question

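I have unstructured text files from which I need to extract some data into a structured format. The data looks like this (each record spans multiple lines):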

21 March 2017 23:10:45 text 21 March 2017 23:10:45 more text ... . 21 March 2017 23:10:45 more text 21 March 2017 23:10:45 more text Message: more text1 more text2 more text3 more text4

22 March 2017 23:10:45 text 22 March 2017 23:10:45 more text ... . 23 March 2017 23:10:45 more text 23 March 2017 23:10:45 more text Message: more text1 more text2 more text3 more text4

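The code below extracts everything after the word "Message" (more text1, more text2, more text3, more text4) into a separate column. I would like to modify it so that it also includes the date that precedes the word "Message". Here is my code: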

#Read data
m <- readLines("C:/user...", skipNul=TRUE)

#remove special characters that might affect reading the data later:
m <- sapply(m, function(i) {
  b <- gsub("\032"," ",i)
  gsub("\t","",b)
})

#convert to one big character string
m <- paste(m, collapse="")

#since some entries span multiple lines, replace the date
#(which precedes each piece of information in the file) with a caret,
#then replace newline characters with blanks, then replace carets
#with newlines. At the end each entry will be on one line:

date_pattern <- "\\[[0-9]{2}\\-[A-Z]{1}[a-z]{2}\\-[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}"

m <- gsub(date_pattern, "^", m)
m <- gsub("\n","",m)
m <- gsub("\\^", "\n", m)
#split the single string back into one element per entry
m <- unlist(strsplit(m, "\n", fixed=TRUE))

#only keep lines with the word "Message"
m <- m[grep("Message",m)]
class(m) <- "character"
#remove the word "Message" and trim leading white space:
m <- sapply(strsplit(m,split = "Message", fixed=TRUE), function(i) (i[2]))
m <- trimws(m, which="left")

#write to file:
writeLines(m, "C:/user...")

The result of the code above is everything after the word "Message" (more text1, more text2, more text3, more text4) in a separate column.

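I need to modify the code above to add the date. Any suggestions? I was able to extract the dates on their own and tried to merge them with the extracted data using cbind, but I ended up with the day in one column, the month in a second, and the year in a third.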

Answer 1

Here is a Perl-style regex trick using greedy matching that may help you.

First, get some data to test with:

x <- "21 March 2017 23:10:45 text 21 March 2017 23:10:45 More text. 21 March 2017 23:10:45 And more text 21 March 2017 23:10:45 some more text Message: more text1 more text2 more text3 more text4"

Then define the date pattern (slightly different from the one above; note that the month is written out in full):

date_pattern <- "[0-9]{2} [A-Z]{1}[a-z]+ [0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}"
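Before building the full expression, it may be worth checking that the pattern finds every date in a record. A minimal sketch, assuming the dates look like the test data above (the sample string here is made up for illustration):

```r
# Hypothetical sample record with two different dates
x_check <- "21 March 2017 23:10:45 text 22 March 2017 23:10:45 More text. Message: more text1"
date_pattern <- "[0-9]{2} [A-Z]{1}[a-z]+ [0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}"

# gregexpr() locates all matches; regmatches() extracts them as a character vector
dates <- regmatches(x_check, gregexpr(date_pattern, x_check))[[1]]
print(dates)
```

If a date is missing from the result, the pattern probably needs adjusting (for example for one-digit days) before it is used inside gsub.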

Use gsub with backreferences to get what you want:

gsub(paste0("(.*)(", date_pattern , ")(.*)Message: (.*)"), "\\2  \\4", x)

which yields

"21 March 2017 23:10:45  more text1 more text2 more text3 more text4"
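To see why greedy matching matters here, consider a record whose dates differ. The sketch below (the sample string and variable names are made up for illustration) suggests that the leading greedy `(.*)` consumes as much as it can, so the second group ends up holding the last date before "Message:". A lazy `(.*?)` with `perl=TRUE` picks the first date instead.

```r
date_pattern <- "[0-9]{2} [A-Z]{1}[a-z]+ [0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}"
# Hypothetical record with two different dates before "Message:"
x2 <- "21 March 2017 23:10:45 text 22 March 2017 23:10:45 more text Message: more text1 more text2"

# Greedy leading group: group 2 is the LAST date before "Message:"
greedy_pattern <- paste0("(.*)(", date_pattern, ")(.*)Message: (.*)")
last_date <- gsub(greedy_pattern, "\\2", x2)

# Lazy leading group (needs perl=TRUE): group 2 is the FIRST date
lazy_pattern <- paste0("(.*?)(", date_pattern, ")(.*)Message: (.*)")
first_date <- gsub(lazy_pattern, "\\2", x2, perl=TRUE)

print(last_date)
print(first_date)
```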

You can insert something into the gsub replacement string in case you want a clearer separator between the date and the message.
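If the goal is a date column next to a message column, one possible approach is to apply the same idea once per column and combine the results in a data frame. This is only a sketch under the assumption that each record is already a single string; `records`, `full_pattern`, and `result` are names made up for illustration.

```r
# Hypothetical records, one string each; the second has no "Message:" part
records <- c(
  "21 March 2017 23:10:45 text 21 March 2017 23:10:45 more text Message: more text1 more text2",
  "22 March 2017 23:10:45 text without the keyword",
  "22 March 2017 23:10:45 text 23 March 2017 23:10:45 more text Message: more text3 more text4"
)
date_pattern <- "[0-9]{2} [A-Z]{1}[a-z]+ [0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}"

# Group 1: last date before "Message:"; group 2: everything after it
full_pattern <- paste0("^.*(", date_pattern, ").*Message: (.*)$")

# Keep only the records that contain both a date and "Message:"
records <- records[grepl(full_pattern, records)]

# One sub() call per column keeps the whole date in a single field
result <- data.frame(
  date = sub(full_pattern, "\\1", records),
  message = trimws(sub(full_pattern, "\\2", records)),
  stringsAsFactors = FALSE
)
print(result)
```

Because the date stays in one string, it does not get split into separate day, month, and year columns. The data frame could then be written out with write.table or write.csv using a separator that does not occur in the text.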
