How do I extract the date that precedes a specific string in unstructured data?

Time: 2017-03-25 17:01:25

Tags: r

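My unstructured text contains many dates, and I want to extract the date that appears before "Message". The output should be a new data frame with a single column of dates. The data I am seeing looks like this: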

21 March 2017 23:10:45 text1
21 March 2017 23:10:45  More text…..
21 March 2017 23:10:45 And more text …..
21 March 2017 23:10:45 some more text **Message:** more text 
22 March 2017 23:10:45 text1
22 March 2017 23:10:45  More text…..
22 March 2017 23:10:45 And more text …..
22 March 2017 23:10:45 some more text **Message:** more text 
23 March 2017 23:10:45 text1
23 March 2017 23:10:45  More text…..
23 March 2017 23:10:45 And more text …..
23 March 2017 23:10:45 some more text **Message:** more text 
24 March 2017 23:10:45 text1
24 March 2017 23:10:45  More text…..
24 March 2017 23:10:45 And more text …..
24 March 2017 23:10:45 some more text **Message:** more text 

1 answer:

Answer 0 (score: 3)

How about:

sub("(?<=\\d{4}).*", "", grep("Message", txt, value=TRUE), perl=TRUE)
# [1] "21 March 2017" "22 March 2017" "23 March 2017" "24 March 2017"

First, we use grep() to reduce txt to only the values that contain "Message", then we use sub() to remove all text after the first occurrence of four digits.
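The two steps described above can also be written out separately, with the result wrapped in a data frame as the question asks. This is only a minimal sketch: the shortened `txt` sample and the names `msg_lines`, `dates`, and `result` are illustrative assumptions, not part of the original answer.

```r
# A shortened sample in the same format as the data (assumed for illustration)
txt <- c(
  "21 March 2017 23:10:45 text1",
  "21 March 2017 23:10:45 some more text **Message:** more text ",
  "22 March 2017 23:10:45  More text",
  "22 March 2017 23:10:45 some more text **Message:** more text "
)

# Step 1: keep only the lines that contain "Message"
msg_lines <- grep("Message", txt, value = TRUE)

# Step 2: delete everything that follows the first run of four digits (the year);
# the lookbehind (?<=...) requires perl = TRUE
dates <- sub("(?<=\\d{4}).*", "", msg_lines, perl = TRUE)

# Step 3: put the extracted strings into a one-column data frame
result <- data.frame(date = dates, stringsAsFactors = FALSE)
print(result)
```

Note that this relies on the year being the first four-digit run on each line; if other four-digit numbers could appear before the year, the pattern would need to be tightened.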

Data:

txt <- readLines(textConnection("21 March 2017 23:10:45 text1
21 March 2017 23:10:45  More text…..
21 March 2017 23:10:45 And more text …..
21 March 2017 23:10:45 some more text **Message:** more text 
22 March 2017 23:10:45 text1
22 March 2017 23:10:45  More text…..
22 March 2017 23:10:45 And more text …..
22 March 2017 23:10:45 some more text **Message:** more text 
23 March 2017 23:10:45 text1
23 March 2017 23:10:45  More text…..
23 March 2017 23:10:45 And more text …..
23 March 2017 23:10:45 some more text **Message:** more text 
24 March 2017 23:10:45 text1
24 March 2017 23:10:45  More text…..
24 March 2017 23:10:45 And more text …..
24 March 2017 23:10:45 some more text **Message:** more text 
"))
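
The sub() call returns character strings such as "21 March 2017". If real Date values are wanted in the data frame, one possible follow-up is sketched below. It assumes the "day month year" layout seen in the sample and English month names, and it uses the built-in month.name constant rather than a locale-dependent format string; the variable names are hypothetical.

```r
# Strings in the form produced by the sub() call (assumed input)
dates <- c("21 March 2017", "22 March 2017")

# Split each string into day, month name, and year
parts <- strsplit(dates, " ", fixed = TRUE)
day   <- as.integer(vapply(parts, `[`, character(1), 1))
month <- match(vapply(parts, `[`, character(1), 2), month.name)  # "March" -> 3
year  <- as.integer(vapply(parts, `[`, character(1), 3))

# Rebuild ISO-formatted strings and convert them to Date
result <- data.frame(date = as.Date(sprintf("%04d-%02d-%02d", year, month, day)))
print(result)
```

Matching against month.name avoids depending on the session locale, which as.Date() with a "%B" format would do.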