I have a data frame:
ID Value
A [2020-05-09 15:21:28,457] [TRUE] [] [com.corp11.Consump] - incoming message: "received message by user1"
B [2020-05-10 12:41:59,497] [FALSE] [] [com.corp11.Consump] - incoming message: "received message by user2"
C [2020-05-11 14:41:49,487] [TRUE] [] [com.corp11.Consump] - "received message by user3"
D [2020-05-12 17:59:59,597] [TRUE] [] [com.corp11.Consump] - incoming message: "received message by user4"
I wrote some code with a regular expression to parse the Value column:
library(dplyr)  # provides %>% and mutate()

df <- df %>%
  # split Value into the bracketed fields, the message name and the message text
  tidyr::extract(col = "Value",
                 into = c("timestamp", "ms", "type", "web", "message", "message_text"),
                 regex = "^\\[(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}),(\\d+)\\] \\[(.*?)\\] \\[\\] \\[(.*?)\\] - (.*?): (?s:(.*))$",
                 remove = FALSE) %>%
  dplyr::mutate(
    timestamp = anytime::anytime(timestamp),
    ms = as.integer(ms)) %>%
  subset(select = -Value)
But I get:
ID timestamp           ms  type  web                message          message_text
A  2020-05-09 15:21:28 457 TRUE  com.corp11.Consump incoming message "received message by user1"
B  2020-05-10 12:41:59 497 FALSE com.corp11.Consump incoming message "received message by user2"
C  NA                  NA  NA    NA                 NA               NA
D  2020-05-12 17:59:59 597 TRUE  com.corp11.Consump incoming message "received message by user4"
As you can see, the third row is empty, because its Value has no "incoming message: " prefix before the quoted text. How can I write my regular expression with an "or" operator so that it handles both cases: when the message name is present and when it is missing? The ideal result would be:
ID timestamp           ms  type  web                message          message_text
A  2020-05-09 15:21:28 457 TRUE  com.corp11.Consump incoming message "received message by user1"
B  2020-05-10 12:41:59 497 FALSE com.corp11.Consump incoming message "received message by user2"
C  2020-05-11 14:41:49 487 TRUE  com.corp11.Consump NA               "received message by user3"
D  2020-05-12 17:59:59 597 TRUE  com.corp11.Consump incoming message "received message by user4"
What should I do?
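One direction that might work, sketched here under assumptions rather than as a confirmed solution: instead of an explicit "or", wrap the "<message name>: " part in an optional non-capturing group, `(?:(.*?): )?`, so that rows without the prefix still match. The sample data frame below is rebuilt from the rows shown in the question, and `parsed` and `pattern` are names introduced only for this illustration. `as.POSIXct()` stands in for `anytime::anytime()` to keep the sketch free of extra packages, and the `tz = "UTC"` argument is an assumption. Depending on the tidyr version, an unmatched optional group may come back as NA or as an empty string, so the sketch normalises it with `dplyr::na_if()`.

```r
library(dplyr)
library(tidyr)

# Sample data rebuilt from the rows in the question
df <- data.frame(
  ID = c("A", "B", "C", "D"),
  Value = c(
    '[2020-05-09 15:21:28,457] [TRUE] [] [com.corp11.Consump] - incoming message: "received message by user1"',
    '[2020-05-10 12:41:59,497] [FALSE] [] [com.corp11.Consump] - incoming message: "received message by user2"',
    '[2020-05-11 14:41:49,487] [TRUE] [] [com.corp11.Consump] - "received message by user3"',
    '[2020-05-12 17:59:59,597] [TRUE] [] [com.corp11.Consump] - incoming message: "received message by user4"'
  ),
  stringsAsFactors = FALSE
)

# Same pattern as in the question, except that "<message name>: " is now
# inside an optional non-capturing group: (?:(.*?): )?
pattern <- paste0(
  "^\\[(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}),(\\d+)\\] ",
  "\\[(.*?)\\] \\[\\] \\[(.*?)\\] - ",
  "(?:(.*?): )?(?s:(.*))$"
)

parsed <- df %>%
  tidyr::extract(col = "Value",
                 into = c("timestamp", "ms", "type", "web", "message", "message_text"),
                 regex = pattern) %>%
  dplyr::mutate(
    # an unmatched optional group may be "" or NA; make it NA in both cases
    message = dplyr::na_if(message, ""),
    # stand-in for anytime::anytime(); the time zone is an assumption
    timestamp = as.POSIXct(timestamp, format = "%Y-%m-%d %H:%M:%S", tz = "UTC"),
    ms = as.integer(ms))

print(parsed)
```

One caveat with this approach: if a row has no message name but its text happens to contain ": ", the part before the first ": " would be captured as the message name, so the pattern may need tightening for such data.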