I am trying to parse a log file whose contents are structured as key-value pairs.
log <- c("name:praveen,age:23,place:UP,address:,dob:, site: {site_name:something , site_url: http://something.com, description:}")
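I am trying to parse this line. I have made some progress, but I have two main problems.
1: How do I parse the "site" variable (shown above), given that the site key itself contains multiple key:value pairs?
2: How do I handle the case where the delimiter appears inside a value? For example, the key:value delimiter is a colon (:), but the "site" key contains the pair site_url: http://something.com
Here the URL also contains a colon (:), so the split gives the wrong answer.
Here is my code. It does not handle the "site" key because I do not know how to parse it.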
log <- c("name:praveen,age:23,place:UP,address:,dob:")
names <- setNames(1:5,c("name","age","place","address","dob"))
assign <- function(x, names){
key_value <- sapply(x, function(i)if(length(i)==2L) i else c(i, "nothing"))
z <- rep(NA, length(names))
z[names[key_value[1, ]]] <- key_value[2, ]
z
}
split_by_comma <- strsplit(log,",")
split_by_colon <- lapply(split_by_comma,strsplit,":")
ret <- t(sapply(split_by_colon, assign, names))
colnames(ret) <- names(names)
ret
Please help me parse this file. Thanks.
I have updated the question with the actual log file format.
{
"username": "lavita",
"host": "10.105.22.32",
"event_source": "server",
"event_type": "/courses/IITB/CS101/2014_T1/xblock/i4x:;_;_IITB;_CS101;_video;_d333fa637a074b41996dc2fd5e675818/handler/xmodule_handler/save_user_state",
"context": {
"course_id": "IITB/CS101/2014_T1",
"course_user_tags": {},
"user_id": 42,
"org_id": "IITB"
},
"time": "2014-06-20T05:49:10.468638+00:00",
"ip": "127.0.0.1",
"event": "{\"POST\": {\"saved_video_position\": [\"00:02:10\"]}, \"GET\": {}}",
"agent": "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:18.0) Gecko/20100101 Firefox/18.0",
"page": null
}
{
"username": "raeha",
"host": "10.105.22.32",
"event_source": "server",
"event_type": "problem_check",
"context": {
"course_id": "IITB/CS101/2014_T1",
"course_user_tags": {},
"user_id": 40,
"org_id": "IITB",
"module": {
"display_name": ""
}
},
"time": "2014-06-20T06:43:52.716455+00:00",
"ip": "127.0.0.1",
"event": {
"submission": {
"i4x-IITB-CS101-problem-33e4aac93dc84f368c93b1d08fa984fc_2_1": {
"input_type": "choicegroup",
"question": "",
"response_type": "multiplechoiceresponse",
"answer": "MenuInflater.inflate()",
"variant": "",
"correct": true
}
},
"success": "correct",
"grade": 1,
"correct_map": {
"i4x-IITB-CS101-problem-33e4aac93dc84f368c93b1d08fa984fc_2_1": {
"hint": "",
"hintmode": null,
"correctness": "correct",
"npoints": null,
"msg": "",
"queuestate": null
}
},
"state": {
"student_answers": {},
"seed": 1,
"done": null,
"correct_map": {},
"input_state": {
"i4x-IITB-CS101-problem-33e4aac93dc84f368c93b1d08fa984fc_2_1": {}
}
},
"answers": {
"i4x-IITB-CS101-problem-33e4aac93dc84f368c93b1d08fa984fc_2_1": "choice_0"
},
"attempts": 1,
"max_grade": 1,
"problem_id": "i4x://IITB/CS101/problem/33e4aac93dc84f368c93b1d08fa984fc"
},
"agent": "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:29.0) Gecko/20100101 Firefox/29.0",
"page": "x_module"
}
{
"username": "tushars",
"host": "localhost",
"event_source": "server",
"event_type": "/courses/IITB/CS101/2014_T1/instructor_dashboard/api/list_instructor_tasks",
"context": {
"course_id": "IITB/CS101/2014_T1",
"course_user_tags": {},
"user_id": 6,
"org_id": "IITB"
},
"time": "2014-06-20T05:49:26.780244+00:00",
"ip": "127.0.0.1",
"event": "{\"POST\": {}, \"GET\": {}}",
"agent": "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:29.0) Gecko/20100101 Firefox/29.0",
"page": null
}
Answer 0 (score: 0)
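This is a pretty ugly format. True JSON would quote its strings and would not have empty values, so this is not really a standard format. The method here is equally ugly, but it can handle multiple nested elements.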
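If the real log is properly quoted JSON, like the updated sample in the question, an off-the-shelf JSON parser may be simpler than a hand-rolled one. The following is only a sketch under that assumption: it relies on the jsonlite package (not used elsewhere in this answer), and `record` is a shortened, made-up record modelled on the sample, with one JSON object per string.

```r
# Assumption: the jsonlite package is installed.
library(jsonlite)

# Shortened record modelled on the sample log (hypothetical test input).
record <- '{"username": "lavita", "context": {"course_id": "IITB/CS101/2014_T1", "user_id": 42}, "event": "{\\"POST\\": {}, \\"GET\\": {}}", "page": null}'

parsed <- fromJSON(record)
parsed$username          # top-level field
parsed$context$user_id   # nested field

# In this record the "event" field is itself a JSON string,
# so it needs a second parsing pass.
event <- fromJSON(parsed$event)
names(event)
```

The rest of this answer deals with the unquoted key:value format from the original question.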
I'll use this as a test case
log <- paste0("name:{first:praveen,last:smith},age:23,place:UP,address:,",
"dob:, site: {site_name:something , site_url: http://something.com, ",
"description:{english:woot,spanish:wooto}}")
Here is the parser
parseString <- function(log) {
  nested <- c()
  # Find the innermost {} blocks one at a time and replace each with a
  # "~~id~~" placeholder. Using regexpr() throughout means every block gets
  # its own ID, even when several blocks sit at the same nesting level.
  pattern <- "\\{[^}{]+\\}"
  m <- regexpr(pattern, log)
  while (m[1] != -1) {
    s <- gsub("^\\{|\\}$", "", regmatches(log, m))
    regmatches(log, m) <- paste0("~~", length(nested) + 1, "~~")
    nested <- c(nested, s)
    m <- regexpr(pattern, log)
  }
  nested <- c(nested, log)
  # Turn each collapsed block into a named list
  nestedl <- vector("list", length(nested))
  for (i in seq_along(nested)) {
    kv <- strsplit(nested[i], "\\s*,\\s*")[[1]]
    # Split on ":" but re-join everything after the key, so values that
    # contain ":" (such as URLs) stay intact
    kv <- lapply(strsplit(kv, ":"), function(x)
      c(x[1], paste(x[-1], collapse = ":")))
    names <- gsub("\\s+", "", sapply(kv, `[`, 1))
    vals <- gsub("\\s+", "", sapply(kv, `[`, 2))
    valsl <- setNames(as.list(vals), names)
    # Swap each placeholder for the sub-list that was parsed earlier
    m <- regexec("~~(\\d+)~~", vals)
    for (j in which(sapply(m, `[`, 1) != -1)) {
      valsl[[j]] <- nestedl[[as.numeric(regmatches(vals[j], m[j])[[1]][2])]]
    }
    nestedl[[i]] <- valsl
  }
  nestedl[[length(nestedl)]]
}
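So the strategy is to find "{}" blocks and collapse each one into a simple string that we can find again later; in this case I use "~~1~~", where the number in the middle is a unique ID for each block. I keep doing this until I'm left with a single set of name-value pairs. Then I go back, look for all the "~~" values, and merge in the correct sub-list. For this test data, I get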
#parseString(log)
$name
$name$first
[1] "praveen"
$name$last
[1] "smith"
$age
[1] "23"
$place
[1] "UP"
$address
[1] ""
$dob
[1] ""
$site
$site$site_name
[1] "something"
$site$site_url
[1] "http://something.com"
$site$description
$site$description$english
[1] "woot"
$site$description$spanish
[1] "wooto"
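The parser above copes with values that themselves contain a colon, such as the URL, by splitting on ":" and then pasting everything after the first piece back together. Here is a minimal sketch of that single step in isolation; `split_first_colon` is a hypothetical helper name used only for illustration and is not part of the parser.

```r
# Hypothetical helper: split "key:value" on the first colon only,
# so colons inside the value (for example in a URL) are preserved.
split_first_colon <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)[[1]]
  key   <- trimws(parts[1])
  # Re-join everything after the key; an empty value becomes ""
  value <- trimws(paste(parts[-1], collapse = ":"))
  c(key = key, value = value)
}

split_first_colon("site_url: http://something.com")
# key is "site_url", value is "http://something.com"

split_first_colon("address:")
# key is "address", value is ""
```

Note that strsplit() drops a trailing empty piece, so a value that ends in a colon would lose that final colon with this approach.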