How to parse nested key-value pairs in R

Time: 2014-07-18 10:09:20

Tags: r logging

I am trying to parse a log file whose lines are structured as key-value pairs.

log <-  c("name:praveen,age:23,place:UP,address:,dob:, site: {site_name:something , site_url: http://something.com, description:}")  

I have made some progress parsing this line, but I have two main problems.

1: How do I parse the "site" key (shown above), given that its value itself contains multiple key:value pairs?

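2: How do I handle the case where the delimiter also appears inside a value? The key:value delimiter is a colon (:), but the "site" key contains the pair site_url:http://something.com, whose URL also contains a colon (:), so splitting on the colon gives the wrong answer.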

Here is my code. It does not include the "site" key, because I do not know how to parse it.

    log <-  c("name:praveen,age:23,place:UP,address:,dob:")  
    names <- setNames(1:5,c("name","age","place","address","dob"))

    assign <- function(x, names){     
      key_value <- sapply(x, function(i)if(length(i)==2L) i else c(i, "nothing"))
      z <- rep(NA, length(names))
      z[names[key_value[1, ]]] <-  key_value[2, ]
      z
    }

    split_by_comma <- strsplit(log,",")
    split_by_colon <- lapply(split_by_comma,strsplit,":")    
    ret <- t(sapply(split_by_colon, assign, names))
    colnames(ret) <- names(names)
    ret

Please help me parse this file. Thanks.

I have updated the question with the actual log file format.

{
    "username": "lavita",
    "host": "10.105.22.32",
    "event_source": "server",
    "event_type": "/courses/IITB/CS101/2014_T1/xblock/i4x:;_;_IITB;_CS101;_video;_d333fa637a074b41996dc2fd5e675818/handler/xmodule_handler/save_user_state",
    "context": {
        "course_id": "IITB/CS101/2014_T1",
        "course_user_tags": {},
        "user_id": 42,
        "org_id": "IITB"
    },
    "time": "2014-06-20T05:49:10.468638+00:00",
    "ip": "127.0.0.1",
    "event": "{\"POST\": {\"saved_video_position\": [\"00:02:10\"]}, \"GET\": {}}",
    "agent": "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:18.0) Gecko/20100101 Firefox/18.0",
    "page": null
}

{
    "username": "raeha",
    "host": "10.105.22.32",
    "event_source": "server",
    "event_type": "problem_check",
    "context": {
        "course_id": "IITB/CS101/2014_T1",
        "course_user_tags": {},
        "user_id": 40,
        "org_id": "IITB",
        "module": {
            "display_name": ""
        }
    },
    "time": "2014-06-20T06:43:52.716455+00:00",
    "ip": "127.0.0.1",
    "event": {
        "submission": {
            "i4x-IITB-CS101-problem-33e4aac93dc84f368c93b1d08fa984fc_2_1": {
                "input_type": "choicegroup",
                "question": "",
                "response_type": "multiplechoiceresponse",
                "answer": "MenuInflater.inflate()",
                "variant": "",
                "correct": true
            }
        },
        "success": "correct",
        "grade": 1,
        "correct_map": {
            "i4x-IITB-CS101-problem-33e4aac93dc84f368c93b1d08fa984fc_2_1": {
                "hint": "",
                "hintmode": null,
                "correctness": "correct",
                "npoints": null,
                "msg": "",
                "queuestate": null
            }
        },
        "state": {
            "student_answers": {},
            "seed": 1,
            "done": null,
            "correct_map": {},
            "input_state": {
                "i4x-IITB-CS101-problem-33e4aac93dc84f368c93b1d08fa984fc_2_1": {}
            }
        },
        "answers": {
            "i4x-IITB-CS101-problem-33e4aac93dc84f368c93b1d08fa984fc_2_1": "choice_0"
        },
        "attempts": 1,
        "max_grade": 1,
        "problem_id": "i4x://IITB/CS101/problem/33e4aac93dc84f368c93b1d08fa984fc"
    },
    "agent": "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:29.0) Gecko/20100101 Firefox/29.0",
    "page": "x_module"
}


{
    "username": "tushars",
    "host": "localhost",
    "event_source": "server",
    "event_type": "/courses/IITB/CS101/2014_T1/instructor_dashboard/api/list_instructor_tasks",
    "context": {
        "course_id": "IITB/CS101/2014_T1",
        "course_user_tags": {},
        "user_id": 6,
        "org_id": "IITB"
    },
    "time": "2014-06-20T05:49:26.780244+00:00",
    "ip": "127.0.0.1",
    "event": "{\"POST\": {}, \"GET\": {}}",
    "agent": "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:29.0) Gecko/20100101 Firefox/29.0",
    "page": null
}

1 answer:

Answer 0 (score: 0)

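That is a pretty ugly format. True JSON would have quoted strings and non-empty values, so the original key:value line is not really a standard format. The method here is equally ugly, but it can handle multiply nested elements.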
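The updated sample in the question, by contrast, does look like standard JSON, so a JSON parser is probably a better fit for it than anything hand-written. Below is a minimal sketch, assuming the jsonlite package is installed and that each record is available as its own string. The `record` value is an abbreviated, hand-copied excerpt of the first sample record, not a real log line.

```r
library(jsonlite)  # assumption: jsonlite is installed

# Abbreviated copy of one record from the question (illustrative only)
record <- '{
  "username": "lavita",
  "context": {"course_id": "IITB/CS101/2014_T1", "user_id": 42},
  "event": "{\\"POST\\": {\\"saved_video_position\\": [\\"00:02:10\\"]}, \\"GET\\": {}}",
  "page": null
}'

# simplifyVector = FALSE keeps everything as plain nested lists
parsed <- fromJSON(record, simplifyVector = FALSE)

# In the sample, "event" is sometimes a JSON-encoded string and sometimes
# an object; decode it a second time only when it is a string
if (is.character(parsed$event)) {
  parsed$event <- fromJSON(parsed$event, simplifyVector = FALSE)
}

course   <- parsed$context$course_id
position <- parsed$event$POST$saved_video_position[[1]]
```

Splitting a file that holds several records into individual strings is left out here, since it depends on how the file is actually laid out. The rest of this answer deals with the original, non-JSON format.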

I used this string as a test case for the original (non-JSON) format:

log <-  paste0("name:{first:praveen,last:smith},age:23,place:UP,address:,",
"dob:, site: {site_name:something , site_url: http://something.com, ",
"description:{english:woot,spanish:wooto}}")

Here is the parser:

parseString <- function(log) {
    nested <- c()
    # Find the leftmost innermost {} block (one with no braces inside),
    # replace it with a "~~id~~" placeholder, and repeat until none remain.
    # regexpr() is used throughout so exactly one block is handled per pass.
    pattern <- "\\{[^}{]+\\}"
    m <- regexpr(pattern, log)
    while (m != -1) {
        s <- gsub("^\\{|\\}$", "", regmatches(log, m))
        regmatches(log, m) <- paste0("~~", length(nested) + 1, "~~")
        nested <- c(nested, s)
        m <- regexpr(pattern, log)
    }
    nested <- c(nested, log)

    # Turn each collapsed block into a named list
    nestedl <- vector("list", length(nested))
    for (i in seq_along(nested)) {
        kv <- strsplit(nested[i], "\\s*,\\s*")[[1]]
        # Split on ":" but re-join everything after the first colon,
        # so values such as URLs keep their own colons
        kv <- lapply(strsplit(kv, ":"), function(x)
            c(x[1], paste(x[-1], collapse = ":")))
        names <- gsub("\\s+", "", sapply(kv, `[`, 1))
        vals <- gsub("\\s+", "", sapply(kv, `[`, 2))
        valsl <- setNames(as.list(vals), names)
        # Swap each placeholder for the sub-list that was parsed earlier
        m <- regexec("~~(\\d+)~~", vals)
        for (j in which(sapply(m, `[`, 1) != -1)) {
            valsl[[j]] <- nestedl[[as.numeric(regmatches(vals[j], m[j])[[1]][2])]]
        }
        nestedl[[i]] <- valsl
    }
    nestedl[[length(nestedl)]]
}
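To make the first loop of the parser easier to follow, here is a sketch of just the placeholder-substitution step in isolation. The helper name `collapseBlocks` and the sample string are made up for illustration and are not part of the parser itself. The helper takes one block per pass with `regexpr`.

```r
# Hypothetical helper: performs only the "{...}" -> "~~id~~" substitution
collapseBlocks <- function(x) {
    blocks <- character(0)
    pattern <- "\\{[^}{]+\\}"   # a {...} block with no braces inside it
    m <- regexpr(pattern, x)
    while (m != -1) {
        # Store the block's contents without the surrounding braces
        inner <- gsub("^\\{|\\}$", "", regmatches(x, m))
        blocks <- c(blocks, inner)
        # Replace the block in the string with its numeric placeholder
        regmatches(x, m) <- paste0("~~", length(blocks), "~~")
        m <- regexpr(pattern, x)
    }
    list(top = x, blocks = blocks)
}

res <- collapseBlocks("a:1,b:{c:2,d:{e:http://x.org}},f:{g:3}")
res$top
# [1] "a:1,b:~~2~~,f:~~3~~"
res$blocks
# [1] "e:http://x.org" "c:2,d:~~1~~"    "g:3"
```

Because an inner block is always collapsed before the block that contains it, a placeholder only ever refers to a block with a smaller id, which is why the second loop can resolve the references in a single forward pass.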

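So the strategy is to find "{}" blocks and collapse them into a simple string that we can find again later; in this case I use "~~1~~", where the number in the middle is a unique ID for each block. I keep doing this until only a single set of name-value pairs is left. Then I go back, look for all the "~~" values, and merge in the correct sub-lists. For this test data, I get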

#parseString(log)
$name
    $name$first
    [1] "praveen"
    $name$last
    [1] "smith"
$age
[1] "23"
$place
[1] "UP"
$address
[1] ""
$dob
[1] ""
$site
    $site$site_name
    [1] "something"
    $site$site_url
    [1] "http://something.com"
    $site$description
        $site$description$english
        [1] "woot"
        $site$description$spanish
        [1] "wooto"
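If a flat, one-row table is wanted (closer to the `ret` matrix in the question), the nested list can be flattened with base R's `unlist`, which joins nested names with dots. This is a minimal sketch; the `parsed` list below is a hand-written, shortened stand-in for the parser's result, not actual output.

```r
# Hand-written stand-in for a parsed log line (shortened for illustration)
parsed <- list(
    name = list(first = "praveen", last = "smith"),
    age  = "23",
    site = list(site_url    = "http://something.com",
                description = list(english = "woot"))
)

# unlist() walks the nesting and builds names such as "site.description.english"
flat <- unlist(parsed)
names(flat)
# [1] "name.first"  "name.last"  "age"  "site.site_url"  "site.description.english"

# One row per log line; every value is still a character string
row <- as.data.frame(as.list(flat), stringsAsFactors = FALSE)
```

Rows built this way can only be stacked with `rbind` when every log line yields the same set of keys, so lines with missing or extra keys would need to be aligned first.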