I have obtained a fairly large database of conversations and would like to import the relevant information into R to run some statistical analyses.
The problem is that I don't need half of the information in each entry. Each line of a particular JSON file from the dataset concerns a specific conversation of the form A -> B -> A. The attributes supplied are contained in nested arrays for each corresponding utterance in the conversation. This is best shown with an example:
What I need is simply to extract the 'actual_sentence' attribute from each turn (turn_1, turn_2, turn_3, i.e. A -> B -> A) and discard the rest.
My efforts so far have been fruitless. I have been using the jsonlite package, which does seem to import the JSON, but without the depth of the tree needed to identify the specific attributes of each turn.
Here is an example of one line/record from the provided JSON-formatted .txt file:
{"semantic_distance_1": 0.375, "semantic_distance_2": 0.6486486486486487, "turn_2": "{\"sentence\": [\"A\", \"transmission\", \"?\"], \"script_filename\": \"Alien.txt\", \"postag\": [\"AT\", null, \".\"], \"semantic_set\": [\"infection.n.04\", \"vitamin_a.n.01\", \"angstrom.n.01\", \"transmittance.n.01\", \"transmission.n.05\", \"transmission.n.02\", \"transmission.n.01\", \"ampere.n.02\", \"adenine.n.01\", \"a.n.07\", \"a.n.06\", \"deoxyadenosine_monophosphate.n.01\"], \"additional_info\": [], \"original_sentence\": \"A transmission?\", \"actual_sentence\": \"A transmission?\", \"dependency_grammar\": null, \"actor\": \"standard\", \"sentence_type\": null, \"ner\": {}, \"turn_in_file\": 58}", "turn_3": "{\"sentence\": [\"A\", \"voice\", \"transmission\", \".\"], \"script_filename\": \"Alien.txt\", \"postag\": [\"AT\", \"NN\", null, \".\"], \"semantic_set\": [\"vitamin_a.n.01\", \"voice.n.10\", \"voice.n.09\", \"angstrom.n.01\", \"articulation.n.03\", \"deoxyadenosine_monophosphate.n.01\", \"a.n.07\", \"a.n.06\", \"infection.n.04\", \"spokesperson.n.01\", \"transmittance.n.01\", \"voice.n.02\", \"voice.n.03\", \"voice.n.01\", \"voice.n.06\", \"voice.n.07\", \"voice.n.05\", \"voice.v.02\", \"voice.v.01\", \"part.n.11\", \"transmission.n.05\", \"transmission.n.02\", \"transmission.n.01\", \"ampere.n.02\", \"adenine.n.01\"], \"additional_info\": [], \"original_sentence\": \"A voice transmission.\", \"actual_sentence\": \"A voice transmission.\", \"dependency_grammar\": null, \"actor\": \"computer\", \"sentence_type\": null, \"ner\": {}, \"turn_in_file\": 59}", "turn_1": "{\"sentence\": [\"I\", \"have\", \"intercepted\", \"a\", \"transmission\", \"of\", \"unknown\", \"origin\", \".\"], \"script_filename\": \"Alien.txt\", \"postag\": [\"PPSS\", \"HV\", \"VBD\", \"AT\", null, \"IN\", \"JJ\", \"NN\", \".\"], \"semantic_set\": [\"i.n.03\", \"own.v.01\", \"receive.v.01\", \"consume.v.02\", \"accept.v.02\", \"rich_person.n.01\", \"vitamin_a.n.01\", \"have.v.09\", 
\"have.v.07\", \"nameless.s.01\", \"have.v.01\", \"obscure.s.04\", \"have.v.02\", \"stranger.n.01\", \"angstrom.n.01\", \"induce.v.02\", \"hold.v.03\", \"wiretap.v.01\", \"give_birth.v.01\", \"a.n.07\", \"a.n.06\", \"deoxyadenosine_monophosphate.n.01\", \"infection.n.04\", \"unknown.n.03\", \"unknown.s.03\", \"get.v.03\", \"origin.n.03\", \"origin.n.02\", \"transmittance.n.01\", \"origin.n.05\", \"origin.n.04\", \"one.s.01\", \"have.v.17\", \"have.v.12\", \"have.v.10\", \"have.v.11\", \"take.v.35\", \"experience.v.03\", \"intercept.v.01\", \"unknown.n.01\", \"iodine.n.01\", \"strange.s.02\", \"suffer.v.02\", \"beginning.n.04\", \"one.n.01\", \"transmission.n.05\", \"transmission.n.02\", \"transmission.n.01\", \"ampere.n.02\", \"lineage.n.01\", \"unknown.a.01\", \"adenine.n.01\"], \"additional_info\": [], \"original_sentence\": \"I have intercepted a transmission of unknown origin.\", \"actual_sentence\": \"I have intercepted a transmission of unknown origin.\", \"dependency_grammar\": null, \"actor\": \"computer\", \"sentence_type\": null, \"ner\": {}, \"turn_in_file\": 57}", "syntax_distance_1": null, "syntax_distance_2": null}
As you can see, there is a great deal of information I don't need, and given my limited knowledge of R, importing it (along with the rest of the records contained in the file) in this format leads me to the following in R.
The command I used for this is:
json <- fromJSON(paste("[", paste(readLines("JSONfile.txt"), collapse = ","), "]"))
Essentially it picks up syntax_distance_1, syntax_distance_2, semantic_distance_1 and semantic_distance_2, but then lumps all of the turn data into three enormous, unstructured arrays.
What I would like to know is whether I can somehow:
OR
Hopefully this information is sufficient; let me know if there is anything else I can clarify.
Answer (score: 1)
Since in this case you know you need to go one level deeper, what you can do is use an apply function to parse the turn_x strings. The following code snippet illustrates the basic idea:
# Read the JSON file
library(jsonlite)
json_file <- fromJSON("JSONfile.json")

# Use lapply to parse the turn_x strings.
# Checking that the element is a character string avoids issues with
# numeric values and nulls; non-character fields are kept as-is.
pjson_file <- lapply(json_file, function(x) {
  if (is.character(x)) fromJSON(x) else x
})
If we look at the result, we can see that this time the entire data structure has been parsed. To access the actual_sentence field, you can do the following:
> pjson_file$turn_1$actual_sentence
[1] "I have intercepted a transmission of unknown origin."
> pjson_file$turn_2$actual_sentence
[1] "A transmission?"
> pjson_file$turn_3$actual_sentence
[1] "A voice transmission."
If you want to scale this logic up so it works on the full dataset, you can encapsulate it in a function that returns the three sentences as a character vector or a data frame.
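As a sketch of that idea, the helper below (the function name and file path are assumptions, not from the original) parses one NDJSON line, pulls out the three actual_sentence fields via a second fromJSON pass on each turn_x string, and returns them as a one-row data frame:

```r
library(jsonlite)

# Hypothetical helper: extract the three actual_sentence fields from one
# record (one line of the NDJSON file) as a one-row data frame.
extract_sentences <- function(line) {
  rec <- fromJSON(line)
  # Each turn_x field is itself a JSON string, so it needs a second parse
  turns <- lapply(rec[c("turn_1", "turn_2", "turn_3")], fromJSON)
  data.frame(
    turn_1 = turns$turn_1$actual_sentence,
    turn_2 = turns$turn_2$actual_sentence,
    turn_3 = turns$turn_3$actual_sentence,
    stringsAsFactors = FALSE
  )
}

# Apply it to every line of the file (path is an assumption) and bind
# the per-record rows into a single data frame.
if (file.exists("JSONfile.txt")) {
  sentences <- do.call(rbind, lapply(readLines("JSONfile.txt"), extract_sentences))
}
```

From there, `sentences$turn_1` gives you all first turns as a character vector, ready for analysis.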