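I'm gluing together a number of system calls to the Amazon Elastic MapReduce command-line tools. These commands return JSON text that has already been (partially?) escaped. Then, when the system call turns that output into an R text object (intern=T), it appears to get escaped again. I need to clean it up so that it parses with the rjson package.
I make the system call like this: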
system("~/EMR/elastic-mapreduce --describe --jobflow j-2H9P770Z4B8GG", intern=T)
which returns:
[1] "{"
[2] " \"JobFlows\": ["
[3] " {"
[4] " \"LogUri\": \"s3n:\\/\\/emrlogs\\/\","
[5] " \"Name\": \"emrFromR\","
[6] " \"BootstrapActions\": ["
...
But the same command run at the command line returns:
{
"JobFlows": [
{
"LogUri": "s3n:\/\/emrlogs\/",
"Name": "emrFromR",
"BootstrapActions": [
{
"BootstrapActionConfig": {
...
If I try to run the result of the system call through rjson, I get this error:
Error: '\/' is an unrecognized escape in character string starting "s3n:\/"
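I believe this is because of the double escaping in the s3n lines. I'm struggling to massage this text into something that will parse.
It may be as simple as replacing "\\" with "\", but since I tend to struggle with regular expressions and escaping, I can't get it right.
So how do I take a vector of strings and replace every occurrence of "\\" with "\"? (Even to type this question I had to use three backslashes to represent two.) Any other tips for this specific use case?
Here is my code in more detail: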
> library(rjson)
> emrJson <- paste(system("~/EMR/elastic-mapreduce --describe --jobflow j-2H9P770Z4B8GG", intern=T))
>
> parser <- newJSONParser()
> for (i in 1:length(emrJson)){
+ parser$addData(emrJson[i])
+ }
>
> parser$getObject()
Error: '\/' is an unrecognized escape in character string starting "s3n:\/"
If you want to recreate the emrJson object, here is the dput() output:
> dput(emrJson)
c("{", " \"JobFlows\": [", " {", " \"LogUri\": \"s3n:\\/\\/emrlogs\\/\",",
" \"Name\": \"emrFromR\",", " \"BootstrapActions\": [",
" {", " \"BootstrapActionConfig\": {", " \"Name\": \"Bootstrap 0\",",
" \"ScriptBootstrapAction\": {", " \"Path\": \"s3:\\/\\/rtmpfwblrx\\/bootstrap.sh\",",
" \"Args\": []", " }", " }",
" }", " ],", " \"ExecutionStatusDetail\": {",
" \"EndDateTime\": 1278124414.0,", " \"CreationDateTime\": 1278123795.0,",
" \"LastStateChangeReason\": \"Steps completed\",", " \"State\": \"COMPLETED\",",
" \"StartDateTime\": 1278124000.0,", " \"ReadyDateTime\": 1278124237.0",
" },", " \"Steps\": [", " {", " \"StepConfig\": {",
" \"ActionOnFailure\": \"CANCEL_AND_WAIT\",", " \"Name\": \"Example Streaming Step\",",
" \"HadoopJarStep\": {", " \"MainClass\": null,",
" \"Jar\": \"\\/home\\/hadoop\\/contrib\\/streaming\\/hadoop-0.18-streaming.jar\",",
" \"Args\": [", " \"-input\",", " \"s3n:\\/\\/rtmpfwblrx\\/stream.txt\",",
" \"-output\",", " \"s3n:\\/\\/rtmpfwblrxout\\/\",",
" \"-mapper\",", " \"s3n:\\/\\/rtmpfwblrx\\/mapper.R\",",
" \"-reducer\",", " \"cat\",",
" \"-cacheFile\",", " \"s3n:\\/\\/rtmpfwblrx\\/emrData.RData#emrData.RData\"",
" ],", " \"Properties\": []", " }",
" },", " \"ExecutionStatusDetail\": {", " \"EndDateTime\": 1278124322.0,",
" \"CreationDateTime\": 1278123795.0,", " \"LastStateChangeReason\": null,",
" \"State\": \"COMPLETED\",", " \"StartDateTime\": 1278124232.0",
" }", " }", " ],", " \"JobFlowId\": \"j-2H9P770Z4B8GG\",",
" \"Instances\": {", " \"Ec2KeyName\": \"JL 09282009\",",
" \"InstanceCount\": 2,", " \"Placement\": {",
" \"AvailabilityZone\": \"us-east-1d\"", " },",
" \"KeepJobFlowAliveWhenNoSteps\": false,", " \"SlaveInstanceType\": \"m1.small\",",
" \"MasterInstanceType\": \"m1.small\",", " \"MasterPublicDnsName\": \"ec2-174-129-70-89.compute-1.amazonaws.com\",",
" \"MasterInstanceId\": \"i-2147b84b\",", " \"InstanceGroups\": null,",
" \"HadoopVersion\": \"0.18\"", " }", " }", " ]",
"}")
Answer 0 (score: 2)
The general rule seems to be to use twice as many backslashes as you think you need (I can't find the source right now).
emrJson <- gsub("\\\\", "\\", emrJson)
parser <- newJSONParser()
for (i in 1:length(emrJson)){
parser$addData(emrJson[i])
}
parser$getObject()
This worked here using your dput() output.
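If the regex escaping is the confusing part, a variant that avoids it may be easier to reason about. The sketch below is not the code from this answer: it assumes the only backslashes in the data are the JSON `\/` escapes seen in the question, and it uses `fixed = TRUE` so the pattern is a literal two-character string (backslash, slash) rather than a regular expression. The `json_lines` vector is a shortened, made-up stand-in for `emrJson`, and the final parse step only runs if rjson is installed.

```r
# Shortened stand-in for emrJson (hypothetical sample, same escaping as the question).
json_lines <- c(
  "{",
  "  \"LogUri\": \"s3n:\\/\\/emrlogs\\/\",",
  "  \"Name\": \"emrFromR\"",
  "}"
)

# "\\/" in R source is the two characters backslash + slash.
# With fixed = TRUE that pair is matched literally, with no regex escaping involved.
cleaned <- gsub("\\/", "/", json_lines, fixed = TRUE)

# Join the lines into a single JSON document.
json_text <- paste(cleaned, collapse = "\n")
cat(json_text, "\n")

# Optional: parse with rjson if it is available.
if (requireNamespace("rjson", quietly = TRUE)) {
  parsed <- rjson::fromJSON(json_text)
  print(parsed$LogUri)
}
```

Targeting only `\/` is deliberate. Stripping every backslash would also damage legitimate JSON escapes such as `\"` inside string values, should the tool ever emit them.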
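Answer 1 (score: 0)
I'm not sure it is double escaped. Remember that you need to use 'cat' to see what the string actually is, rather than a representation of the string.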
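A quick way to check this, using one value from the question's dput() output (the variable names are just for illustration):

```r
# One value from the question, written as an R string literal.
# In R source code, each "\\" stands for a single backslash in the string.
x <- "s3n:\\/\\/emrlogs\\/"

print(x)       # shows the escaped representation, with doubled backslashes
cat(x, "\n")   # writes the characters the string actually contains

# Count the characters and the backslashes actually stored in the string.
n_chars <- nchar(x)
n_backslashes <- lengths(regmatches(x, gregexpr("\\", x, fixed = TRUE)))
n_chars        # 17
n_backslashes  # 3
```

Three backslashes for three slashes means the string holds `\/` once per slash, which is what the command line shows. The doubling only appears in how print() and dput() display the value.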