R: Replacing double-escaped text

Time: 2010-07-05 03:18:08

Tags: regex r elastic-map-reduce

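I am using the Amazon Elastic Map Reduce command-line tools to glue together a number of system calls. These commands return JSON text that has already been (partially?) escaped. Then, when the system call turns it into an R character object (intern=T), it appears to get escaped again. I need to clean it up so that it will parse with the rjson package.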

I make the system call this way:

system("~/EMR/elastic-mapreduce --describe --jobflow j-2H9P770Z4B8GG", intern=T)

which returns:

 [1] "{"                                                                                             
 [2] "  \"JobFlows\": ["                                                                             
 [3] "    {"                                                                                         
 [4] "      \"LogUri\": \"s3n:\\/\\/emrlogs\\/\","                                                   
 [5] "      \"Name\": \"emrFromR\","                                                                 
 [6] "      \"BootstrapActions\": [" 
...

But the same command run from the command line returns:

{
  "JobFlows": [
    {
      "LogUri": "s3n:\/\/emrlogs\/",
      "Name": "emrFromR",
      "BootstrapActions": [
        {
          "BootstrapActionConfig": {
...

If I try to run the result of the system call through rjson, I get this error:

Error: '\/' is an unrecognized escape in character string starting "s3n:\/"

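I believe this is because of the double escaping in the s3n line. I am struggling to massage this text into something that will parse.

It might be as simple as replacing "\\" with "\", but since I struggle a bit with regex and escaping, I can't get it done properly.

So how do I take a vector of strings and replace every occurrence of "\\" with "\"? (Even to type this question I had to use three backslashes to represent two.) Any other tips related to this specific use case?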

Here is my code in more detail:

> library(rjson)
> emrJson <- paste(system("~/EMR/elastic-mapreduce --describe --jobflow j-2H9P770Z4B8GG", intern=T))
> 
>     parser <- newJSONParser()
>     for (i in 1:length(emrJson)){
+       parser$addData(emrJson[i])
+     }
> 
> parser$getObject()
Error: '\/' is an unrecognized escape in character string starting "s3n:\/"

If you want to recreate the emrJson object, here is the dput() output:

> dput(emrJson)
c("{", "  \"JobFlows\": [", "    {", "      \"LogUri\": \"s3n:\\/\\/emrlogs\\/\",", 
"      \"Name\": \"emrFromR\",", "      \"BootstrapActions\": [", 
"        {", "          \"BootstrapActionConfig\": {", "            \"Name\": \"Bootstrap 0\",", 
"            \"ScriptBootstrapAction\": {", "              \"Path\": \"s3:\\/\\/rtmpfwblrx\\/bootstrap.sh\",", 
"              \"Args\": []", "            }", "          }", 
"        }", "      ],", "      \"ExecutionStatusDetail\": {", 
"        \"EndDateTime\": 1278124414.0,", "        \"CreationDateTime\": 1278123795.0,", 
"        \"LastStateChangeReason\": \"Steps completed\",", "        \"State\": \"COMPLETED\",", 
"        \"StartDateTime\": 1278124000.0,", "        \"ReadyDateTime\": 1278124237.0", 
"      },", "      \"Steps\": [", "        {", "          \"StepConfig\": {", 
"            \"ActionOnFailure\": \"CANCEL_AND_WAIT\",", "            \"Name\": \"Example Streaming Step\",", 
"            \"HadoopJarStep\": {", "              \"MainClass\": null,", 
"              \"Jar\": \"\\/home\\/hadoop\\/contrib\\/streaming\\/hadoop-0.18-streaming.jar\",", 
"              \"Args\": [", "                \"-input\",", "                \"s3n:\\/\\/rtmpfwblrx\\/stream.txt\",", 
"                \"-output\",", "                \"s3n:\\/\\/rtmpfwblrxout\\/\",", 
"                \"-mapper\",", "                \"s3n:\\/\\/rtmpfwblrx\\/mapper.R\",", 
"                \"-reducer\",", "                \"cat\",", 
"                \"-cacheFile\",", "                \"s3n:\\/\\/rtmpfwblrx\\/emrData.RData#emrData.RData\"", 
"              ],", "              \"Properties\": []", "            }", 
"          },", "          \"ExecutionStatusDetail\": {", "            \"EndDateTime\": 1278124322.0,", 
"            \"CreationDateTime\": 1278123795.0,", "            \"LastStateChangeReason\": null,", 
"            \"State\": \"COMPLETED\",", "            \"StartDateTime\": 1278124232.0", 
"          }", "        }", "      ],", "      \"JobFlowId\": \"j-2H9P770Z4B8GG\",", 
"      \"Instances\": {", "        \"Ec2KeyName\": \"JL 09282009\",", 
"        \"InstanceCount\": 2,", "        \"Placement\": {", 
"          \"AvailabilityZone\": \"us-east-1d\"", "        },", 
"        \"KeepJobFlowAliveWhenNoSteps\": false,", "        \"SlaveInstanceType\": \"m1.small\",", 
"        \"MasterInstanceType\": \"m1.small\",", "        \"MasterPublicDnsName\": \"ec2-174-129-70-89.compute-1.amazonaws.com\",", 
"        \"MasterInstanceId\": \"i-2147b84b\",", "        \"InstanceGroups\": null,", 
"        \"HadoopVersion\": \"0.18\"", "      }", "    }", "  ]", 
"}")

2 answers:

Answer 0: (score: 2)

The general rule seems to be to use twice as many backslashes as you think you need (I can't find the source right now).

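library(rjson)
# The pattern "\\\\" is the regex for one literal backslash. The replacement "\\"
# is a lone trailing backslash, which gsub() drops, so this strips every backslash.
emrJson <- gsub("\\\\", "\\", emrJson)
parser <- newJSONParser()
for (i in seq_along(emrJson)) {
    parser$addData(emrJson[i])
}
parser$getObject()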

This works here using your dput() output.
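To make the backslash layers concrete, here is a small base-R sketch using one line from the question's dput() output. It assumes the behaviour of R's default regex engine, where a lone trailing backslash in the replacement is discarded, and it adds a narrower alternative with fixed = TRUE that is not part of the original answer. The variable names (x, y, stripped, unescaped) are purely illustrative.

```r
# One line from the question's dput() output. In memory it contains single
# backslashes: the doubled ones exist only in the R source literal.
x <- "      \"LogUri\": \"s3n:\\/\\/emrlogs\\/\","

# Layer 1: the R literal "\\\\" is a two-character string.
# Layer 2: the regex engine reads those two characters as one literal backslash.
pattern_chars <- nchar("\\\\")

# The call from the answer: the replacement is a lone backslash, which gsub()
# discards, so every backslash in x is removed.
stripped <- gsub("\\\\", "\\", x)

# Narrower alternative (assumption: only the "\/" escape is a problem).
# fixed = TRUE makes the pattern the literal two characters backslash + slash.
unescaped <- gsub("\\/", "/", x, fixed = TRUE)

# The two approaches differ when the JSON contains other escapes such as \".
y <- "{\"k\": \"say \\\"hi\\\"\"}"
y_stripped  <- gsub("\\\\", "\\", y)             # loses the \" escapes
y_unescaped <- gsub("\\/", "/", y, fixed = TRUE) # leaves y untouched

cat(stripped, unescaped, y_stripped, y_unescaped, sep = "\n")
```

For this particular input both approaches give the same line, but stripping every backslash would corrupt JSON strings that contain escaped quotes, so the fixed-pattern variant may be the safer choice.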

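Answer 1: (score: 0)

I am not sure it is double-escaped. Remember that you need to use 'cat' to see what the string actually is, rather than the printed representation of the string.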
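A minimal base-R sketch of that point, using a shortened copy of the LogUri line from the question (the variable names are illustrative): print() shows the escaped representation, while cat() and writeLines() show the characters actually stored.

```r
# Shortened copy of the LogUri line, typed as an R literal.
x <- "\"LogUri\": \"s3n:\\/\\/emrlogs\\/\","

print(x)       # escaped representation, like the system(..., intern=T) display
cat(x, "\n")   # the stored characters, like the shell output
writeLines(x)  # same content as cat(), followed by a newline

# Count the backslash characters that are really in the string.
chars <- strsplit(x, "")[[1]]
n_backslashes <- sum(chars == "\\")
n_backslashes
```

Each "\/" in the stored string contains one backslash, so the text R holds is the same as what the shell printed; only the display from print() makes it look double-escaped.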