R: Reading and parsing JSON

Date: 2016-10-14 14:00:49

Tags: json r text-parsing

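If R is not the right tool for this job then fair enough, but I believe it should be.

I am calling an API and then dumping the result into the Postman JSON reader. The result looks like this: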

 "results": [
    {
      "personUuid": "***",
      "synopsis": {
        "fullName": "***",
        "headline": "***",
        "location": "***",
        "image": "***",
        "skills": [
          "*",
          "*",
          "*",
          "*.",
          "*"
        ],
        "phoneNumbers": [
          "***",
          "***"
        ],
        "emailAddresses": [
          "***"
        ],
        "networks": [
          {
            "name": "linkedin",
            "url": "***",
            "type": "canonicalUrl",
            "lastAccessed": null
          },
          {
            "name": "***",
            "url": "***",
            "type": "cvUrl",
            "lastAccessed": "*"
          },
          {
            "name": "*",
            "url": "***",
            "type": "cvUrl",
            "lastAccessed": "*"
          }
        ]
      }
    },
    {

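First, I am not sure how to import this into R, as I mostly work with CSV files. I have seen other questions where people use a JSON package to call the URL directly, but that does not work for what I am doing, so I would like to know how to read a CSV file that contains JSON.

I used: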

x <- fromJSON(file="Z:/json.csv")

but maybe there is a better way. Once that is done, the JSON looks more like:

...$results[[9]]$synopsis$emailAddresses
[1] "***" "***"          
[3] "***"                "***"          

$results[[9]]$synopsis$networks...

Then, for each result, I want to store the headline and the email addresses in a data table.

I tried:

str_extract_all(x, 'emailAddresses*$')

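I thought * would stand for everything between emailAddresses and $, including new lines and so on, but this does not work. I also find that even when the * does match, the extract does not return the text that the * stands for.

For example: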

> y <- 'some text. email "oli@oli.o" other text'
> y
[1] "some text. email \"oli@oli.o\" other text"
> str_extract_all(y, 'email \"*"')
[[1]]
[1] "email \""
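
A note on the pattern above: in a regular expression, `*` is a quantifier meaning "zero or more of the preceding element", not a wildcard, so `\"*` only matches a run of quote characters, and `$` anchors the end of the string. The following is a minimal sketch, assuming the goal is the text between the quotes that follow `email`. It uses base R's `regexec` and `regmatches` with a capture group instead of `str_extract_all`; the names `pattern`, `m`, `full_match` and `email` are illustrative only.

```r
y <- 'some text. email "oli@oli.o" other text'

# [^"]* matches any run of characters that are not a double quote;
# the parentheses capture that run so it can be pulled out on its own.
pattern <- 'email "([^"]*)"'

m <- regmatches(y, regexec(pattern, y))
full_match <- m[[1]][1]  # whole match, including the word "email" and the quotes
email      <- m[[1]][2]  # first capture group only

print(email)
```

The first element of the result is the whole match and the remaining elements are the capture groups, which is why the address itself sits at position 2.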

Part 2:

The answer below works, but if I call the API directly:

body ='{"start": 0,"count": 105,...}'

x <- POST(url="https://live.*.me/api/v3/person", body=body, add_headers(Accept="application/json", 'Content-Type'="application/json", Authorization = "id=*, apiKey=*"))

y <- content(x)

and then use

fromJSON(y, flatten=TRUE)$results[c("synopsis.headline",  
                                            "synopsis.emailAddresses")]

it does not work. I tried the following instead:

z <- NULL
zz <- NULL

for(i in 1:y$count){
     z=rbind(z,data.table(job = y$results[[i]]$synopsis$headline))
 }
 for(i in 1:y$count){
       zz=rbind(zz,data.table(job = y$results[[i]]$synopsis$emailAddresses))
   }
df <- cbind(z,zz)

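However, in the returned JSON list some people have multiple emails, and the approach above only records the first email for each person. How can I save the multiple emails as a vector (rather than as multiple columns)?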
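
One possible way to keep all of a person's emails together is a list column, where each cell holds a character vector. The following is a minimal sketch in base R, assuming `y` has the nested-list shape shown earlier. The sample values are made up, and the names `headlines`, `emails` and `df` are illustrative only.

```r
# Hypothetical stand-in for the parsed API response `y` (a nested list).
y <- list(
  count = 2,
  results = list(
    list(synopsis = list(headline = "Analyst",
                         emailAddresses = list("a@example.com", "b@example.com"))),
    list(synopsis = list(headline = "Engineer",
                         emailAddresses = list("c@example.com")))
  )
)

# One headline per person. vapply() stops with an error if a headline is
# missing (NULL), so real data may need a fallback value here.
headlines <- vapply(y$results, function(r) r$synopsis$headline, character(1))

# One character vector of emails per person, however many there are.
emails <- lapply(y$results, function(r) unlist(r$synopsis$emailAddresses))

df <- data.frame(headline = headlines, stringsAsFactors = FALSE)
df$emails <- emails  # list column: each row holds a whole vector

print(df)
```

Each row then has exactly one `emails` cell, and `df$emails[[i]]` returns the full vector for person `i`.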

2 Answers:

Answer 0: (score: 2)

UPDATE 1: To read JSON from a URL, you can simply call the fromJSON function and pass it a string containing the URL of your JSON data:

library(jsonlite)

url <- 'http://you.url.com/data.json'

# here we pass a URL to fromJSON instead of the JSON content itself
fromJSON(url, flatten=TRUE)$results[c("synopsis.headline", "synopsis.emailAddresses")] 

// end UPDATE 1

You can also pass the flatten argument to fromJSON and then work with the 'results' data frame.

fromJSON(json.data, flatten=TRUE)$results[c("synopsis.headline",  
                                            "synopsis.emailAddresses")]

  synopsis.headline synopsis.emailAddresses
1               ***        jane.doe@boo.com
2               ***        john.doe@foo.com

Here is how I defined json.data; note that I intentionally added one record to your example input JSON.

 json.data <- '{
      "results":[  
        {  
          "personUuid":"***",
          "synopsis":{  
            "fullName":"***",
            "headline":"***",
            "location":"***",
            "image":"***",
            "skills":[  
              "*",
              "*",
              "*",
              "*.",
              "*"
              ],
            "phoneNumbers":[  
              "***",
              "***"
              ],
            "emailAddresses":[  
              "jane.doe@boo.com"
              ],
            "networks":[  
              {  
                "name":"linkedin",
                "url":"***",
                "type":"canonicalUrl",
                "lastAccessed":null
              },
              {  
                "name":"***",
                "url":"***",
                "type":"cvUrl",
                "lastAccessed":"*"
              },
              {  
                "name":"*",
                "url":"***",
                "type":"cvUrl",
                "lastAccessed":"*"
              }
              ]
          }
        },
        {  
          "personUuid":"***",
          "synopsis":{  
            "fullName":"***",
            "headline":"***",
            "location":"***",
            "image":"***",
            "skills":[  
              "*",
              "*",
              "*",
              "*.",
              "*"
              ],
            "phoneNumbers":[  
              "***",
              "***"
              ],
            "emailAddresses":[  
              "john.doe@foo.com"
              ],
            "networks":[  
              {  
                "name":"linkedin",
                "url":"***",
                "type":"canonicalUrl",
                "lastAccessed":null
              },
              {  
                "name":"***",
                "url":"***",
                "type":"cvUrl",
                "lastAccessed":"*"
              },
              {  
                "name":"*",
                "url":"***",
                "type":"cvUrl",
                "lastAccessed":"*"
              }
              ]
          }
        }
        ]
    }'
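
Regarding the direct API call in Part 2 of the question: `fromJSON` expects JSON text (or a URL or file path), whereas httr's `content(x)` returns an already-parsed list by default, which is a likely reason the call fails. The following is a hedged sketch, assuming the response body is requested as text with `content(x, as = "text", encoding = "UTF-8")`. The string `body_text` is a made-up stand-in for that body so the example runs without the API.

```r
library(jsonlite)

# Hypothetical stand-in for:
#   body_text <- content(x, as = "text", encoding = "UTF-8")
body_text <- '{
  "count": 2,
  "results": [
    {"synopsis": {"headline": "Analyst",
                  "emailAddresses": ["a@example.com", "b@example.com"]}},
    {"synopsis": {"headline": "Engineer",
                  "emailAddresses": ["c@example.com"]}}
  ]
}'

res <- fromJSON(body_text, flatten = TRUE)$results[
  c("synopsis.headline", "synopsis.emailAddresses")
]

# Here synopsis.emailAddresses is a list column: one character vector per person.
str(res)
```

Because the emails stay in a single list column, a person with several addresses still occupies one row.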

Answer 1: (score: 1)

Additional test data would probably help.

Consider:

library(jsonlite)
library(dplyr)

json_data = "{\"results\": [\n    {\n\"personUuid\": \"***\",\n\"synopsis\": {\n\"fullName\": \"***\",\n\"headline\": \"***\",\n\"location\": \"***\",\n\"image\": \"***\",\n\"skills\": [\n\"*\",\n\"*\",\n\"*\",\n\"*.\",\n\"*\"\n],\n\"phoneNumbers\": [\n\"***\",\n\"***\"\n],\n\"emailAddresses\": [\n\"***\"\n],\n\"networks\": [\n{\n  \"name\": \"linkedin\",\n  \"url\": \"***\",\n  \"type\": \"canonicalUrl\",\n  \"lastAccessed\": null\n},\n  {\n  \"name\": \"***\",\n  \"url\": \"***\",\n  \"type\": \"cvUrl\",\n  \"lastAccessed\": \"*\"\n  },\n  {\n  \"name\": \"*\",\n  \"url\": \"***\",\n  \"type\": \"cvUrl\",\n  \"lastAccessed\": \"*\"\n  }\n  ]\n}\n}]}"

(df <- jsonlite::fromJSON(json_data, simplifyDataFrame = TRUE, flatten = TRUE))
#> $results
#>   personUuid synopsis.fullName synopsis.headline synopsis.location
#> 1        ***               ***               ***               ***
#>   synopsis.image synopsis.skills synopsis.phoneNumbers
#> 1            ***  *, *, *, *., *              ***, ***
#>   synopsis.emailAddresses
#> 1                     ***
#>                                                       synopsis.networks
#> 1 linkedin, ***, *, ***, ***, ***, canonicalUrl, cvUrl, cvUrl, NA, *, *

df$results %>%
  select(headline = synopsis.headline, emails = synopsis.emailAddresses)
#>   headline emails
#> 1      ***    ***