Importing JSON file data into R as a data frame for NLP

Time: 2017-12-02 06:13:44

Tags: json r parsing

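I am trying to import data from a JSON file into R so that I can experiment with natural language processing. The data was parsed and extracted from blog posts written in markdown. The problem is that the import into R comes in as a list with an odd format, and I can't figure out how to get it into a data frame. Is this a problem with my JSON file or with the import process?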

Sample data:

{
  "2017-11-17-blog-post-01": {
    "title": "Blog Post 01",
    "layout": "post",
    "categories": [
      "Category1",
      "Category2"
    ],
    "comments": true,
    "published": true,
    "permalink": "/blog-post-01.html",
    "basename": "2017-11-17-blog-post-01"
  },
  "2017-11-30-blog-post-02": {
    "title": "Blog Post 2",
    "layout": "post",
    "categories": [
      "Category2",
      "Category3"
    ],
    "comments": true,
    "published": true,
    "permalink": "/2017-11-30-blog-post-02.html",
    "basename": "2017-11-30-blog-post-02"
  }
}

Command:

library(jsonlite)
import <- fromJSON("test-import.json", flatten=TRUE)

Result:

$`2017-11-17-blog-post-01`
$`2017-11-17-blog-post-01`$title
[1] "Blog Post 01"

$`2017-11-17-blog-post-01`$layout
[1] "post"

$`2017-11-17-blog-post-01`$categories
[1] "Category1" "Category2"

$`2017-11-17-blog-post-01`$comments
[1] TRUE

$`2017-11-17-blog-post-01`$published
[1] TRUE

$`2017-11-17-blog-post-01`$permalink
[1] "/blog-post-01.html"

$`2017-11-17-blog-post-01`$basename
[1] "2017-11-17-blog-post-01"


$`2017-11-30-blog-post-02`
$`2017-11-30-blog-post-02`$title
[1] "Blog Post 2"

$`2017-11-30-blog-post-02`$layout
[1] "post"

$`2017-11-30-blog-post-02`$categories
[1] "Category2" "Category3"

$`2017-11-30-blog-post-02`$comments
[1] TRUE

$`2017-11-30-blog-post-02`$published
[1] TRUE

$`2017-11-30-blog-post-02`$permalink
[1] "/2017-11-30-blog-post-02.html"

$`2017-11-30-blog-post-02`$basename
[1] "2017-11-30-blog-post-02"

1 Answer:

Answer 0 (score: 1)

library(purrr)

Your data:

jsonlite::fromJSON('{
  "2017-11-17-blog-post-01": {
    "title": "Blog Post 01",
    "layout": "post",
    "categories": [
      "Category1",
      "Category2"
    ],
    "comments": true,
    "published": true,
    "permalink": "/blog-post-01.html",
    "basename": "2017-11-17-blog-post-01"
  },
  "2017-11-30-blog-post-02": {
    "title": "Blog Post 2",
    "layout": "post",
    "categories": [
      "Category2",
      "Category3"
    ],
    "comments": true,
    "published": true,
    "permalink": "/2017-11-30-blog-post-02.html",
    "basename": "2017-11-30-blog-post-02"
  }
}', flatten=TRUE) -> jsdat

flatten=TRUE is helpful most of the time, but I think categories is what keeps it from automatically making a data frame for you, so we can help it along:

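# Wrap each post's categories vector in a list so it fits in a single cell,
# then row-bind the posts into one tibble; .id stores each post's key in "id".
# (map_df needs the dplyr package to be installed.)
map_df(jsdat, ~{
  .x$categories <- list(.x$categories)
  .x
}, .id="id")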

## # A tibble: 2 x 8
##                        id        title layout categories comments published                     permalink                basename
##                     <chr>        <chr>  <chr>     <list>    <lgl>     <lgl>                         <chr>                   <chr>
## 1 2017-11-17-blog-post-01 Blog Post 01   post  <chr [2]>     TRUE      TRUE            /blog-post-01.html 2017-11-17-blog-post-01
## 2 2017-11-30-blog-post-02  Blog Post 2   post  <chr [2]>     TRUE      TRUE /2017-11-30-blog-post-02.html 2017-11-30-blog-post-02
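
For comparison, the same reshaping can be sketched in base R without purrr. This is only an illustration under stated assumptions: jsdat below is a hand-built stand-in for the list that fromJSON returns for the sample data (so the snippet runs on its own), and the names scalar_fields and posts are made up for this example. It also assumes every post has the same single-valued fields.

```r
# Hand-built stand-in for the parsed JSON (assumption: mirrors the sample data).
jsdat <- list(
  "2017-11-17-blog-post-01" = list(
    title = "Blog Post 01", layout = "post",
    categories = c("Category1", "Category2"),
    comments = TRUE, published = TRUE,
    permalink = "/blog-post-01.html",
    basename = "2017-11-17-blog-post-01"
  ),
  "2017-11-30-blog-post-02" = list(
    title = "Blog Post 2", layout = "post",
    categories = c("Category2", "Category3"),
    comments = TRUE, published = TRUE,
    permalink = "/2017-11-30-blog-post-02.html",
    basename = "2017-11-30-blog-post-02"
  )
)

# Fields that hold exactly one value per post (hypothetical helper name).
scalar_fields <- c("title", "layout", "comments", "published",
                   "permalink", "basename")

# Build a one-row data frame per post from the scalar fields, then stack them.
posts <- do.call(rbind, lapply(unname(jsdat), function(post) {
  as.data.frame(post[scalar_fields], stringsAsFactors = FALSE)
}))

# Keep the top-level JSON keys as an id column.
posts <- data.frame(id = names(jsdat), posts, stringsAsFactors = FALSE)

# Store the variable-length categories as a list column, one vector per row.
posts$categories <- unname(lapply(jsdat, function(post) post$categories))

str(posts)
```

The categories column stays a list column here, as in the tibble above; if one row per post/category pair is more convenient for text analysis, it would need to be expanded in a further step.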