How to prepare Google Natural Language Processing output (JSON) for BigQuery

Asked: 2016-10-25 20:12:03

Tags: sql json nlp google-bigquery google-cloud-platform

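I'm trying to query the output of a Natural Language Processing (NLP) call in BigQuery (BQ), but I'm struggling to get the output into the right format for BQ.

I understand that BQ takes JSON files (newline-delimited), but I'm not sure (a) whether the NLP output is newline-delimited JSON and (b) whether my schema is correct.

This is the JSON output I'm working with: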

{
  "entities": [
    {
      "name": "Rowling",
      "type": "PERSON",
      "metadata": {
        "wikipedia_url": "http://en.wikipedia.org/wiki/J._K._Rowling"
      },
      "salience": 0.65751493,
      "mentions": [
        {
          "text": {
            "content": "   J.",
            "beginOffset": -1
          }
        },
        {
          "text": {
            "content": "K. Rowl",
            "beginOffset": -1
          }
        }
      ]
    },
    {
      "name": "LONDON",
      "type": "LOCATION",
      "metadata": {
        "wikipedia_url": "http://en.wikipedia.org/wiki/London"
      },
      "salience": 0.14284456,
      "mentions": [
        {
          "text": {
            "content": "\ufeffLON",
            "beginOffset": -1
          }
        }
      ]
    },
    {
      "name": "Harry Potter",
      "type": "WORK_OF_ART",
      "metadata": {
        "wikipedia_url": "http://en.wikipedia.org/wiki/Harry_Potter"
      },
      "salience": 0.0726779,
      "mentions": [
        {
          "text": {
            "content": "th Harry Pot",
            "beginOffset": -1
          }
        },
        {
          "text": {
            "content": "‘Harry Pot",
            "beginOffset": -1
          }
        }
      ]
    },
    {
      "name": "Deathly Hallows",
      "type": "WORK_OF_ART",
      "metadata": {
        "wikipedia_url": "http://en.wikipedia.org/wiki/Harry_Potter_and_the_Deathly_Hallows"
      },
      "salience": 0.022565609,
      "mentions": [
        {
          "text": {
            "content": "he Deathly Hall",
            "beginOffset": -1
          }
        }
      ]
    }
  ],
  "language": "en"
}

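Is there a way to send the output directly to BigQuery from the command line in the Google Cloud Shell?

Any information is greatly appreciated!

Thanks

2 answers:

Answer 0 (score: 2)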

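Glad you found my Harry Potter blog post! I'd recommend storing the NL API's JSON response as a string in BigQuery and then querying it with a user-defined function. You should be able to run the following (the table is publicly viewable) to count how often each entity appears in the JSON you posted: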

SELECT 
  COUNT(*) as entity_count, entity
FROM 
  JS(
    (SELECT entities FROM [sara-bigquery:samples.hp_udf]),
    entities,
    "[{ name: 'entity', type: 'string'}]",
    "function(row, emit) { 
      try {
        x = JSON.parse(row.entities);
        entities = x['entities'];
        entities.forEach(function(data) {
          emit({ entity: data.name });
        });
      } catch (e) {}
    }" 
  )
GROUP BY entity
ORDER BY entity_count DESC

Answer 1 (score: 1)

  

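"send the output directly to BigQuery from the command line in the Google Cloud Shell"

See this page and search for "bq load": https://cloud.google.com/bigquery/bq-command-line-tool

There are also some JSON schema examples here: Schema to load json data to google big query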
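
On the two points in the question: the pretty-printed response shown there spans many lines, so it is not newline-delimited as it stands; each response has to be collapsed to a single line first. Below is a minimal Python sketch of that step, with a nested schema that is only an assumption inferred from the sample output (it is not taken from official documentation of the API's response format). The function name `to_ndjson_line` and the file, dataset, and table names in the comment are hypothetical.

```python
import json

# Assumed schema, inferred from the sample response in the question.
SCHEMA = [
    {"name": "language", "type": "STRING", "mode": "NULLABLE"},
    {"name": "entities", "type": "RECORD", "mode": "REPEATED", "fields": [
        {"name": "name", "type": "STRING", "mode": "NULLABLE"},
        {"name": "type", "type": "STRING", "mode": "NULLABLE"},
        {"name": "salience", "type": "FLOAT", "mode": "NULLABLE"},
        {"name": "metadata", "type": "RECORD", "mode": "NULLABLE", "fields": [
            {"name": "wikipedia_url", "type": "STRING", "mode": "NULLABLE"},
        ]},
        {"name": "mentions", "type": "RECORD", "mode": "REPEATED", "fields": [
            {"name": "text", "type": "RECORD", "mode": "NULLABLE", "fields": [
                {"name": "content", "type": "STRING", "mode": "NULLABLE"},
                {"name": "beginOffset", "type": "INTEGER", "mode": "NULLABLE"},
            ]},
        ]},
    ]},
]


def to_ndjson_line(pretty_json_text: str) -> str:
    """Collapse one pretty-printed JSON document onto a single line."""
    return json.dumps(json.loads(pretty_json_text), separators=(",", ":"))


# Tiny stand-in for the multi-line response, for illustration.
pretty = """{
  "entities": [
    {"name": "LONDON", "type": "LOCATION", "salience": 0.14284456}
  ],
  "language": "en"
}"""

line = to_ndjson_line(pretty)
schema_text = json.dumps(SCHEMA, indent=2)

# After saving `line` to nl_output.ndjson and `schema_text` to schema.json
# (hypothetical file names), a load along these lines should work, with
# mydataset.mytable as a placeholder:
#   bq load --source_format=NEWLINE_DELIMITED_JSON \
#       mydataset.mytable nl_output.ndjson schema.json
```

Loading with a nested schema keeps the entities queryable as repeated records, whereas storing the response as a single string defers all parsing to query time; which fits better depends on how stable the response shape is.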