How to split a string column into rows of single words and word pairs in BigQuery SQL?

Asked: 2018-03-21 14:53:37

Tags: google-bigquery standard-sql legacy-sql

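I am trying (unsuccessfully) to split a string column in Google BigQuery into rows containing every single word and every word pair (words that are adjacent to each other, kept in order). I also need to keep the ID field from IndataTable for each word. Both record sets have 2 columns.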

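IndataTable as IDT
ID WordString
1 apple banana pear
2 carrot
3 blue red green yellow

OutdataTable as ODT
ID WordString
1 apple
1 banana
1 pear
1 apple banana
1 banana pear
2 carrot
3 blue
3 red
3 green
3 yellow
3 blue red
3 red green
3 green yellow

(Only pairs of words that are adjacent to each other.)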

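Is this possible in BigQuery SQL?

Edit/Added:
This is what I have so far for splitting the column into single words. I would really like to work out how to extend it to word pairs. I do not know whether this query can be modified or whether I need a new approach.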

SELECT ID, split(WordString,' ') as Words
FROM (
  select * 
     from 
     (select ID, WordString from IndataTable)
)

1 Answer:

Answer 0: (score: 1)

The following is for BigQuery Standard SQL. Run against the sample data from the question, it returns the expected result:

#standardSQL
WITH IndataTable AS (
  SELECT 1 id, 'apple banana pear' WordString UNION ALL
  SELECT 2, 'carrot' UNION ALL
  SELECT 3, 'blue red green yellow' 
), words AS (
  SELECT id, word, pos
  FROM IndataTable, UNNEST(SPLIT(WordString,' ')) AS Word WITH OFFSET pos
), pairs AS (
  SELECT id, CONCAT(word, ' ', LEAD(word) OVER(PARTITION BY id ORDER BY pos)) pair
  FROM words
)
SELECT id, word AS WordString FROM words UNION ALL
SELECT id, pair AS WordString FROM pairs
WHERE NOT pair IS NULL
ORDER BY id
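
The query above builds pairs with LEAD and then filters out the NULL that LEAD produces for the last word of each id. As a possible alternative, here is a minimal sketch that indexes directly into the split array instead. It assumes the same sample data as the answer; the CTE name split_words and the alias pos are illustrative names chosen here, not part of the original answer. GENERATE_ARRAY returns an empty array when its start is greater than its end, so a one-word row produces no pair and no NULL filter is needed.

```sql
#standardSQL
-- Sample data copied from the answer above.
WITH IndataTable AS (
  SELECT 1 AS id, 'apple banana pear' AS WordString UNION ALL
  SELECT 2, 'carrot' UNION ALL
  SELECT 3, 'blue red green yellow'
), split_words AS (
  -- Hypothetical helper CTE: one array of words per input row.
  SELECT id, SPLIT(WordString, ' ') AS words
  FROM IndataTable
)
-- Single words: one output row per array element.
SELECT id, word AS WordString
FROM split_words, UNNEST(words) AS word
UNION ALL
-- Adjacent pairs: pos runs over 0 .. length - 2, so each pair is
-- (words[pos], words[pos + 1]); a one-word row yields an empty range.
SELECT id, CONCAT(words[OFFSET(pos)], ' ', words[OFFSET(pos + 1)]) AS WordString
FROM split_words, UNNEST(GENERATE_ARRAY(0, ARRAY_LENGTH(words) - 2)) AS pos
ORDER BY id
```

Both versions should produce the same set of rows. As in the original query, ORDER BY id sorts only by id, so the order of rows within one id is not guaranteed.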