BigQuery SQL query to extract Twitter usernames from tweets

Asked: 2016-04-01 15:39:14

Tags: sql twitter google-bigquery

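I have a large table of tweets containing the username and the tweet text. The tweet text often contains mentions (@username). I want to extract all mentioned usernames and build a new table for social network analysis in which each row holds exactly one mention.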

Example rows:

|-------------------|--------------------------------------|
|      username     |     tweet                            |
|-------------------|--------------------------------------|
|      userA        |     great stuff @userC and @userB    |
|-------------------|--------------------------------------|
|      userB        |     thanks for mentioning @userE     |
|-------------------|--------------------------------------|

Desired result:

 |-------------------|--------------------------------------|
 |      tweet_by     |     repied_to                        |
 |-------------------|--------------------------------------|
 |      userA        |     userC                            |
 |-------------------|--------------------------------------|
 |      userA        |     userB                            |
 |-------------------|--------------------------------------|
 |      userB        |     userE                            |
 |-------------------|--------------------------------------|

I found this question, but I could not work out a solution using split() and regexp_extract: Transform data in Google bigquery - extract text, split it into multiple columns and pivoting the data

1 answer:

Answer 0 (score: 1)

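Try the simple option below. It splits each tweet on spaces and keeps only the words that start with '@'. It should work because your extraction criteria look quite simple, unless you need to handle some edge cases.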

SELECT 
  username AS tweet_by, 
  SPLIT(tweet, ' ') AS repied_to 
FROM YourTable
HAVING LEFT(repied_to, 1) = '@'
  

Added: to handle potential cases where a mention is adjacent to punctuation, such as

userA     great stuff @userC&@userB  
userB     thanks for mentioning  @userE!  
userC     great stuff  @userC,@userB  

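the query below first replaces every character that is neither a word character nor '@' with a space, then splits on spaces as before: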

SELECT
  tweet_by,
  REPLACE(word, '@', '') AS repied_to
FROM (  
  SELECT 
    username AS tweet_by,
    SPLIT(REGEXP_REPLACE(tweet, r'([^\w@])', ' '), ' ') AS word 
  FROM YourTable
  HAVING LEFT(word, 1) = '@'
)
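
For readers on BigQuery Standard SQL: the queries above appear to rely on legacy SQL behavior (HAVING without GROUP BY, and LEFT applied to the result of SPLIT), so they may not run as written there. A minimal sketch of the same idea, assuming Standard SQL and using REGEXP_EXTRACT_ALL with UNNEST, could look like the following. The WITH clause is only hypothetical sample data standing in for YourTable; it is not part of the original answer.

#standardSQL
-- Hypothetical sample rows standing in for YourTable (taken from the cases above).
WITH YourTable AS (
  SELECT 'userA' AS username, 'great stuff @userC&@userB' AS tweet UNION ALL
  SELECT 'userB', 'thanks for mentioning  @userE!' UNION ALL
  SELECT 'userC', 'great stuff  @userC,@userB'
)
SELECT
  t.username AS tweet_by,
  repied_to
FROM YourTable AS t
-- REGEXP_EXTRACT_ALL returns an array holding the capture group (the name
-- without '@') for every match; UNNEST turns that array into one row per mention.
CROSS JOIN UNNEST(REGEXP_EXTRACT_ALL(t.tweet, r'@(\w+)')) AS repied_to

Note one difference from the split-based queries: this pattern also matches an '@' in the middle of a word (for example inside an email address), so the pattern may need tightening if such text occurs in the tweets.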