Question

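I have a large table of tweets containing the username and the tweet text. The tweet text often contains mentions (@username). I want to extract all mentioned usernames and build a new table for social network analysis, with one mention per row.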

Example rows:

|-------------------|--------------------------------------|
|      username     |     tweet                            |
|-------------------|--------------------------------------|
|      userA        |     great stuff @userC and @userB    |
|-------------------|--------------------------------------|
|      userB        |     thanks for mentioning @userE     |
|-------------------|--------------------------------------|

Desired result:

 |-------------------|--------------------------------------|
 |      tweet_by     |     repied_to                        |
 |-------------------|--------------------------------------|
 |      userA        |     userC                            |
 |-------------------|--------------------------------------|
 |      userA        |     userB                            |
 |-------------------|--------------------------------------|
 |      userB        |     userE                            |
 |-------------------|--------------------------------------|

I found this question, but I could not work out a solution from it using split() and regexp_extract: Transform data in Google bigquery - extract text, split it into multiple columns and pivoting the data

Answer 1

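Try the simple option below. It should work, since I think your extraction criteria are quite simple, unless you want to handle some edge cases.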

-- Split each tweet into words and keep the words that start with '@'
SELECT 
  username AS tweet_by, 
  SPLIT(tweet, ' ') AS repied_to 
FROM YourTable
HAVING LEFT(repied_to, 1) = '@'

Added: to handle potential cases such as

userA     great stuff @userC&@userB  
userB     thanks for mentioning  @userE!  
userC     great stuff  @userC,@userB

Query

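SELECT
  tweet_by,
  REPLACE(word, '@', '') AS repied_to  -- remove the '@' from the mention
FROM (  
  SELECT 
    username AS tweet_by,
    -- replace every character that is neither a word character nor '@' with a space, then split on spaces
    SPLIT(REGEXP_REPLACE(tweet, r'([^\w@])', ' '), ' ') AS word 
  FROM YourTable
  HAVING LEFT(word, 1) = '@'  -- keep only the words that start with '@'
)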
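
The queries above appear to rely on BigQuery legacy SQL behavior (SPLIT producing one row per word, and HAVING used without GROUP BY). If you are on standard SQL instead, a rough equivalent could look like the sketch below. It assumes a mention is an '@' followed by one or more word characters; the inline YourTable sample data is only there for illustration and would be replaced by your real table.

```sql
#standardSQL
-- Hypothetical sample data standing in for the real table
WITH YourTable AS (
  SELECT 'userA' AS username, 'great stuff @userC&@userB' AS tweet UNION ALL
  SELECT 'userB', 'thanks for mentioning  @userE!' UNION ALL
  SELECT 'userC', 'great stuff  @userC,@userB'
)
SELECT
  username AS tweet_by,
  mention AS repied_to
FROM YourTable
-- REGEXP_EXTRACT_ALL returns an array holding the captured group of every match;
-- UNNEST turns that array into one row per mention
CROSS JOIN UNNEST(REGEXP_EXTRACT_ALL(tweet, r'@(\w+)')) AS mention
```

Note that this is not identical to the legacy query: it matches '@' anywhere in the text, so something like an email address would also produce a row, whereas the query above only keeps words that start with '@'.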
