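I have a large table of tweets with a username and the tweet text. The tweet text often contains mentions (@username). I want to extract all mentioned usernames and build a new table for social network analysis in which each row holds one mention.
Example rows: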
|-------------------|--------------------------------------|
| username | tweet |
|-------------------|--------------------------------------|
| userA | great stuff @userC and @userB |
|-------------------|--------------------------------------|
| userB | thanks for mentioning @userE |
|-------------------|--------------------------------------|
Desired result:
|-------------------|--------------------------------------|
| tweet_by | repied_to |
|-------------------|--------------------------------------|
| userA | userC |
|-------------------|--------------------------------------|
| userA | userB |
|-------------------|--------------------------------------|
| userB | userE |
|-------------------|--------------------------------------|
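I found this question, but I could not work out a solution using split() and regexp_extract: Transform data in Google bigquery - extract text, split it into multiple columns and pivoting the data
Answer 0 (score: 1)
Try the simple option below. It should work, since I think your extraction criteria are quite simple, unless you want to handle some edge cases.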
SELECT
  username AS tweet_by,
  SPLIT(tweet, ' ') AS repied_to
FROM YourTable
HAVING LEFT(repied_to, 1) = '@'
Added: to handle potential cases where mentions are followed or separated by punctuation, such as
userA great stuff @userC&@userB
userB thanks for mentioning @userE!
userC great stuff @userC,@userB
Query:
SELECT
  tweet_by,
  REPLACE(word, '@', '') AS repied_to
FROM (
  SELECT
    username AS tweet_by,
    SPLIT(REGEXP_REPLACE(tweet, r'([^\w@])', ' '), ' ') AS word
  FROM YourTable
  HAVING LEFT(word, 1) = '@'
)
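
The queries above depend on BigQuery legacy SQL behavior: SPLIT yields a repeated field, and HAVING is used to filter its individual values. For anyone on standard SQL, a rough equivalent might look like the sketch below. It is not part of the original answer. The inline WITH data only stands in for YourTable so the query runs on its own, and `mention` is an illustrative alias. The pattern `@(\w+)` is an assumption about what counts as a username; it would also pick up the part after `@` in an email address, so adjust it if your tweets contain those.

```sql
#standardSQL
-- Inline sample rows standing in for YourTable (illustration only).
WITH YourTable AS (
  SELECT 'userA' AS username, 'great stuff @userC&@userB' AS tweet UNION ALL
  SELECT 'userB', 'thanks for mentioning @userE!' UNION ALL
  SELECT 'userC', 'no mentions here'
)
SELECT
  username AS tweet_by,
  mention AS repied_to
FROM YourTable
-- REGEXP_EXTRACT_ALL returns an ARRAY of the captured group (the name without '@');
-- UNNEST turns that array into one row per mention.
CROSS JOIN UNNEST(REGEXP_EXTRACT_ALL(tweet, r'@(\w+)')) AS mention
```

Because the capture group excludes the `@` and stops at the first non-word character, no separate REPLACE or punctuation cleanup step is needed. Tweets without any mention produce an empty array and therefore contribute no rows.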