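I'm working with the Hacker News dataset in BigQuery and trying to find which URLs have the most stories. I also want to strip each URL down to its domain and see which domains have the most stories. I'm working in R, but I'm having trouble getting the following query to work.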
# Select the ten domains that have the most stories
sql_domain <- "SELECT url REPLACE(CASE WHEN REGEXP_CONTAINS(url, '//')
THEN url ELSE CONCAT('http://', url) END, '&', '?') as domain_name,
COUNT(domain_name) as story_number
FROM `bigquery-public-data.hacker_news.full`
WHERE type = 'story'
GROUP BY domain_name
ORDER BY story_number DESC
LIMIT 10"
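I don't need to strip the top-level domain; for example, stackoverflow.com is fine, and there's no need to reduce it to stackoverflow. Thanks very much for your help!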
Answer 0 (score: 3)
The problem is in the query itself. Use NET.REG_DOMAIN to extract the domain, as below (BigQuery Standard SQL):
SELECT
NET.REG_DOMAIN(url) AS domain_name,
COUNT(NET.REG_DOMAIN(url)) AS story_number
FROM `bigquery-public-data.hacker_news.full`
WHERE type = 'story'
GROUP BY 1
ORDER BY story_number DESC
LIMIT 10
This gives you the following result:
Row domain_name story_number
1 github.com 81784
2 medium.com 71953
3 youtube.com 58119
4 blogspot.com 52925
5 nytimes.com 48986
6 techcrunch.com 43924
7 google.com 26326
8 wordpress.com 23372
9 arstechnica.com 23162
10 wired.com 18480
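To see what NET.REG_DOMAIN does to a single URL, and how it differs from NET.HOST, here is a minimal sketch in BigQuery Standard SQL. The URL literal is a made-up example for illustration, not a row from the dataset.

```sql
-- Illustrative only: the URL below is a hypothetical example, not taken from the dataset.
SELECT
  NET.HOST(url) AS host,                   -- full hostname: www.stackoverflow.com
  NET.REG_DOMAIN(url) AS domain_name,      -- registered domain: stackoverflow.com
  NET.PUBLIC_SUFFIX(url) AS public_suffix  -- public suffix only: com
FROM UNNEST(['https://www.stackoverflow.com/questions']) AS url
```

Grouping on NET.HOST would count hosts such as www.stackoverflow.com and stackoverflow.com as separate entries, whereas NET.REG_DOMAIN folds subdomains into one registered domain and keeps the top-level domain, which matches what the question asks for.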