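I'm using Google BigQuery to pull data from the GDELT database in order to compute the average tone of news items that mention a particular country. I have a working SQL query: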
SELECT date,
Avg(Float(tone)) tone
FROM (
SELECT integer(regexp_replace(String(date), r'\d{6}$', '')) date,
regexp_replace(v2tone, r',.*', '') tone,
FROM [gdelt-bq:gdeltv2.gkg_partitioned]
WHERE _partition_load_time BETWEEN timestamp('2016-07-06') AND timestamp('2016-07-07')
AND (
v2locations LIKE '%Spain%'))
GROUP BY date
ORDER BY date
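But I need to run this query for many different countries, so I thought I might be able to download all the data in a single query, and I think I'm almost there. Here is an example with two countries: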
SELECT date,
country,
Avg(Float(tone)) tone
FROM (
SELECT integer(regexp_replace(String(date), r'\d{6}$', '')) date,
regexp_replace(v2tone, r',.*', '') tone,
regexp_extract(v2locations, r'(Spain|Chile)') country
FROM [gdelt-bq:gdeltv2.gkg_partitioned]
WHERE _partition_load_time BETWEEN timestamp('2016-07-06') AND timestamp('2016-07-07')
AND (
v2locations LIKE '%Spain%'
OR v2locations LIKE '%Chile%'))
GROUP BY date,
country
ORDER BY country,
date
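The problem is that entries mentioning both Chile and Spain need to be averaged into both the Spain group and the Chile group. As the code stands, I get the correct result for Chile because it comes first alphabetically, but the result for Spain is clearly wrong, because entries that mention both countries are all averaged into the Chile group.
My question: how can I put entries whose V2Locations column contains both Spain and Chile into both groups? Is that possible?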
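To make the problem concrete, here is a minimal Python sketch on made-up rows. The location strings and tone values are invented for illustration, and Python's `re.search` stands in for BigQuery's `regexp_extract`; the helper names are hypothetical. It contrasts first-match grouping with the intended grouping, where an entry mentioning both countries contributes to both averages.

```python
import re
from collections import defaultdict

# Hypothetical rows: (V2Locations-like string, tone). All values are invented.
ROWS = [
    ("1#Chile#CI", 2.0),
    ("1#Spain#SP", 4.0),
    ("1#Chile#CI;1#Spain#SP", 6.0),  # mentions both countries
]
COUNTRIES = ["Spain", "Chile"]


def mean_by_group(pairs):
    """Average the tone values for each group key."""
    sums = defaultdict(lambda: [0.0, 0])  # key -> [tone sum, row count]
    for key, tone in pairs:
        sums[key][0] += tone
        sums[key][1] += 1
    return {key: total / n for key, (total, n) in sums.items()}


def first_match_groups(rows):
    """Mimics the regexp_extract query: each row lands in exactly one group."""
    pattern = re.compile("(" + "|".join(COUNTRIES) + ")")
    return mean_by_group(
        (pattern.search(loc).group(1), tone) for loc, tone in rows
    )


def every_match_groups(rows):
    """Intended behaviour: a row counts once for every country it mentions."""
    return mean_by_group(
        (country, tone)
        for loc, tone in rows
        for country in COUNTRIES
        if country in loc
    )


print(first_match_groups(ROWS))
print(every_match_groups(ROWS))
```

With these toy rows the Chile average is the same in both versions, while the Spain average differs because the row mentioning both countries is only counted for Chile in the first-match version.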
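Edit: Although the answers below do answer my question, they also incur a fairly high cost. So what I'm doing now is aggregating locally, which lets me take advantage of the partitioned nature of the GDELT table. That is, I extract the average tone and the number of observations for each distinct intersection of countries. This makes it possible to compute the actual per-country figures locally (rather than on BigQuery). As the amount of data grows, this takes a long time to compute, but it saves a considerable amount of money and allows data for one additional country to be extracted at zero extra cost.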
SELECT date,
concat(IF(regexp_match(country,'Cataluna'),'Cataluna',''),
IF(regexp_match(country,'Chile'),'Chile',''),'') country,
AVG(FLOAT(tone)) Tone,
count(tone) num,
FROM (
SELECT INTEGER(REGEXP_REPLACE(STRING(DATE), r'\d{6}$', '')) date,
REGEXP_REPLACE(V2Tone, r',.*', '') tone,
V2Locations country,
FROM [gdelt-bq:gdeltv2.gkg_partitioned]
WHERE _PARTITION_LOAD_TIME BETWEEN TIMESTAMP('2016-05-01')
AND TIMESTAMP('2016-10-23')
AND (V2Locations like '%Cataluna%'
OR V2Locations like '%Chile%'))
GROUP BY date, country
ORDER BY country, date
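The local step can be sketched in Python as follows. This is only an illustration under assumptions: the result rows are invented, the function name is hypothetical, and it assumes the concatenated label produced by the query above (for example `CatalunaChile`) can be matched by substring, which only works if no country name is contained in another. Each intersection's mean is weighted by its row count, so the combined figure equals the average over all underlying rows for that country.

```python
from collections import defaultdict

# Hypothetical query output: (date, intersection label, mean tone, row count).
RESULT_ROWS = [
    (20160501, "Cataluna", 1.0, 2),
    (20160501, "Chile", 3.0, 1),
    (20160501, "CatalunaChile", -2.0, 1),
]
COUNTRIES = ["Cataluna", "Chile"]


def per_country_means(rows, countries):
    """Combine per-intersection averages into per-country averages.

    Assumes no country name is a substring of another country name,
    because labels are matched with a plain substring test.
    """
    totals = defaultdict(lambda: [0.0, 0])  # (date, country) -> [tone sum, count]
    for date, label, mean_tone, num in rows:
        for country in countries:
            if country in label:
                # mean * count recovers the sum of tones in this intersection
                totals[(date, country)][0] += mean_tone * num
                totals[(date, country)][1] += num
    return {key: tone_sum / count for key, (tone_sum, count) in totals.items()}


print(per_country_means(RESULT_ROWS, COUNTRIES))
```

Adding another country then only means adding its name to the filter and to the label expression in the query; the local combination step stays the same.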
Answer 0 (score: 0)
This is tricky. Here is one approach that uses a JOIN to a derived table to find the matching countries:
SELECT date, country, AVG(FLOAT(tone)) Tone
FROM (SELECT INTEGER(REGEXP_REPLACE(STRING(DATE), r'\d{6}$', '')) AS date,
REGEXP_REPLACE(V2Tone, r',.*', '') AS tone,
c.country AS country
FROM [gdelt-bq:gdeltv2.gkg_partitioned] gkg CROSS JOIN
(SELECT country FROM
(SELECT 'Chile' AS country),
(SELECT 'Spain' AS country) -- in legacy SQL a comma between tables acts as UNION ALL
) c
-- CROSS JOIN takes no ON clause, so the match condition goes in WHERE
WHERE _PARTITION_LOAD_TIME BETWEEN TIMESTAMP('2016-07-06') AND
TIMESTAMP('2016-07-07')
AND gkg.V2Locations LIKE CONCAT('%', c.country, '%')
) x
GROUP BY date, country
ORDER BY country, date
Answer 1 (score: 0)
Try the following (BigQuery legacy SQL mode):