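Aggregating one entry into multiple groups

Time: 2016-08-17 11:55:11

Tags: sql google-bigquery

I am using Google's BigQuery to pull data from the GDELT database, in order to get the average tone of news items that mention a particular country. I have a working SQL query: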

SELECT   date, 
         Avg(Float(tone)) tone 
FROM     ( 
                SELECT integer(regexp_replace(String(date), r'\d{6}$', '')) date, 
                       regexp_replace(v2tone, r',.*', '')                   tone, 
                FROM   [gdelt-bq:gdeltv2.gkg_partitioned] 
                WHERE  _partition_load_time BETWEEN timestamp('2016-07-06') AND    timestamp('2016-07-07') 
                AND    ( 
                              v2locations LIKE '%Spain%')) 
GROUP BY date 
ORDER BY date

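But I need to run this query for many different countries, so I thought I could perhaps download all the data in a single query, and I think I am almost there. Here is an example with two countries: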

SELECT   date, 
         country, 
         Avg(Float(tone)) tone 
FROM     ( 
                SELECT integer(regexp_replace(String(date), r'\d{6}$', '')) date, 
                       regexp_replace(v2tone, r',.*', '')                   tone, 
                       regexp_extract(v2locations, r'(Spain|Chile)')        country 
                FROM   [gdelt-bq:gdeltv2.gkg_partitioned] 
                WHERE  _partition_load_time BETWEEN timestamp('2016-07-06') AND    timestamp('2016-07-07') 
                AND    ( 
                              v2locations LIKE '%Spain%' 
                       OR     v2locations LIKE '%Chile%')) 
GROUP BY date, 
         country 
ORDER BY country, 
         date

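The problem is that an entry mentioning both Chile and Spain needs to be averaged into both the Spain group and the Chile group. As the code stands, I get the correct result for Chile, because it comes first alphabetically, but the result for Spain is clearly wrong, because entries that mention both countries are all averaged into the Chile group only.

My question is: how can I assign entries whose V2Locations column contains both of the words Spain and Chile to both groups? Is that even possible?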
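To make the problem concrete outside BigQuery, here is a small Python sketch. It is only an analogy: Python's `re` module stands in for the query's `regexp_extract`, and the location strings and tone values are made up for illustration. It shows that extracting a single value per row can place a row in only one group, whereas collecting every match gives the row-to-many-groups assignment the question asks for.

```python
import re

# Hypothetical V2Locations-like strings with tone values (not real GDELT data).
rows = [
    ("Santiago, Chile; Madrid, Spain", -1.0),
    ("Madrid, Spain", 3.0),
]
pattern = re.compile(r"(Spain|Chile)")


def first_match_group(locations):
    """Return a single country per row, as a one-value extract would."""
    match = pattern.search(locations)
    return match.group(1) if match else None


def all_match_groups(locations):
    """Return every distinct country mentioned in the row."""
    return sorted(set(pattern.findall(locations)))


single = [first_match_group(loc) for loc, _ in rows]
multi = [all_match_groups(loc) for loc, _ in rows]
print(single)  # ['Chile', 'Spain']
print(multi)   # [['Chile', 'Spain'], ['Spain']]
```

With the single-value extract, the first row never reaches the Spain group, which is exactly why the Spain average comes out wrong.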

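Edit: Although the answers below do answer my question, they also come at a fairly high cost. So what I am doing now is the aggregation locally, so that I can take advantage of the partitioning of the GDELT database. That is, I extract the average tone and the number of observations for each distinct intersection of countries. This lets me compute the actual per-country values locally (rather than on BigQuery). As the amount of data grows this takes a long time to compute, but it saves a considerable amount of money, and it allows data for one additional country to be extracted at zero extra cost.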

SELECT date,  
       concat(IF(regexp_match(country,'Cataluna'),'Cataluna',''),
              IF(regexp_match(country,'Chile'),'Chile',''),'') country, 
       AVG(FLOAT(tone)) Tone, 
       count(tone) num,
FROM (
       SELECT INTEGER(REGEXP_REPLACE(STRING(DATE), r'\d{6}$', '')) date,
              REGEXP_REPLACE(V2Tone, r',.*', '') tone, 
              V2Locations country,
       FROM [gdelt-bq:gdeltv2.gkg_partitioned]
       WHERE _PARTITION_LOAD_TIME BETWEEN TIMESTAMP('2016-05-01')
                                         AND TIMESTAMP('2016-10-23')
       AND (V2Locations like '%Cataluna%'
           OR V2Locations like '%Chile%')) 
GROUP BY date, country
ORDER BY country, date
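The local step described in the edit can be sketched roughly as follows in Python. This is a minimal sketch under assumptions, not the code actually used: the function name `per_country_tone` and the sample rows are hypothetical, and the input is assumed to have the shape produced by the query above, one row per (date, country combination) with the mean tone and the observation count. Each country's average is rebuilt as a count-weighted mean over every combination that contains it.

```python
from collections import defaultdict


def per_country_tone(rows, countries):
    """Recombine per-intersection averages into per-country averages.

    rows: iterable of (date, combo, mean_tone, num), where combo is the
    concatenated country label from the query, e.g. "CatalunaChile".
    Assumes no country name is a substring of another country name.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for date, combo, mean_tone, num in rows:
        for country in countries:
            if country in combo:
                # mean * count recovers the sum of tones in this group
                sums[(date, country)] += mean_tone * num
                counts[(date, country)] += num
    return {key: sums[key] / counts[key] for key in sums}


# Made-up example rows for a single day.
rows = [
    (20160706, "Chile", 1.0, 2),
    (20160706, "CatalunaChile", 4.0, 1),
    (20160706, "Cataluna", -2.0, 3),
]
result = per_country_tone(rows, ["Cataluna", "Chile"])
print(result[(20160706, "Chile")])     # 2.0
print(result[(20160706, "Cataluna")])  # -0.5
```

Weighting by `num` is what makes this equivalent to averaging the original rows; a plain mean of the group means would give the wrong result whenever the groups differ in size.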

2 answers:

Answer 0: (score: 0)

This is tricky. Here is one approach, which uses a JOIN to a derived table to find the countries that each row matches:

SELECT date, country, AVG(FLOAT(tone)) Tone
FROM (SELECT INTEGER(REGEXP_REPLACE(STRING(DATE), r'\d{6}$', '')) as date,
             REGEXP_REPLACE(V2Tone, r',.*', '') tone,
             c.country as country
      FROM [gdelt-bq:gdeltv2.gkg_partitioned] gkg CROSS JOIN
           -- legacy SQL unions comma-separated subqueries; it has no UNION ALL keyword
           (SELECT country FROM
              (SELECT 'Chile' as country),
              (SELECT 'Spain' as country)
           ) c
      -- a CROSS JOIN takes no ON clause, so the match condition goes in WHERE
      WHERE _PARTITION_LOAD_TIME BETWEEN TIMESTAMP('2016-07-06') AND
            TIMESTAMP('2016-07-07')
        AND gkg.V2Locations LIKE CONCAT('%', c.country, '%')
     ) x
GROUP BY date, country 
ORDER BY country, date
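The idea behind this join can be illustrated with a small Python sketch, using made-up rows rather than real GDELT data. Pairing every row with every country and keeping only the pairs where the country name occurs in the location string means a row that mentions both countries appears once per country, so it contributes to both averages.

```python
from collections import defaultdict

countries = ["Chile", "Spain"]

# Hypothetical (date, locations, tone) rows, for illustration only.
rows = [
    (20160706, "Santiago, Chile; Madrid, Spain", -1.0),
    (20160706, "Madrid, Spain", 3.0),
    (20160706, "Valparaiso, Chile", 2.0),
]

# Cross join rows with countries, then filter on a substring match.
pairs = [
    (date, country, tone)
    for date, locations, tone in rows
    for country in countries
    if country in locations
]

# Group by (date, country) and average the tone.
groups = defaultdict(list)
for date, country, tone in pairs:
    groups[(date, country)].append(tone)
averages = {key: sum(tones) / len(tones) for key, tones in groups.items()}

print(len(pairs))                     # 4
print(averages[(20160706, "Chile")])  # 0.5
print(averages[(20160706, "Spain")])  # 1.0
```

The three input rows become four pairs because the first row matches both countries. That duplication is the point of the approach, and it is also why the amount of intermediate data grows with the number of countries.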

Answer 1: (score: 0)

Try the following (BigQuery Legacy SQL mode):

class InSocket: NSObject, GCDAsyncUdpSocketDelegate {
    let IP = "255.255.255.255"
    let PORT:UInt16 = 3520

    var socket:GCDAsyncUdpSocket!

    override init(){
        super.init()
        setupConnection()
    }

    func setupConnection(){
        let error : NSError?
        socket = GCDAsyncUdpSocket(delegate: self, delegateQueue: dispatch_get_main_queue())
        do {
            try socket.connectToHost(IP, onPort: PORT)
            try socket.beginReceiving()
        }catch let cError as NSCocoaError{
            print(cError)
        } catch {
            print(error)
        }
    }

    func udpSocket(sock: GCDAsyncUdpSocket, didReceiveData data: NSData, fromAddress address: NSData, withFilterContext filterContext: AnyObject?) {
        print("incoming message: \(data)");
    }
}