Simulating a FULL JOIN in MySQL with large data sets

Time: 2011-11-29 06:47:15

Tags: mysql join union union-all

I have three tables whose data I need to join on common fields.

Example pseudo table defs:

barometer_log (device int, pressure float, sampleTime timestamp)

temperature_log (device int,temperature float,sampleTime timestamp)

magnitude_log (device int,magnitude float,utcTime timestamp)

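Each table will eventually contain billions of rows, but at the moment each contains about 500,000 rows.

I need to be able to combine the data from the tables (FULL JOIN) so that sampleTime is merged into a single column (COALESCE), giving me rows like: device, sampleTime, pressure, temperature, magnitude

I need to be able to query the data by specifying a device and a start and end date, e.g. select .... where device = 1000 and sampleTime between '2011-10-11' and '2011-10-17'

I have tried different UNION ALL techniques with RIGHT and LEFT joins, as suggested in "MySql full join (union) and ordering on multiple date columns", but the queries take too long: I either have to stop them after they have run for hours, or they throw an error about the temporary file size. What is the best way to query the three tables and merge the output within an acceptable time frame?

Here are the full table definitions, as suggested. Note: the device table is not included yet.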

magnitude_log

CREATE TABLE magnitude_log (
  device int(11) NOT NULL,
  magnitude float not NULL,
  sampleTime timestamp NOT NULL,  
  PRIMARY KEY  (device,sampleTime),
  CONSTRAINT magnitudeLog_device 
    FOREIGN KEY (device) 
      REFERENCES device (id) 
      ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

barometer_log

CREATE TABLE barometer_log (
  device int(11) NOT NULL,
  pressure float not NULL,  
  sampleTime timestamp NOT NULL,  
  PRIMARY KEY  (device,sampleTime),
  CONSTRAINT barometerLog_device 
    FOREIGN KEY (device) 
      REFERENCES device (id) 
      ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

temperature_log

CREATE TABLE temperature_log (
  device int(11) NOT NULL,
  sampleTime timestamp NOT NULL,  
  temperature float default NULL,
  PRIMARY KEY  (device,sampleTime),
  CONSTRAINT temperatureLog_device 
    FOREIGN KEY (device) 
      REFERENCES device (id) 
      ON DELETE CASCADE
)  ENGINE=InnoDB DEFAULT CHARSET=utf8;

3 answers:

Answer 0 (score: 1)

First, get all combinations of (device, sampleTime) that occur in any of the 3 tables within the desired time period:

-------- Q --------
    SELECT device, sampleTime
    FROM magnitude_log
    WHERE device = 1000
      AND sampleTime >= '2011-10-11' 
      AND sampleTime <  '2011-10-18'
UNION
    SELECT device, sampleTime
    FROM barometer_log
    WHERE device = 1000
      AND sampleTime >= '2011-10-11' 
      AND sampleTime <  '2011-10-18'
UNION
    SELECT device, sampleTime
    FROM temperature_log
    WHERE device = 1000
      AND sampleTime >= '2011-10-11' 
      AND sampleTime <  '2011-10-18'

Then use that result to LEFT JOIN the 3 tables:

SELECT
    q.device
  , q.sampleTime
  , b.pressure
  , t.temperature
  , m.magnitude
FROM 
    ( Q ) AS q
  LEFT JOIN
    ( SELECT * 
      FROM magnitude_log 
      WHERE device = 1000
        AND sampleTime >= '2011-10-11' 
        AND sampleTime <  '2011-10-18'
    ) AS m
      ON (m.device, m.sampleTime) = (q.device, q.sampleTime)
  LEFT JOIN
    ( SELECT * 
      FROM barometer_log 
      WHERE device = 1000
        AND sampleTime >= '2011-10-11' 
        AND sampleTime <  '2011-10-18'
    ) AS b
      ON (b.device, b.sampleTime) = (q.device, q.sampleTime)
  LEFT JOIN
    ( SELECT * 
      FROM temperature_log 
      WHERE device = 1000
        AND sampleTime >= '2011-10-11' 
        AND sampleTime <  '2011-10-18'
    ) AS t
      ON (t.device, t.sampleTime) = (q.device, q.sampleTime)

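The longer the time period, the longer the query will struggle with the UNION subquery. You may consider keeping Q as a separate table, possibly populated via triggers with the unique (device, sampleTime) combinations from the other three tables.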
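
A minimal sketch of that separate-table idea, under assumptions: the table name sample_key and the trigger name are invented here for illustration and do not come from the question. Only the magnitude_log trigger and backfill are shown; barometer_log and temperature_log would need the same pair. Deletes from the log tables are not propagated in this sketch.

```sql
-- Hypothetical key table holding every (device, sampleTime) seen in any log table.
CREATE TABLE sample_key (
  device int(11) NOT NULL,
  sampleTime timestamp NOT NULL,
  PRIMARY KEY (device, sampleTime)
) ENGINE=InnoDB;

-- One-off backfill from existing rows; INSERT IGNORE skips keys already present.
INSERT IGNORE INTO sample_key (device, sampleTime)
  SELECT device, sampleTime FROM magnitude_log;

-- Keep the key table current on new inserts (hypothetical trigger name).
CREATE TRIGGER magnitude_log_after_insert
AFTER INSERT ON magnitude_log
FOR EACH ROW
  INSERT IGNORE INTO sample_key (device, sampleTime)
  VALUES (NEW.device, NEW.sampleTime);

-- The key table then replaces the UNION subquery as the driving table.
SELECT
    k.device
  , k.sampleTime
  , b.pressure
  , t.temperature
  , m.magnitude
FROM sample_key AS k
  LEFT JOIN magnitude_log AS m
    ON m.device = k.device AND m.sampleTime = k.sampleTime
  LEFT JOIN barometer_log AS b
    ON b.device = k.device AND b.sampleTime = k.sampleTime
  LEFT JOIN temperature_log AS t
    ON t.device = k.device AND t.sampleTime = k.sampleTime
WHERE k.device = 1000
  AND k.sampleTime >= '2011-10-11'
  AND k.sampleTime <  '2011-10-18';
```

The trade-off is extra write work on every insert in exchange for a read query that can walk a single primary key range instead of de-duplicating three result sets.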

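Answer 1 (score: 0)

Assuming the device table contains all devices, you don't really need a true full join; you just need to join the other tables on device and group by sample time, like this: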

SELECT
    d.id AS device,
    COALESCE(m.sampleTime, b.sampleTime, t.sampleTime) AS sampleTime,
    m.magnitude,
    b.pressure,
    t.temperature
FROM device AS d
    LEFT JOIN magnitude_log AS m ON d.id = m.device
    LEFT JOIN barometer_log AS b ON d.id = b.device
    LEFT JOIN temperature_log AS t ON d.id = t.device
WHERE d.id = 1000
GROUP BY device, sampleTime
HAVING sampleTime BETWEEN '2011-10-11' AND '2011-10-17'

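However, this may be slow because it groups before actually filtering on the time span; if a single device does not have millions of rows on its own, that should not be a problem. If it does, I suggest putting the sampleTime condition on each join: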

SELECT
    d.id AS device,
    COALESCE(m.sampleTime, b.sampleTime, t.sampleTime) AS sampleTime,
    m.magnitude,
    b.pressure,
    t.temperature
FROM device AS d
    LEFT JOIN magnitude_log AS m ON d.id = m.device AND m.sampleTime BETWEEN '2011-10-11' AND '2011-10-17'
    LEFT JOIN barometer_log AS b ON d.id = b.device AND b.sampleTime BETWEEN '2011-10-11' AND '2011-10-17'
    LEFT JOIN temperature_log AS t ON d.id = t.device AND t.sampleTime BETWEEN '2011-10-11' AND '2011-10-17'
WHERE d.id = 1000
GROUP BY device, sampleTime
HAVING sampleTime IS NOT NULL

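Hope that helps!

Answer 2 (score: 0)

If you are going to query a small time range across a large number of devices, you may want to consider reversing the PK index so that it becomes (timeRange, device).

You may then also need a secondary index on device or on (device, timeRange).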
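
A rough sketch of that index change for one of the tables, assuming that timeRange refers to the sampleTime column; the index name idx_device_time is made up for illustration. The secondary index is added first so that the foreign key on device still has an index starting with device once the primary key is reordered.

```sql
-- Hypothetical index name; barometer_log and temperature_log would need the same change.
-- Step 1: secondary index led by device, to keep supporting the foreign key
-- and per-device range queries.
ALTER TABLE magnitude_log
  ADD INDEX idx_device_time (device, sampleTime);

-- Step 2: reorder the primary key so that time comes first.
ALTER TABLE magnitude_log
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (sampleTime, device);
```

Because InnoDB stores rows in primary key order, the second statement rebuilds the whole table, so it is worth trying on a copy before running it against a large table.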