Question

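I normally work with MySQL databases, and I'm currently having some trouble querying a SQL Server database.

I'm trying to get the average of a column, grouped by day. This takes 20-30 seconds, even though it returns only a few hundred rows.

The table, however, contains several million entries. I'm sure this has something to do with the indexing, but I can't seem to find the right solution here.

So the query is as follows: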

select 
    [unit_id], 
    avg(weight) AS avg, 
    max(timestamp) AS dateDay 
from 
    [measurements] 
where 
    timestamp BETWEEN '2017-06-01' AND '2017-10-04' 
group by 
    [unit_id], CAST(timestamp AS DATE) 
order by 
    [unit_id] asc, [dateDay] asc

I have set up a nonclustered index that contains the unit_id, weight and timestamp fields.

Answer 1

This is your query:

select unit_id, avg(weight) AS avg, max(timestamp) AS dateDay
from measurements m
where timestamp BETWEEN '2017-06-01' AND '2017-10-04'
group by unit_id, CAST(timestamp AS DATE)
order by unit_id asc, dateDay asc;

Under reasonable assumptions about your data, it will have similar performance in MySQL or SQL Server. Your WHERE is not highly selective. Because the filter is a range (inequality) condition, SQL Server cannot make use of an index for the GROUP BY.

An index on measurements(timestamp, unit_id, weight) could benefit the query on either database. There might be some fancy ways to get SQL Server to improve the performance. But both it and MySQL will need to take the rows that match the WHERE clause and aggregate them (using a hash-based algorithm in SQL Server and probably a filesort in MySQL).
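
As a rough sketch, the index on measurements(timestamp, unit_id, weight) mentioned in this answer could be created as shown below. The index name is made up for illustration, and whether the optimizer actually picks this index depends on the data and on the indexes that already exist.

```sql
-- Hypothetical index name; the column order follows the answer's suggestion.
-- Leading with [timestamp] lets the range filter in the WHERE clause seek
-- into the index, and having unit_id and weight in the same index means the
-- query can be answered without touching the base table.
CREATE NONCLUSTERED INDEX IX_measurements_timestamp_unit_weight
    ON measurements ([timestamp], unit_id, weight);
```

This differs from the index described in the question mainly in column order: an index that leads with unit_id cannot be used to seek directly on the timestamp range.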

Answer 2

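The problem is probably the CAST in the GROUP BY. Although you don't say so explicitly, I'm assuming timestamp is a DateTime value, which is why you CAST it to Date in the GROUP BY clause. The problem is that the computed value produced by the CAST is not indexed.

If it's your system and this query is run frequently, I would add a new column of type Date to store just the date, and index that. If you can't, select the values in the date range you're interested in, cast the timestamp to a Date, put the result into a temp table or CTE, and then group by the date.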
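
One way to sketch the "extra Date column plus index" idea is a persisted computed column, which SQL Server keeps in sync automatically. This is a variation on the plain column the answer describes, not necessarily what the author had in mind. It assumes timestamp is datetime or datetime2; the names measurement_date, IX_measurements_date_unit and avg_weight are hypothetical.

```sql
-- Assumption: [timestamp] is datetime or datetime2, so the CAST is
-- deterministic and the computed column can be persisted and indexed.
ALTER TABLE measurements
    ADD measurement_date AS CAST([timestamp] AS date) PERSISTED;
GO  -- batch separator understood by SSMS and sqlcmd, not a T-SQL statement

-- Hypothetical index name; weight is included so the query is covered.
CREATE NONCLUSTERED INDEX IX_measurements_date_unit
    ON measurements (measurement_date, unit_id)
    INCLUDE (weight);
GO

-- Filter and group on the indexed date column instead of the CAST.
-- Note: this range includes all of 2017-10-04, whereas the original
-- BETWEEN on [timestamp] stops at 2017-10-04 00:00:00.
SELECT unit_id, AVG(weight) AS avg_weight, measurement_date AS dateDay
FROM measurements
WHERE measurement_date BETWEEN '2017-06-01' AND '2017-10-04'
GROUP BY unit_id, measurement_date
ORDER BY unit_id, measurement_date;
```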

Or, as a first attempt, simply pull the CAST out of the GROUP BY clause:

select 
    [unit_id], 
    avg(weight) AS avg, 
    dateDay 
from (
    select  [unit_id], 
            CAST(timestamp as Date) [dateDay],
            weight
        from [measurements] 
        where 
            timestamp BETWEEN '2017-06-01' AND '2017-10-04' 
    ) x
group by 
    x.[unit_id], x.[dateDay]
order by 
    x.[unit_id] asc, x.[dateDay] asc
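
The temp-table option mentioned earlier in this answer could look roughly like the following. The table name #daily and the alias avg_weight are made up for illustration, and whether materializing the rows first is faster than the derived-table query would need to be measured on the real data.

```sql
-- Step 1: copy only the rows in the date range, with the timestamp
-- already reduced to a date, into a temporary table (hypothetical name).
SELECT unit_id,
       CAST([timestamp] AS date) AS dateDay,
       weight
INTO #daily
FROM measurements
WHERE [timestamp] BETWEEN '2017-06-01' AND '2017-10-04';

-- Step 2: aggregate the much smaller temporary table by unit and day.
SELECT unit_id, AVG(weight) AS avg_weight, dateDay
FROM #daily
GROUP BY unit_id, dateDay
ORDER BY unit_id, dateDay;

-- Step 3: clean up.
DROP TABLE #daily;
```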

SQL Server: slow query to get an average
