I have a matrix A of shape [X,128] and a vector B of shape [128,1].
I want to compute the product A.B, which produces an output of shape [X,1].
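For concreteness, the unbatched operation is just a single matrix-vector product; a minimal NumPy sketch (with X arbitrarily set to 4 here, purely for illustration) looks like this:

import numpy as np

X = 4  # placeholder row count for illustration
A = np.random.rand(X, 128).astype(np.float32)
B = np.random.rand(128, 1).astype(np.float32)
result = A.dot(B)        # matrix-vector product
print(result.shape)      # (4, 1), i.e. [X, 1]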
If I had enough GPU memory this would be simple and fast, but I don't have enough, so I want to feed the data in batches while keeping the total computation time as low as possible.
I am not sure how to send batches of data from RAM to the GPU so that they get processed as quickly as possible.
Here's the TensorFlow code I wrote, which is very slow:
import time
import numpy as np
import tensorflow as tf
NUMBER_OF_SAMPLES = 1097152
NUMBER_OF_DIMENSIONS = 256
print("Initializing Variables...")
Database_size_in_bytes = int(NUMBER_OF_SAMPLES * NUMBER_OF_DIMENSIONS * 4)  # float32 => 4 bytes per value
GPU_RAM_SIZE = 8192 * 1024 * 1024  # 8192 MB of GPU RAM
NUMBER_OF_SAMPLES_GPU_CAN_HANDLE = Database_size_in_bytes // GPU_RAM_SIZE  # 0 means the data fits in GPU memory
print("Generating the Data...")
A_Placeholder = tf.placeholder(tf.float32, shape=[None, NUMBER_OF_DIMENSIONS])
B_Constant = tf.constant(np.random.rand(NUMBER_OF_DIMENSIONS, 1).astype(np.float32))
A_Data = np.random.random_sample((NUMBER_OF_SAMPLES, NUMBER_OF_DIMENSIONS)).astype(np.float32)
B_Data = np.random.rand(NUMBER_OF_DIMENSIONS, 1).astype(np.float32)
multiplication_op = tf.matmul(A_Placeholder, B_Constant)
sess = tf.Session()
print("Multiplicating...")
if NUMBER_OF_SAMPLES_GPU_CAN_HANDLE > 0:
print("Not Enough GPU Memory...")
A_Data_Splits = np.split(A_Data, NUMBER_OF_SAMPLES_GPU_CAN_HANDLE)
times = []
for i in range(0, 100):
for j, _ in enumerate(A_Data_Splits):
start_time = time.time()
output = sess.run(multiplication_op, feed_dict={A_Placeholder: A_Data_Splits[j]})
# print(output[0])
times.append(time.time() - start_time)
else:
print("Enough GPU Memory... Initializing Variables before use...")
sess.run(tf.global_variables_initializer(), feed_dict={A_Placeholder: A_Data})
times = []
for i in range(0, 100):
start_time = time.time()
output = sess.run(multiplication_op, feed_dict={A_Placeholder: A_Data})
# print(output[0])
times.append(time.time() - start_time)
print("Average Time => %s" % np.mean(times))
sess.close()
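One direction that might address the RAM-to-GPU streaming is a tf.data input pipeline that batches A and prefetches the next batch while the GPU multiplies the current one. Below is a minimal sketch of that idea, assuming the TF 1.x tf.data API; BATCH_SIZE is a placeholder value, not a tuned number:

import numpy as np
import tensorflow as tf

NUMBER_OF_SAMPLES = 1097152
NUMBER_OF_DIMENSIONS = 256
BATCH_SIZE = 65536  # placeholder; pick the largest batch that comfortably fits in GPU memory

A_Data = np.random.random_sample((NUMBER_OF_SAMPLES, NUMBER_OF_DIMENSIONS)).astype(np.float32)
B_Constant = tf.constant(np.random.rand(NUMBER_OF_DIMENSIONS, 1).astype(np.float32))

# Dataset over the rows of A: batch it and prefetch one batch ahead so the
# host-to-device copy of the next batch overlaps with the current matmul.
# (For arrays larger than ~2 GB, A would have to be fed through a
# placeholder-backed dataset instead of being embedded as a constant.)
dataset = tf.data.Dataset.from_tensor_slices(A_Data)
dataset = dataset.batch(BATCH_SIZE).prefetch(1)
next_batch = dataset.make_one_shot_iterator().get_next()

multiplication_op = tf.matmul(next_batch, B_Constant)

with tf.Session() as sess:
    partial_results = []
    while True:
        try:
            partial_results.append(sess.run(multiplication_op))
        except tf.errors.OutOfRangeError:
            break  # iterator exhausted: all batches processed
    output = np.concatenate(partial_results, axis=0)  # shape [NUMBER_OF_SAMPLES, 1]

The prefetch(1) call is what should overlap the host-to-device copy with the compute; whether that actually closes the gap here is exactly what I would like to find out.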
If there are other fast alternatives for batched dot products, I would like to know about them.
Thanks.