Minimal example

import numpy as np
import tensorflow as tf

# Create the constant matrix A1 (A is assumed to be a 10000 x 10000
# symmetric positive-definite NumPy array, defined elsewhere)
A1 = tf.convert_to_tensor(A, dtype=tf.float64)

# Compute the Cholesky decomposition
sqrtA1 = tf.linalg.cholesky(A1)

xi = tf.placeholder(tf.float64, shape=[10000])

# Matrix multiplication with the Cholesky factor
RessqrtA1 = tf.tensordot(sqrtA1, xi, [[1],[0]])

# Regular matrix multiplication
ResA1 = tf.tensordot(A1, xi, [[1],[0]])
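For reference, contracting axis 1 of a matrix with axis 0 of a vector, which is what the axes argument [[1],[0]] asks tensordot to do, is an ordinary matrix-vector product. A small NumPy check of that equivalence, using a hypothetical 4 x 4 matrix instead of the 10000 x 10000 one from the question:

```python
import numpy as np

rng = np.random.RandomState(0)
mat = rng.randn(4, 4)  # hypothetical small matrix standing in for A1 / sqrtA1
vec = rng.randn(4)     # hypothetical vector standing in for xi

# Contract axis 1 of mat with axis 0 of vec
contracted = np.tensordot(mat, vec, axes=[[1], [0]])

# Plain matrix-vector product for comparison
matvec = mat.dot(vec)

print(np.allclose(contracted, matvec))
```

So both ResA1 and RessqrtA1 are single matrix-vector products, and any large difference in their run times has to come from somewhere other than the multiplication itself.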

The computation is run as follows:

with tf.Session() as sess:
    _ = sess.run(ResA1, feed_dict={xi: np.random.randn(10000)})
    _ = sess.run(ResA1, feed_dict={xi: np.random.randn(10000)})

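One would assume that the first sess.run() takes longer, because the graph has to be built and optimized. That is what happens, and these are the expected timings: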

normal matmul
Wall time: 36.1 s -- first run

41.1 ms ± 2.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) -- second run with timeit

Long computation time even though the value is constant at runtime

Now, when multiplying by sqrtA1, one might expect the same behavior, but unfortunately that is not the case.
Running the following:

with tf.Session() as sess:
    _ = sess.run(RessqrtA1, feed_dict={xi: np.random.randn(10000)})
    _ = sess.run(RessqrtA1, feed_dict={xi: np.random.randn(10000)})

Here are the timings:

cholesky
Wall time: 2min 38s
24.7 s ± 750 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I ran the TensorFlow Profiler and obtained the following information:

Profile:
node name | requested bytes | total execution time | accelerator execution time | cpu execution time
Cholesky                     800.00MB (100.00%, 50.00%),     24.30sec (100.00%, 99.81%),             0us (0.00%, 0.00%),     24.30sec (100.00%, 99.81%)
MatMul                          80.00KB (50.00%, 0.00%),         46.45ms (0.19%, 0.19%),             0us (0.00%, 0.00%),         46.45ms (0.19%, 0.19%)
Reshape                               0B (0.00%, 0.00%),            15us (0.00%, 0.00%),             0us (0.00%, 0.00%),            15us (0.00%, 0.00%)
Const                         800.00MB (50.00%, 50.00%),            10us (0.00%, 0.00%),             0us (0.00%, 0.00%),            10us (0.00%, 0.00%)

Difference between TF 1.7 and TF 1.6

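Now the interesting part: this only happens with version 1.7.0 or later. I tried it with version 1.6.0, and there the matmul with the Cholesky matrix takes no longer than the normal matmul: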

Wall time: 1min 18s -- first run
49.7 ms ± 1.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) -- timeit on the second sess.run()

Looking at the computation in version 1.6 with the profiler, you get:

Profile:
node name | requested bytes | total execution time | accelerator execution time | cpu execution time
Cholesky                     800.00MB (100.00%, 100.00%),     891us (100.00%, 100.00%),             0us (0.00%, 0.00%),     891us (100.00%, 100.00%)
======================End of Report==========================

A small change to this code gives the desired result again: define the product as RessqrtA1 = tf.tensordot(chol, xi, [[1],[0]]), where chol is another placeholder, compute the Cholesky factor once, and feed it in on every run:

%time temp = sess.run(sqrtA1)
%timeit _ = sess.run(RessqrtA1, feed_dict={xi: np.random.randn(10000), chol: temp})

cholesky
Wall time: 1min 40s
42.6 ms ± 3.86 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

My thoughts:

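Since version 1.7, the Cholesky decomposition is recomputed on every session run, even though its result is the same for every run, so computing it once and reusing the result would be enough.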

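I think that in version 1.6 this computation happened while the graph was being built. I measured that computing the Cholesky decomposition takes 26 seconds in v1.6 (using a placeholder, so that it has to be computed at runtime), yet in the profiled run it took only about 900 us, as shown above.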

In v1.7 this behavior was somehow changed.
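The idea of computing the factor once and reusing it could also be sketched with a non-trainable variable instead of an extra placeholder: the Cholesky op then sits only in the variable's initializer, which is run once per session. This is a minimal sketch under assumptions, not a confirmed fix for the 1.7 behavior. The names tf1 and chol_cached, the small size n = 50, and the randomly generated matrix A are all hypothetical, and the variable holds an additional copy of the factor in memory.

```python
import numpy as np
import tensorflow as tf

# Graph-mode (TF 1.x style) API; on newer releases it lives under tf.compat.v1.
tf1 = tf.compat.v1 if hasattr(tf.compat, "v1") else tf

n = 50  # hypothetical small size; the question uses 10000
rng = np.random.RandomState(0)
M = rng.randn(n, n)
A = M.dot(M.T) + n * np.eye(n)  # symmetric positive definite by construction

graph = tf.Graph()
with graph.as_default():
    A1 = tf.convert_to_tensor(A, dtype=tf.float64)

    # The Cholesky op is only part of the variable's initializer,
    # so it is evaluated when the initializer runs, not on every sess.run().
    chol_cached = tf1.Variable(tf.linalg.cholesky(A1), trainable=False)

    xi = tf1.placeholder(tf.float64, shape=[n])
    RessqrtA1 = tf.tensordot(chol_cached, xi, [[1], [0]])

    with tf1.Session() as sess:
        sess.run(chol_cached.initializer)  # one-time factorization

        x = rng.randn(n)
        # Later runs read the stored factor and only do the matrix-vector product.
        result = sess.run(RessqrtA1, feed_dict={xi: x})
```

Compared with feeding the factor through a placeholder, this keeps the factor inside the session rather than passing it through feed_dict on every call.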

Question

Do you think this change in behavior is normal or intended? If so, what is the best approach when you run into a situation like this?

I am grateful for any information. Maybe others have noticed related issues when using the linear algebra classes.

Things I tried so far:

Adjusted the floating-point type (i.e. tf.float32 instead of tf.float64)

Used different versions in separate environments

Used the profiler to investigate the timing of the different operations

Ran it on another computer

Timed it from Python (e.g. with timeit or pycallgraph)

Reusing constant values resulting from a computation in TensorFlow

0 answers: