Question

I followed the instructions at https://medium.com/towards-data-science/howto-profile-tensorflow-1a49fb18073d to profile TensorFlow.

Here is the test code:

import tensorflow as tf
import numpy as np
import time
from tensorflow.python.client import timeline
import json

W=3000
H=4000

in_a = tf.placeholder(tf.float32,(W,H))
in_b = tf.placeholder(tf.float32,(W,H))

def test_sub(number):
    sess=tf.Session()
    options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()
    many_runs_timeline=TimeLiner()

    out = tf.subtract(in_a,in_b)
    a=np.random.rand(W,H)
    b=np.random.rand(W,H)

    for i in range(number):
        feed_dict = {in_a:a,
                 in_b:b}
        t0=time.time()

        out_ = sess.run(out,feed_dict=feed_dict,options=options,run_metadata=run_metadata)

        fetched_timeline = timeline.Timeline(run_metadata.step_stats)
        chrome_trace = fetched_timeline.generate_chrome_trace_format()
        many_runs_timeline.update_timeline(chrome_trace)

        t_=(time.time()-t0) * 1000
        print "index:",str(i), " total time:",str(t_)," ms"
    many_runs_timeline.save("timeline_merged_run_test.json")

class TimeLiner:
    _timeline_dict = None

    def update_timeline(self, chrome_trace):
        # convert Chrome trace to Python dict
        chrome_trace_dict = json.loads(chrome_trace)

        # for first run store full trace
        if self._timeline_dict is None:
            self._timeline_dict = chrome_trace_dict

        # for later runs, append only the timing events, not the definitions
        else:
            for event in chrome_trace_dict['traceEvents']:
                # timing events carry a 'ts' (timestamp) field
                if 'ts' in event:
                    self._timeline_dict['traceEvents'].append(event)

    def save(self, f_name):     
        with open(f_name, 'w') as f:
            json.dump(self._timeline_dict, f)

test_sub(20)

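Output of the code:
index: 0  total time: 338.145017624  ms
index: 1  total time: 137.024879456  ms
index: 2  total time: 132.538080215  ms
index: 3  total time: 133.152961731  ms
index: 4  total time: 132.885932922  ms
index: 5  total time: 135.06102562  ms
index: 6  total time: 136.723041534  ms
index: 7  total time: 137.926101685  ms
index: 8  total time: 133.605003357  ms
index: 9  total time: 133.143901825  ms
index: 10  total time: 136.317968369  ms
index: 11  total time: 137.830018997  ms
index: 12  total time: 135.458946228  ms
index: 13  total time: 132.793903351  ms
index: 14  total time: 144.603967667  ms
index: 15  total time: 134.593963623  ms
index: 16  total time: 135.535001755  ms
index: 17  total time: 133.697032928  ms
index: 18  total time: 136.134147644  ms
index: 19  total time: 133.810043335  ms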

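The image below shows the profiling result: Profile Result

My questions are:
1. What is the difference between /gpu:0/stream:31 (at the top of the profile result) and /job:localhost/replica:0/task:0/gpu:0 (at the bottom of the profile result), and which one is the execution time of the TensorFlow op? The author (of the article linked above) indicates that the /job:localhost part is the profiled time of the op, but the profile result shows that /gpu:0/stream:31 takes longer.
2. The output shows that each session.run() takes about 140 ms, while the profile result shows it takes only about 20 ms. There is also a large time gap between two consecutive session.run() calls. What is the system doing during that gap?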

Answer 1

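To answer your second question: you can find out what happens in that interval with the help of Intel VTune Amplifier (the tool is not free, but free fully functional academic and trial versions are available). Use the recipe from this article to import the timeline data into Intel VTune Amplifier and analyze it there. You will need the Frame Domain / Source Function grouping. Expand the [No frame domain - Outside any frame] row and you will get the list of hotspots that occur in the interval you are interested in.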
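
As a rough first check before importing the trace into VTune, the gaps can be measured directly from the merged JSON file. The sketch below assumes the file follows the Chrome trace event layout, where complete events have "ph": "X", a start time "ts" and a duration "dur", both in microseconds. The names find_idle_gaps and min_gap_us are made up for this illustration, and the tiny trace is hand-written, not real profiler output. It only tells you how long the gaps are, not what the system was doing in them.

```python
import json


def find_idle_gaps(trace, min_gap_us=1000):
    """Return (start, end) pairs, in microseconds, where no traced op is running.

    Assumes `trace` is a dict in Chrome trace event format.
    """
    # Complete events ("ph": "X") carry a start timestamp and a duration.
    spans = sorted(
        (e["ts"], e["ts"] + e.get("dur", 0))
        for e in trace["traceEvents"]
        if e.get("ph") == "X" and "ts" in e
    )
    gaps = []
    busy_until = None
    for start, end in spans:
        # A gap exists when the next event starts well after everything
        # seen so far has finished.
        if busy_until is not None and start - busy_until >= min_gap_us:
            gaps.append((busy_until, start))
        busy_until = end if busy_until is None else max(busy_until, end)
    return gaps


# Hand-made example: two 20 ms ops that start 140 ms apart.
example = {"traceEvents": [
    {"ph": "M", "name": "process_name", "pid": 0},  # metadata, ignored
    {"ph": "X", "name": "Sub", "ts": 0, "dur": 20000},
    {"ph": "X", "name": "Sub", "ts": 140000, "dur": 20000},
]}
print(find_idle_gaps(example))

# For the real file, load it first (not run here):
#   with open("timeline_merged_run_test.json") as f:
#       print(find_idle_gaps(json.load(f)))
```

Note that the timing loop in the question also wraps the Timeline generation and the JSON parsing, so the printed wall-clock time covers more than session.run() itself.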

Can anyone explain the TensorFlow profiling result?
