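Python: a simple script for parsing metrics data

Time: 2012-07-16 14:54:49

Tags: python

I need to modify a small Python script because the format of the metrics file has changed slightly. I don't know Python at all and have tried to fix it myself. The changes make sense to me, but apparently there is still one problem with the script. Otherwise, everything else works. This is what the script looks like: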

import sys
import datetime

##########################################################################

now = datetime.datetime.now();
logFile = now.strftime("%Y%m%d")+'.QE-Metric.log';

underlyingParse = True;
strParse = "UNDERLYING_TICK";
if (len(sys.argv) == 2):
    if sys.argv[1] == '2':
        strParse = "ORDER_SHOOT";
        underlyingParse = False;
elif (len(sys.argv) == 3):
    logFile = sys.argv[2];
    if sys.argv[1] == '2':
        strParse = "ORDER_SHOOT";
        underlyingParse = False;
else:
    print 'Incorrect number of arguments. Usage: <exec> <mode (1) Underlying (2) OrderShoot> <FileName (optional)>'
    sys.exit()

##########################################################################

# Read the deployment file
FIput = open(logFile, 'r');
FOput = open('ParsedMetrics.txt', 'w');

##########################################################################

def ParseMetrics( file_lines ):

    ii = 0
    tokens = []; 
    for ii in range(len(file_lines)):

        line = file_lines[ii].strip()

        if (line.find(strParse) != -1):

             tokens = line.split(",");
             currentTime = float(tokens[2])

             if (underlyingParse == True and ii != 0):
                 newIndex = ii-1
                 prevLine = file_lines[newIndex].strip()
                 while (prevLine.find("ORDER_SHOOT") != -1 and newIndex > -1):
                     newIndex -= 1;
                     tokens = prevLine.split(",");
                     currentTime -= float(tokens[2]);
                     prevLine = file_lines[newIndex].strip();

             if currentTime > 0:
                 FOput.write(str(currentTime) + '\n')

##########################################################################

file_lines = FIput.readlines()
ParseMetrics( file_lines );

print 'Metrics parsed and written to ParsedMetrics.txt'

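Everything works except the logic that is supposed to iterate backwards over the preceding lines and add up the ORDER_SHOOT numbers since the last UNDERLYING_TICK event (starting at the code: if (underlyingParse == True and ii != 0): ...), and then subtract that total from the UNDERLYING_TICK event line currently being processed. This is what a typical line in the file being parsed looks like: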

08:40:02.039387(+26): UNDERLYING_TICK, 1377, 1499.89

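Basically, I'm only interested in the last data element (1499.89), which is the time in microseconds. I know it has to be something stupid. I just need another pair of eyes. Thanks!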
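
For illustration, here is a minimal sketch of how such a line could be split into its fields. It assumes every line follows the layout of the sample above; the function name `parse_metric_line` is hypothetical, and the sketch targets Python 3 rather than the Python 2 used in the script.

```python
def parse_metric_line(line):
    """Split a metrics line into (timestamp, event, count, micros).

    Assumes the layout 'HH:MM:SS.ffffff(+N): EVENT, int, float'.
    """
    # The three comma-separated parts: "stamp: EVENT", the integer, the time
    head, count, micros = [part.strip() for part in line.split(",")]
    # The stamp itself contains colons, so split on the first ": " only
    timestamp, event = [part.strip() for part in head.split(": ", 1)]
    return timestamp, event, int(count), float(micros)


sample = "08:40:02.039387(+26): UNDERLYING_TICK, 1377, 1499.89"
print(parse_metric_line(sample))
# ('08:40:02.039387(+26)', 'UNDERLYING_TICK', 1377, 1499.89)
```

The last element of the returned tuple is the value the script writes out.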

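2 Answers:

Answer 0: (Score: 0)

It's not clear what is wrong with your output, because you don't show your output and we can't make sense of your input.

I'm assuming the following: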

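  1. Lines are formatted as "absolutetime: TYPE, positiveinteger, float_time_duration_in_microseconds", where the last item is the amount of time the thing took.
  2. Lines are sorted by "absolutetime". As a consequence, the ORDER_SHOOTs that belong to an UNDERLYING_TICK are always on the lines since the previous UNDERLYING_TICK (or the beginning of the file), and only on those lines. If this assumption is not true, you need to sort the file first. You can do that with a separate program (e.g. pipe the output of sort), or use the bisect module to keep your lines sorted and easily extract the relevant ones.

If both of these assumptions are true, take a look at the following script. (Untested, because I don't have a large input sample or an output sample to compare against.)

This is in a more Pythonic style that is easier to read and understand, doesn't use global variables as function parameters, and should be more efficient because it neither walks backwards through the lines nor loads the entire file into memory to parse it.

It also demonstrates how to use the argparse module for command-line parsing. This isn't necessary, but if you have a lot of command-line Python scripts you should get familiar with it.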

    import sys
    
    VALIDTYPES = ['UNDERLYING_TICK','ORDER_SHOOT']
    
    def parseLine(line):
        # format of `tokens`:
        # 0 = absolute timestamp
        # 1 = event type
        # 2 = ???
        # 3 = timedelta (microseconds)
        tokens = [t.strip(':, \t') for t in line.strip().split()]
        # skip blank or malformed lines as well as unknown event types
        if len(tokens) < 4 or tokens[1] not in VALIDTYPES:
            return None
        tokens[2] = int(tokens[2])
        tokens[3] = float(tokens[3])
        return tuple(tokens)
    
    def parseMetrics(lines, parsetype):
        """Yield timedelta for each line of specified type
    
        If parsetype is 'UNDERLYING_TICK', subtract previous ORDER_SHOOT 
        timedeltas from the current UNDERLYING_TICK delta before yielding
        """
        order_shoots_between_ticks = []
        for line in lines:
            tokens = parseLine(line)
            if tokens is None:
                continue # go home early
            if parsetype=='UNDERLYING_TICK':
                if tokens[1]=='ORDER_SHOOT':
                    order_shoots_between_ticks.append(tokens)
                elif tokens[1]=='UNDERLYING_TICK':
                    adjustedtick = tokens[3] - sum(t[3] for t in order_shoots_between_ticks)
                    order_shoots_between_ticks = []
                    yield adjustedtick
            elif parsetype==tokens[1]:
                yield tokens[3]
    
    def parseFile(instream, outstream, parsetype):
        printablelines = ("{0:f}\n".format(time) for time in parseMetrics(instream, parsetype))
        outstream.writelines(printablelines)
    
    def main(argv):
        import argparse, datetime
        parser = argparse.ArgumentParser(description='Output timedeltas from a QE-Metric log file')
        parser.add_argument('mode', type=int, choices=range(1, len(VALIDTYPES)+1),
            help="the types to parse. Valid values are: 1 (Underlying), 2 (OrderShoot)")
        # optional positionals need nargs='?'; required= is only valid for options
        parser.add_argument('infile', nargs='?',
            default='{}.QE-Metric.log'.format(datetime.datetime.now().strftime('%Y%m%d')),
            help="the input file. Defaults to today's file: YYYYMMDD.QE-Metric.log. Use - for stdin.")
        parser.add_argument('outfile', nargs='?',
            default='ParsedMetrics.txt',
            help="the output file. Defaults to ParsedMetrics.txt. Use - for stdout.")
        parser.add_argument('--verbose', '-v', action='store_true')
        args = parser.parse_args(argv)
    
        args.mode = VALIDTYPES[args.mode-1]
    
        # text mode, so the formatted str lines work on both Python 2 and 3
        if args.infile=='-':
            instream = sys.stdin
        else:
            instream = open(args.infile, 'r')
    
        if args.outfile=='-':
            outstream = sys.stdout
        else:
            outstream = open(args.outfile, 'w')
    
        parseFile(instream, outstream, args.mode)
    
        instream.close()
        outstream.close()
    
        if args.verbose:
            sys.stderr.write('Metrics parsed and written to {0}\n'.format(args.outfile))
    
    
    
    if __name__=='__main__':
        main(sys.argv[1:])
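
If the sorting assumption (point 2 above) does not hold, the lines would have to be ordered before they reach the parser. The following is only a sketch of the bisect idea, written for Python 3. It assumes each line starts with a fixed-width `HH:MM:SS.ffffff` stamp followed by `(`, so that plain string comparison orders the stamps correctly; `timestamp_key` and `sorted_lines` are hypothetical helper names.

```python
import bisect


def timestamp_key(line):
    # Assumption: everything before the first "(" is a fixed-width timestamp
    return line.split("(", 1)[0]


def sorted_lines(lines):
    """Return the lines ordered by timestamp, ties kept in input order."""
    keyed = []
    for position, line in enumerate(lines):
        # insort keeps `keyed` sorted as each entry arrives; `position`
        # breaks ties so equal stamps stay in their original order
        bisect.insort(keyed, (timestamp_key(line), position, line))
    return [line for _, _, line in keyed]
```

For a file that is read in one go, `sorted(lines, key=timestamp_key)` would give the same ordering; `bisect.insort` is mainly of interest when lines arrive incrementally.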
    

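Answer 1: (Score: 0)

So, if the command-line option is 2, the function creates an output file in which every line contains just the "time" portion of the input-file lines that carry the 'order_shoot' token?

And if the command-line option is 1, the function creates an output file with one line for every input line containing the 'underlying_tick' token, except that the number you want is the underlying_tick time value minus all of the order_shoot time values that occurred after the preceding underlying_tick (or since the start of the file, if this is the first one)?

If that's correct, and all lines are unique (no duplicates), then I would suggest the following rewritten script: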

#### Imports unchanged.

import sys 
import datetime 

#### Changing the error checking to be a little simpler.
#### If the number of args is wrong, or the "mode" arg is
#### not a valid option, it will print the error message
#### and exit.

if len(sys.argv) not in (2,3) or sys.argv[1] not in ('1','2'):
    print 'Incorrect arguments. Usage: <exec> <mode (1) Underlying (2) OrderShoot> <FileName (optional)>'
    sys.exit()  

#### Get the current date, used to build the default file name.

now = datetime.datetime.now()

#### Using ternary logic to set the input file to either
#### the files specified in argv[2] (if it exists), or to
#### the default previously specified in the original code.

FIput = open((sys.argv[2] if len(sys.argv)==3 
                          else now.strftime("%Y%m%d")+'.QE-Metric.log'), 'r');

#### Output file not changed.

FOput = open('ParsedMetrics.txt', 'w');

#### START RE-WRITTEN FUNCTION

def ParseMetrics(file_lines,mode): 

#### The function now takes two params - the lines from the 
#### input file, and the 'mode' - whichever the user selected
#### at run-time. As you can see from the call down below, this
#### is taken straight from argv[1]. 

    if mode == '1':

#### So if we're doing underlying_tick mode, we want to find each tick,
#### then for each tick, sum the preceding order_shoots since the last
#### tick (or start of file for the first tick).

        ticks = [file_lines.index(line) for line in file_lines \
                                        if 'UNDERLYING_TICK' in line]

#### The above list comprehension iterates over file_lines, and creates
#### a list of the indexes to file_lines elements that contain ticks.
#### 
#### Then the following loop iterates over ticks, and for each tick,
#### subtracts the sum of all times for order_shoots that occur between
#### the previous tick (or the start of the file) and this tick, from
#### the time value of the tick itself. Then that value is written to
#### the outfile.

        prev_tick = -1
        for tick in ticks:
            sub_time = float(file_lines[tick].split(",")[2]) - \
                       sum([float(line.split(",")[2]) \
                       for line in file_lines[prev_tick+1:tick] \
                       if "ORDER_SHOOT" in line])
            FOput.write(str(sub_time) + '\n')
            prev_tick = tick

#### if the mode is 2, then it just runs through file_lines and
#### outputs all of the order_shoot time values.

    if mode == '2':
        for line in file_lines:
            if 'ORDER_SHOOT' in line:
                FOput.write(str(float(line.split(",")[2])) + '\n')

#### END OF REWRITTEN FUNCTION

#### As you can see immediately below, we pass sys.argv[1] for the
#### mode argument of the ParseMetrics function.

ParseMetrics(FIput.readlines(),sys.argv[1])

print 'Metrics parsed and written to ParsedMetrics.txt' 

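That should do the trick. The main problem is that if any of your "UNDERLYING_TICK" lines is an exact duplicate of any other such line, this will not work. Different logic would be needed to get the correct indexes.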
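
One way to sidestep the duplicate-line problem is to avoid `list.index` altogether and keep a running total in a single pass over the lines. The following is a rough sketch of that idea, not a drop-in replacement: it is written for Python 3, `adjusted_tick_times` is a hypothetical name, and it assumes the same comma-separated layout as above with the time in the third field.

```python
def adjusted_tick_times(file_lines):
    """Yield each tick time minus the ORDER_SHOOT times since the previous tick."""
    pending = 0.0  # sum of ORDER_SHOOT times seen since the last tick
    for line in file_lines:
        if "ORDER_SHOOT" in line:
            pending += float(line.split(",")[2])
        elif "UNDERLYING_TICK" in line:
            yield float(line.split(",")[2]) - pending
            pending = 0.0  # start a fresh total for the next tick


lines = [
    "08:40:01.000000(+5): ORDER_SHOOT, 1, 10.5",
    "08:40:02.039387(+26): UNDERLYING_TICK, 1377, 1499.89",
    "08:40:02.039387(+26): UNDERLYING_TICK, 1377, 1499.89",  # exact duplicate
]
for value in adjusted_tick_times(lines):
    print(value)
```

Because no line is ever looked up by its content, two identical tick lines are treated as two separate events.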

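I'm sure there's a way to make this better, but this was my first idea.

It's worth noting that I added a lot of inline line breaks to the source above for readability, but you may want to pull them out if you use it as written.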