Python - 在数据操作中和写入文件之前将列转置为行

时间:2014-07-28 22:50:39

标签: python csv transpose

我为Splunk开发了一个公共和开源应用程序(适用于Unix和Linux系统的Nmon性能监视器,请参阅https://apps.splunk.com/app/1753/

App的一个主要部分是由App自动启动的旧perl(循环,修改和更新)脚本,用于转换Nmon数据(这是某种自定义csv),从stdin读取并写出来 以前格式化的csv文件的部分(一个部分是一个性能监视器)

我现在想用Python完全重写这个脚本,这几乎是为第一个beta版本完成的......但是我面临着转置数据的困难,我担心自己无法解决它。

这就是为什么我今天请求帮助。

详情有困难:

Nmon为各个部分(cpu,内存,磁盘......)生成性能监视器,对于其中许多部分来说,没有很大的困难,但提取好的时间戳等等。 但是对于所有具有“设备”概念的部分(例如提供的示例中的DISKBUSY,表示磁盘忙碌的时间百分比)必须进行转换并转换为稍后的部分 利用的

目前,我可以按如下方式生成数据:

示例:

time,sda,sda1,sda2,sda3,sda5,sda6,sda7,sdb,sdb1,sdc,sdc1,sdc2,sdc3
26-JUL-2014 11:10:44,4.4,0.0,0.0,0.0,0.4,1.9,2.5,0.0,0.0,10.2,10.2,0.0,0.0
26-JUL-2014 11:10:54,4.8,0.0,0.0,0.0,0.3,2.0,2.6,0.0,0.0,5.4,5.4,0.0,0.0
26-JUL-2014 11:11:04,4.8,0.0,0.0,0.0,0.4,2.3,2.1,0.0,0.0,17.8,17.8,0.0,0.0
26-JUL-2014 11:11:14,2.1,0.0,0.0,0.0,0.2,0.5,1.5,0.0,0.0,28.2,28.2,0.0,0.0

我们的目标是转移我们将在标题“时间,设备,价值”中提供的数据,例如:

time,device,value
26-JUL-2014 11:10:44,sda,4.4
26-JUL-2014 11:10:44,sda1,0.0
26-JUL-2014 11:10:44,sda2,0.0

等等。

一个月前,我已经针对几乎相同的需求打开了一个问题(对于另一个应用程序而不是完全相同的数据,但同样需要将列转换为行)

Python - CSV time oriented Transposing large number of columns to rows

我有一个非常好的答案,完美的伎俩,因此我无法将这段代码回收到这个新的上下文中。 不同之处在于我希望在代码中包含数据转置,这样脚本只能在内存中工作,避免处理多个临时文件。

以下是一段代码:

注意:需要使用Python 2x

###################
# Dynamic Sections : data requires to be transposed to be exploitable within Splunk
###################


dynamic_section = ["DISKBUSY"]

for section in dynamic_section:

    # Set output file
    currsection_output = DATA_DIR + HOSTNAME + '_' + day + '_' + month + '_' + year + '_' + hour + minute + second + '_' + section + '.csv'

    # Open output for writing
    with open(currsection_output, "w") as currsection:

        for line in data:

            # Extract sections, and write to output
            myregex = r'^' + section + '[0-9]*' + '|ZZZZ.+'
            find_section = re.match( myregex, line)
            if find_section:

                # csv header

                # Replace some symbols
                line=re.sub("%",'_PCT',line)
                line=re.sub(" ",'_',line)

                # Extract header excluding data that always has Txxxx for timestamp reference
                myregex = '(' + section + ')\,([^T].+)'
                fullheader_match = re.search( myregex, line)            

                if fullheader_match:
                    fullheader = fullheader_match.group(2)

                    header_match = re.match( r'([a-zA-Z\-\/\_0-9]+,)([a-zA-Z\-\/\_0-9\,]*)', fullheader)    

                    if header_match:
                        header = header_match.group(2)

                        # Write header
                        currsection.write('time' + ',' + header + '\n'),


                # Extract timestamp

                # Nmon V9 and prior do not have date in ZZZZ
                # If unavailable, we'll use the global date (AAA,date)
                ZZZZ_DATE = '-1'
                ZZZZ_TIME = '-1'                

                # For Nmon V10 and more             

                timestamp_match = re.match( r'^ZZZZ\,(.+)\,(.+)\,(.+)\n', line)
                if timestamp_match:
                    ZZZZ_TIME = timestamp_match.group(2)
                    ZZZZ_DATE = timestamp_match.group(3)            
                    ZZZZ_timestamp = ZZZZ_DATE + ' ' + ZZZZ_TIME

                # For Nmon V9 and less                  

                if ZZZZ_DATE == '-1':
                    ZZZZ_DATE = DATE
                    timestamp_match = re.match( r'^ZZZZ\,(.+)\,(.+)\n', line)
                    if timestamp_match:
                        ZZZZ_TIME = timestamp_match.group(2)                    
                        ZZZZ_timestamp = ZZZZ_DATE + ' ' + ZZZZ_TIME

                # Extract Data
                myregex = r'^' + section + '\,(T\d+)\,(.+)\n'
                perfdata_match = re.match( myregex, line)
                if perfdata_match:
                    perfdata = perfdata_match.group(2)

                    # Write perf data
                    currsection.write(ZZZZ_timestamp + ',' + perfdata + '\n'),

        # End for

    # Open output for reading and show number of line we extracted
    with open(currsection_output, "r") as currsection:

        num_lines = sum(1 for line in currsection)
        print (section + " section: Wrote", num_lines, "lines")

# End for

行:

                currsection.write('time' + ',' + header + '\n'),

将包含标题

行:

            currsection.write(ZZZZ_timestamp + ',' + perfdata + '\n'),

逐行包含数据

注意:最终数据(标题和正文数据)应该在目标中还包含其他信息,以简化我在上面的代码中删除它的内容

对于不需要数据转置的静态部分,相同的行将是:

                    currsection.write('type' + ',' + 'serialnum' + ',' + 'hostname' + ',' + 'time' + ',' + header + '\n'),

                currsection.write(section + ',' + SN + ',' + HOSTNAME + ',' + ZZZZ_timestamp + ',' + perfdata + '\n'),

最大的目标是能够在所需定义之后和编写之前转置数据。

此外,调用性能和最小系统资源(例如使用临时文件而不是内存)是防止在系统上定期生成过高CPU压力的必要条件。

有人能帮助我实现这个目标吗?我一次又一次地寻找它,我很确定有多种方法可以实现这一点(zip,map,dictionary,list,split ...)但是我没能实现它......

请放纵,这是我的第一个真正的Python脚本: - )

非常感谢您的帮助!

更多详情:

  • 测试nmon文件

可在此处检索小型测试nmon文件:http://pastebin.com/xHLRbBU0

  • 当前完整的脚本

可以在此处检索当前完整的脚本:http://pastebin.com/QEnXj6Yh

要测试脚本,需要:

  • 将SPLUNK_HOME变量导出为与您相关的任何内容,例如:

    mkdir / tmp / nmon2csv

- >将脚本和nmon文件放在此处,允许在脚本上执行

export SPLUNK_HOME=/tmp/nmon2csv
mkdir -p etc/apps/nmon

最后:

cat test.nmon | ./nmon2csv.py

将在/ tmp / nmon2csv / etc / apps / nmon / var / *

中生成数据

更新:使用csv模块的工作代码:

###################
# Dynamic Sections : data requires to be transposed to be exploitable within Splunk
###################

dynamic_section = ["DISKBUSY","DISKBSIZE","DISKREAD","DISKWRITE","DISKXFER","DISKRIO","DISKWRIO","IOADAPT","NETERROR","NET","NETPACKET","JFSFILE","JFSINODE"]

for section in dynamic_section:

    # Set output file (will opened after transpose)
    currsection_output = DATA_DIR + HOSTNAME + '_' + day + '_' + month + '_' + year + '_' + hour + minute + second + '_' + section + '.csv'

    # Open Temp
    with TemporaryFile() as tempf:

        for line in data:

            # Extract sections, and write to output
            myregex = r'^' + section + '[0-9]*' + '|ZZZZ.+'
            find_section = re.match( myregex, line)
            if find_section:

                # csv header

                # Replace some symbols
                line=re.sub("%",'_PCT',line)
                line=re.sub(" ",'_',line)

                # Extract header excluding data that always has Txxxx for timestamp reference
                myregex = '(' + section + ')\,([^T].+)'
                fullheader_match = re.search( myregex, line)            

                if fullheader_match:
                    fullheader = fullheader_match.group(2)

                    header_match = re.match( r'([a-zA-Z\-\/\_0-9]+,)([a-zA-Z\-\/\_0-9\,]*)', fullheader)    

                    if header_match:
                        header = header_match.group(2)

                        # Write header
                        tempf.write('time' + ',' + header + '\n'),  

                # Extract timestamp

                # Nmon V9 and prior do not have date in ZZZZ
                # If unavailable, we'll use the global date (AAA,date)
                ZZZZ_DATE = '-1'
                ZZZZ_TIME = '-1'                

                # For Nmon V10 and more             

                timestamp_match = re.match( r'^ZZZZ\,(.+)\,(.+)\,(.+)\n', line)
                if timestamp_match:
                    ZZZZ_TIME = timestamp_match.group(2)
                    ZZZZ_DATE = timestamp_match.group(3)            
                    ZZZZ_timestamp = ZZZZ_DATE + ' ' + ZZZZ_TIME

                # For Nmon V9 and less                  

                if ZZZZ_DATE == '-1':
                    ZZZZ_DATE = DATE
                    timestamp_match = re.match( r'^ZZZZ\,(.+)\,(.+)\n', line)
                    if timestamp_match:
                        ZZZZ_TIME = timestamp_match.group(2)                    
                        ZZZZ_timestamp = ZZZZ_DATE + ' ' + ZZZZ_TIME

                # Extract Data
                myregex = r'^' + section + '\,(T\d+)\,(.+)\n'
                perfdata_match = re.match( myregex, line)
                if perfdata_match:
                    perfdata = perfdata_match.group(2)

                    # Write perf data
                    tempf.write(ZZZZ_timestamp + ',' + perfdata + '\n'),


        # Open final for writing
        with open(currsection_output, "w") as currsection:

            # Rewind temp
            tempf.seek(0)

            writer = csv.writer(currsection)
            writer.writerow(['type', 'serialnum', 'hostname', 'time', 'device', 'value'])           

            for d in csv.DictReader(tempf):
                time = d.pop('time')
                for device, value in sorted(d.items()):
                    row = [section, SN, HOSTNAME, time, device, value]
                    writer.writerow(row)            

            # End for

    # Open output for reading and show number of line we extracted
    with open(currsection_output, "r") as currsection:

        num_lines = sum(1 for line in currsection)
        print (section + " section: Wrote", num_lines, "lines")

# End for

1 个答案:

答案 0 :(得分:2)

  

在   目标是转移我们将在标题中的数据   "时间,设备,值"

这个粗略的换位逻辑看起来像这样:

text = '''time,sda,sda1,sda2,sda3,sda5,sda6,sda7,sdb,sdb1,sdc,sdc1,sdc2,sdc3
26-JUL-2014 11:10:44,4.4,0.0,0.0,0.0,0.4,1.9,2.5,0.0,0.0,10.2,10.2,0.0,0.0
26-JUL-2014 11:10:54,4.8,0.0,0.0,0.0,0.3,2.0,2.6,0.0,0.0,5.4,5.4,0.0,0.0
26-JUL-2014 11:11:04,4.8,0.0,0.0,0.0,0.4,2.3,2.1,0.0,0.0,17.8,17.8,0.0,0.0
26-JUL-2014 11:11:14,2.1,0.0,0.0,0.0,0.2,0.5,1.5,0.0,0.0,28.2,28.2,0.0,0.0
'''

import csv

for d in csv.DictReader(text.splitlines()):
    time = d.pop('time')
    for device, value in sorted(d.items()):
        print time, device, value

将所有内容组合成一个完整的脚本看起来像这样:

import csv

with open('transposed.csv', 'wb') as destfile:
    writer = csv.writer(destfile)
    writer.writerow(['time', 'device', 'value'])
    with open('data.csv', 'rb') as sourefile:
        for d in csv.DictReader(sourcefile):
            time = d.pop('time')
            for device, value in sorted(d.items()):
                row = [time, device, value]
                writer.writerow(row)