Merging multiple CSV files

Time: 2015-07-18 06:17:20

Tags: python sql csv pandas

I am using q to transform my CSV file, log.csv (the file is linked). Its format is:

datapath,port,rxpkts,rxbytes,rxerror,txpkts,txbytes,txerror
4,1,178,25159,0,40,3148,0
4,2,3,230,0,213,27897,0
4,3,3,230,0,212,27807,0
4,4,4,320,0,211,27717,0
4,5,3,230,0,212,27807,0
4,6,3,230,0,212,27807,0
4,7,4,320,0,211,27717,0
4,8,4,320,0,211,27717,0
4,9,4,320,0,211,27717,0
4,a,4,320,0,211,27717,0
4,b,3,230,0,212,27807,0
4,fffffffe,7,578,0,209,27549,0
3,1,197,26863,0,21,1638,0
3,2,3,230,0,215,28271,0
3,3,5,390,0,215,28271,0
3,4,2,140,0,216,28361,0
3,5,4,320,0,214,28181,0
3,6,3,230,0,215,28271,0
3,fffffffe,7,578,0,212,28013,0
5,1,208,27401,0,6,488,0
5,fffffffe,7,578,0,208,27401,0
2,1,180,24228,0,18,1368,0
2,2,2,140,0,195,25366,0
2,3,2,140,0,195,25366,0
2,4,3,230,0,194,25276,0
2,5,3,230,0,194,25276,0
2,6,2,140,0,195,25366,0
2,fffffffe,7,578,0,191,25018,0
1,1,38,5096,0,182,23602,0
1,2,42,5419,0,179,23369,0
1,3,61,7152,0,159,21546,0
1,4,28,4611,0,192,24087,0
1,5,46,6022,0,174,22676,0
1,fffffffe,7,578,0,214,28210,0

I want to convert it to the following format: [image not included]

The number of ports may vary.

Current code:

python q -H -d "," "select rxpkts, txpkts from ./log.csv where datapath = i and port = j" > i_j.csv;

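So I create i*j files and then combine them manually. Is it possible to do this in one go, either by modifying the SQL query above, by combining the files with Python, or by using pandas as suggested in the comments?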

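import subprocess

def printit():
    # Run one q query per (datapath, port) pair and write each result to its own file.
    # Note: range(1, 6) covers only ports 1-5, not ports such as a, b or fffffffe.
    for i in range(1, 6):
        for j in range(1, 6):
            query = "select rxpkts, txpkts from ./log.csv where datapath = "+str(i)+" and port = "+str(j)
            fileName = str(i)+"_"+str(j)+".csv"
            with open(fileName, "w") as f:
                # check_call waits for q to finish before the file is closed
                subprocess.check_call(["python", "q", "-H", "-d", ",", query], stdout=f)

printit()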

1 answer:

Answer 0 (score: 1)

You can use set_index together with stack.

import pandas as pd

# your data
# ======================================
# assumes the log.csv from the question is in the working directory
df = pd.read_csv('log.csv')
print(df)

    datapath      port  rxpkts   ...     txpkts  txbytes  txerror
0          4         1     178   ...         40     3148        0
1          4         2       3   ...        213    27897        0
2          4         3       3   ...        212    27807        0
3          4         4       4   ...        211    27717        0
4          4         5       3   ...        212    27807        0
5          4         6       3   ...        212    27807        0
6          4         7       4   ...        211    27717        0
7          4         8       4   ...        211    27717        0
8          4         9       4   ...        211    27717        0
9          4         a       4   ...        211    27717        0
..       ...       ...     ...   ...        ...      ...      ...
24         2         4       3   ...        194    25276        0
25         2         5       3   ...        194    25276        0
26         2         6       2   ...        195    25366        0
27         2  fffffffe       7   ...        191    25018        0
28         1         1      38   ...        182    23602        0
29         1         2      42   ...        179    23369        0
30         1         3      61   ...        159    21546        0
31         1         4      28   ...        192    24087        0
32         1         5      46   ...        174    22676        0
33         1  fffffffe       7   ...        214    28210        0

[34 rows x 8 columns]


# reshaping
# ======================================
series_res = df[df.columns[:4]].set_index(['datapath', 'port']).stack()
series_res.name = 'value'
print(series_res)



datapath  port             
4         1         rxpkts       178
                    rxbytes    25159
          2         rxpkts         3
                    rxbytes      230
          3         rxpkts         3
                    rxbytes      230
          4         rxpkts         4
                    rxbytes      320
          5         rxpkts         3
                    rxbytes      230
                               ...  
1         2         rxpkts        42
                    rxbytes     5419
          3         rxpkts        61
                    rxbytes     7152
          4         rxpkts        28
                    rxbytes     4611
          5         rxpkts        46
                    rxbytes     6022
          fffffffe  rxpkts         7
                    rxbytes      578
Name: value, dtype: int64
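
Note that df.columns[:4] picks datapath, port, rxpkts and rxbytes by position, whereas the query in the question asks for rxpkts and txpkts. A minimal sketch of the same reshape with the columns selected by name is given below. The SAMPLE string holds a few rows copied from log.csv so the snippet runs on its own, and reading port as a string (dtype={"port": str}) is an assumption made here so that values such as a and fffffffe are kept alongside the numeric ports.

```python
import io
import pandas as pd

# A few rows copied from log.csv above; for the real file one would call
# pd.read_csv("log.csv", dtype={"port": str}) instead of using StringIO.
SAMPLE = """datapath,port,rxpkts,rxbytes,rxerror,txpkts,txbytes,txerror
4,1,178,25159,0,40,3148,0
4,a,4,320,0,211,27717,0
4,fffffffe,7,578,0,209,27549,0
5,1,208,27401,0,6,488,0
"""

# Assumption: treat port as text so 'a' and 'fffffffe' are handled like '1'.
df = pd.read_csv(io.StringIO(SAMPLE), dtype={"port": str})

# Select the wanted columns by name instead of by position, then reshape
# with set_index + stack as in the answer.
wanted = ["rxpkts", "txpkts"]
series_res = df[["datapath", "port"] + wanted].set_index(["datapath", "port"]).stack()
series_res.name = "value"
print(series_res)
```

Because nothing in this snippet hard-codes the list of ports, it should work unchanged when the number of ports per datapath varies.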



df_res = pd.DataFrame(series_res)
df_res.T

datapath      4                                         ...        1                                        
port          1              2              3           ...        4              5         fffffffe        
         rxpkts rxbytes rxpkts rxbytes rxpkts rxbytes   ...   rxpkts rxbytes rxpkts rxbytes   rxpkts rxbytes
value       178   25159      3     230      3     230   ...       28    4611     46    6022        7     578

[1 rows x 68 columns]
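
If the goal is a single CSV file rather than a DataFrame with three header rows, the stacked result can be flattened into one header row and one value row. The sketch below is only one possible layout: the target format in the question was shown as an image that is not available here, so the datapath_port_counter column labels (for example 4_1_rxpkts) are a hypothetical naming scheme, and the SAMPLE rows are a small excerpt of log.csv used to keep the snippet self-contained.

```python
import io
import pandas as pd

# Small excerpt of log.csv; replace io.StringIO(SAMPLE) with "log.csv"
# to run against the full file.
SAMPLE = """datapath,port,rxpkts,rxbytes,rxerror,txpkts,txbytes,txerror
4,1,178,25159,0,40,3148,0
4,a,4,320,0,211,27717,0
4,fffffffe,7,578,0,209,27549,0
5,1,208,27401,0,6,488,0
"""

df = pd.read_csv(io.StringIO(SAMPLE), dtype={"port": str})

# Same reshape as in the answer, restricted to the two packet counters.
stacked = (
    df[["datapath", "port", "rxpkts", "txpkts"]]
    .set_index(["datapath", "port"])
    .stack()
)

# Hypothetical naming scheme: join the three index levels into one label,
# e.g. (4, '1', 'rxpkts') -> "4_1_rxpkts".
labels = ["_".join(str(part) for part in key) for key in stacked.index]

# One row holding every value, one column per (datapath, port, counter).
wide = pd.DataFrame([stacked.to_numpy()], columns=labels)

# With no path argument, to_csv returns the CSV text; pass a file name
# such as "combined.csv" to write it to disk instead.
csv_text = wide.to_csv(index=False)
print(csv_text)
```

This replaces both the i*j intermediate files and the manual combining step with a single pass over log.csv.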