Question

I have a text file that contains many lines in the following format:

real    0m0.020s
user    0m0.000s
sys 0m0.000s
Round  1  completed. with matrix size of  1200 x 1200 with threads 8

real    0m0.022s
user    0m0.000s
sys 0m0.001s
Round  2  completed. with matrix size of  1200 x 1200 with threads 8

There are about 500 such entries (the above shows 2 of them). I can't figure out how to get them into a pandas DataFrame that looks like this:

Matrix Size    Threads    Round    Real    User    Sys
1200 x 1200    8          1        0.0020  0.0000  0.0000
1200 x 1200    8          2        0.0022  0.0000  0.0001

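Is there a way to do this with regular expressions, or some other way to convert this test output into a DataFrame? Also, I'm not sure I'm interpreting the times correctly: I read 0m as 0 minutes and 0.02 as 0.02 seconds.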

Answer 1

You can use a regular expression:

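import re
import pandas as pd

# `data` must hold the whole file as one string; 'timings.txt' is a placeholder name
with open('timings.txt') as f:
    data = f.read()

# one match per block of four lines; \d+ also allows multi-digit minutes and seconds
regex = re.compile(r'real +(\d+m\d+\.\d+s)\nuser +(\d+m\d+\.\d+s)\nsys +(\d+m\d+\.\d+s)\nRound +(\d+).+of +(\d+ x \d+).+threads (\d+)')

# findall returns one tuple of the six captured groups per block
df = pd.DataFrame(regex.findall(data), columns=['real', 'user', 'sys', 'round', 'matrix size', 'threads'])

print(df)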

Output:

       real      user       sys round  matrix size threads
0  0m0.020s  0m0.000s  0m0.000s     1  1200 x 1200       8
1  0m0.022s  0m0.000s  0m0.001s     2  1200 x 1200       8
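
On the time format from the question: in `time` output, `0m0.020s` reads as 0 minutes and 0.020 seconds. The columns above are still strings, so here is a minimal sketch of converting them to seconds. The helper name `to_seconds` and its pattern are illustrative assumptions, not part of the original answer.

```python
import re

# Optional "<minutes>m" prefix followed by "<seconds>s", e.g. "0m0.020s" or "1m2.500s".
TIME_RE = re.compile(r'(?:(\d+)m)?(\d+(?:\.\d+)?)s')

def to_seconds(text):
    """Convert a `time`-style duration string to a float number of seconds."""
    match = TIME_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f'unrecognised duration: {text!r}')
    minutes = int(match.group(1) or 0)  # a missing minutes part counts as 0
    seconds = float(match.group(2))
    return minutes * 60 + seconds

print(to_seconds('0m0.020s'))
print(to_seconds('1m2.500s'))
```

With the DataFrame above, something like `df['real'] = df['real'].map(to_seconds)` should apply the conversion to a column, and the same goes for `user` and `sys`.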

Answer 2

If you want to solve this using pandas only, you can use str.split():

import pandas as pd

# data
s = """real    0m0.020s
user    0m0.000s
sys 0m0.000s
Round  1  completed. with matrix size of  1200 x 1200 with threads 8

real    0m0.022s
user    0m0.000s
sys 0m0.001s
Round  2  completed. with matrix size of  1200 x 1200 with threads 8"""

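# split on blank lines to get one string per block, then split each block on the labels
pattern = r'   |real | with |user    |sys |matrix size of  |threads |\n'
df = (pd.Series(s.split('\n\n'))
        .str.split(pattern)
        .apply(lambda parts: [p for p in parts if p])  # drop empty strings left by the split
        .apply(pd.Series))

# keep only the round number from "Round  N  completed."
# (str.strip() treats its argument as a set of characters, not a pattern, so extract is safer)
df[3] = df[3].str.extract(r'(\d+)', expand=False)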

# rename columns
df.columns = ['real', 'user', 'sys', 'round', 'matrix size', 'threads']

Output:

       real      user       sys round  matrix size threads
0  0m0.020s  0m0.000s  0m0.000s     1  1200 x 1200       8
1  0m0.022s  0m0.000s  0m0.001s     2  1200 x 1200       8

Note that this will be slower than gmds's example:

1000 loops, best of 3: 4.42 ms per loop vs. 1000 loops, best of 3: 1.84 ms per loop
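
Neither answer produces numeric values or the column order shown in the question. The following is a minimal sketch of one way to get there, assuming every block has exactly the four lines shown. The names `BLOCK_RE` and `parse_records` are hypothetical, and the times are reported in seconds (minutes * 60 + seconds).

```python
import re

# One timing block: three `time` lines followed by the "Round ..." summary line.
# \s+ tolerates the varying runs of spaces and the newlines between lines.
BLOCK_RE = re.compile(
    r'real\s+(?P<real_m>\d+)m(?P<real_s>[\d.]+)s\s+'
    r'user\s+(?P<user_m>\d+)m(?P<user_s>[\d.]+)s\s+'
    r'sys\s+(?P<sys_m>\d+)m(?P<sys_s>[\d.]+)s\s+'
    r'Round\s+(?P<round>\d+)\s+completed\.\s+with matrix size of\s+'
    r'(?P<size>\d+ x \d+)\s+with threads\s+(?P<threads>\d+)'
)

def parse_records(text):
    """Return one dict per timing block, keyed by the column names from the question."""
    records = []
    for m in BLOCK_RE.finditer(text):
        record = {
            'Matrix Size': m['size'],
            'Threads': int(m['threads']),
            'Round': int(m['round']),
        }
        for name in ('real', 'user', 'sys'):
            # Combine the minutes and seconds parts into plain seconds.
            record[name.capitalize()] = int(m[name + '_m']) * 60 + float(m[name + '_s'])
        records.append(record)
    return records

sample = """real    0m0.020s
user    0m0.000s
sys 0m0.000s
Round  1  completed. with matrix size of  1200 x 1200 with threads 8

real    0m0.022s
user    0m0.000s
sys 0m0.001s
Round  2  completed. with matrix size of  1200 x 1200 with threads 8"""

records = parse_records(sample)
for record in records:
    print(record)
```

Passing the result to `pd.DataFrame(records)` should give one row per round, with columns in the order the keys were inserted.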
