I have a few lines of strings like the following:
data =
[15:07:29] (+?.?????????) host_name data: { cpu_id = 0 }, { var1 = 3, var2 = 4, var3 = 30, var4 = 87.7187 }
[15:07:30] (+0:0:1) host_name data: { cpu_id = 0 }, { var1 = 4, var2 = 4, var3 = 29, var4 = 0.073525 }
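I would like to get a pandas DataFrame from them like this:
To do this, I first split the string on newlines, which gave me a list, and then built the DataFrame: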
data_list = data.split('\n')
['[15:07:29] (+?.?????????) host_name data: { cpu_id = 0 }, { var1 = 3, var2 = 4, var3 = 30, var4 = 87.7187 }', '[15:07:30] (+0:0:1) host_name data: { cpu_id = 0 }, { var1 = 4, var2 = 4, var3 = 29, var4 = 0.073525 }']
df = pd.read_csv(io.StringIO('\n'.join(data_list)), delim_whitespace=True)
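This gave me a weird DataFrame with 26 columns. I realize that the strings contain spaces at irregular positions. Is it possible to extract the data of interest from the strings and create a DataFrame like the one shown above?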
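Answer 0: (score: 1)
I recently had a very similar problem. It looks complicated at first, but if every line has the same number of variables, building a regular expression is actually quite easy.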
import re
import pandas as pd

data = """
[15:07:29] (+?.?????????) host_name data: { cpu_id = 0 }, { var1 = 3, var2 = 4, var3 = 30, var4 = 87.7187 }
[15:07:30] (+0:0:1) host_name data: { cpu_id = 0 }, { var1 = 4, var2 = 4, var3 = 29, var4 = 0.073525 }
"""


def try_convert(s):
    # Return s as a float when possible; otherwise keep the original string.
    try:
        v = float(s)
    except ValueError:
        v = s
    return v


def parse_data_string(s):
    # Groups: time, delta, host, then for each variable the "name = value" pair and its value.
    regex = r"\[(\d\d:\d\d:\d\d)\] \((.*)\) (\w+) data: { (\w+ = ([-+]?\d*\.\d+|\d+),*) }, { (\w+ = ([-+]?\d*\.\d+|\d+)), (\w+ = ([-+]?\d*\.\d+|\d+)), (\w+ = ([-+]?\d*\.\d+|\d+)), (\w+ = ([-+]?\d*\.\d+|\d+),? )}"
    matches = re.finditer(regex, s, re.MULTILINE)
    for match in matches:
        groups = list(match.groups())
        # Keep time, delta and the five numeric values; a tuple keeps the column order fixed.
        row = [try_convert(groups[i]) for i in (0, 1, 4, 6, 8, 10, 12)]
        yield row


df = pd.DataFrame(parse_data_string(data))
print(df)
print(df.dtypes)
I used ([-+]?\d*\.\d+|\d+) to match the numeric values; the rest of the regular expression is straightforward.
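If the number of variables is not the same on every line, one fixed pattern becomes hard to maintain. A possible alternative, shown here only as a sketch, is to match the line header once and then collect every `name = number` pair with a second, smaller pattern. The names `HEADER`, `FIELD`, `parse_line` and the column labels `time`, `delta`, `host` are made up for this illustration, and the sketch assumes every field value is a plain integer or decimal number.

```python
import re

import pandas as pd

data = """
[15:07:29] (+?.?????????) host_name data: { cpu_id = 0 }, { var1 = 3, var2 = 4, var3 = 30, var4 = 87.7187 }
[15:07:30] (+0:0:1) host_name data: { cpu_id = 0 }, { var1 = 4, var2 = 4, var3 = 29, var4 = 0.073525 }
"""

# Header part of a line: "[time] (delta) host data:".
HEADER = re.compile(r"\[(\d\d:\d\d:\d\d)\] \((.*?)\) (\w+) data:")
# One "name = number" pair; the number is a decimal or an integer.
FIELD = re.compile(r"(\w+) = ([-+]?\d*\.\d+|[-+]?\d+)")


def parse_line(line):
    """Return a dict for one log line, or None if the header does not match."""
    header = HEADER.match(line)
    if header is None:
        return None
    row = {
        "time": header.group(1),
        "delta": header.group(2),
        "host": header.group(3),
    }
    # Search only after the header so header text is never read as a field.
    for name, value in FIELD.findall(line, header.end()):
        row[name] = float(value)
    return row


# Blank lines and lines without a valid header are skipped.
rows = [row for row in map(parse_line, data.splitlines()) if row is not None]
df = pd.DataFrame(rows)
print(df)
```

With this layout the column names come from the data itself, so a line carrying an extra variable simply adds a column (filled with NaN for the other rows) instead of breaking the match.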
The conversion to appropriate types could probably be done better.
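One way the type conversion might be improved is to leave the captured values as strings and let pandas convert whole columns afterwards. The sketch below assumes the rows have already been extracted as strings; the frame `raw` and its column names are hypothetical stand-ins for that output. The `delta` column is left as text because a value such as `+?.?????????` cannot be parsed.

```python
import pandas as pd

# Hypothetical raw rows as a regex would capture them: every value is a string.
raw = pd.DataFrame(
    [
        ["15:07:29", "+?.?????????", "0", "3", "4", "30", "87.7187"],
        ["15:07:30", "+0:0:1", "0", "4", "4", "29", "0.073525"],
    ],
    columns=["time", "delta", "cpu_id", "var1", "var2", "var3", "var4"],
)

typed = raw.copy()
for col in ["cpu_id", "var1", "var2", "var3", "var4"]:
    # to_numeric yields an integer dtype for whole numbers, a float dtype otherwise.
    typed[col] = pd.to_numeric(typed[col])

# Parse the wall-clock time into datetime.time objects.
typed["time"] = pd.to_datetime(typed["time"], format="%H:%M:%S").dt.time

print(typed.dtypes)
```

Compared with converting everything through `float`, this keeps counters such as `var1` as integers and only `var4` as a float.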