In a project we are working on, every line of the log files we encounter has the following structure:
2012-01-02,12:50:32,658,2,1,2,0,0,0,0,1556,1555,62,60,2,3,0,0,0,0,1559,1557,1557,63,64,65,0.305,0.265,0.304,0.308,0.309
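The structure of the string is embedded in the string itself.
First we have some metadata:
Then, one after another, the data of each group.
The group data has the following structure (shown here for measurement group 1):
The line then continues with the calculated values of all the sensors above, in order (i.e. no more control values or raw values).
In the example, the total number of sensors is 2 + 3 = 5, so the calculated part of the line is: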
0.305,0.265,0.304,0.308,0.309
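The layout described above can be sketched in code. This is a minimal sketch, not the asker's code: the field layout in the comments is inferred from the sample line and from the answers further down, and `N_CONTROL` and `count_sensors` are names made up for the illustration.

```python
# Field layout inferred from the sample line and the answers (an assumption):
# header: date, time, an unnamed counter, number of groups
# group:  group number, sensor count, 4 control values,
#         raw type-1 values (one per sensor), raw type-2 values (one per sensor)
# tail:   one calculated value per sensor, across all groups

SAMPLE = ("2012-01-02,12:50:32,658,2,"
          "1,2,0,0,0,0,1556,1555,62,60,"
          "2,3,0,0,0,0,1559,1557,1557,63,64,65,"
          "0.305,0.265,0.304,0.308,0.309")

N_CONTROL = 4  # assumed fixed number of control values per group


def count_sensors(fields):
    """Return (sensors per group, index where the calculated values start)."""
    n_groups = int(fields[3])
    pos = 4  # first field after the header
    counts = []
    for _ in range(n_groups):
        n_sensors = int(fields[pos + 1])
        counts.append(n_sensors)
        # skip group number, sensor count, control values and both raw blocks
        pos += 2 + N_CONTROL + 2 * n_sensors
    return counts, pos


fields = SAMPLE.split(",")
counts, tail_start = count_sensors(fields)
calculated = fields[tail_start:]
print(counts, sum(counts), calculated)
```

Because each group announces its own sensor count, the position of the calculated values can only be found by walking the groups from left to right.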
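My question is: if we want to normalize the values for each sensor like this:
date, time, number of the measurement group, number of the sensor within the group, (raw value type 1, raw value type 2, calculated value)
what would be a flexible solution, given that pretty much everything can vary for each date and time (meaning the number of measurement groups is variable, and the number of sensors in each group can vary too)?
For the example, the final output should be:
What I have done so far is to segment the measurements over time and "statically" define which columns go into the rows that belong to one case, which group a sensor belongs to, what its sensor number is, and so on.
This is not a good solution, because every change in the measurement setup leads to further changes in the code.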
line = """2012-01-02,12:50:32,658,2,1,2,0,0,0,0,1556,1555,62,60,2,3,0,0,0,0,1559,1557,1557,63,64,65,0.305,0.265,0.304,0.308,0.309"""
parts = line.split(",")
date = parts[0]
# Statically defined group and sensor numbers and column indices, one entry per sensor
groupnames = [1, 1, 2, 2, 2]
sensornumbers = [1, 2, 1, 2, 3]
raw_type1_idx = [10, 11, 20, 21, 22]
raw_type2_idx = [12, 13, 23, 24, 25]
calc_idx = [26, 27, 28, 29, 30]
for i, j, k, l, m in zip(groupnames, sensornumbers, raw_type1_idx, raw_type2_idx, calc_idx):
    output_tpl = parts[k], parts[l], parts[m]
    print("%s,%s,%s,%s" % (date, i, j, output_tpl))
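Is there a better, more Pythonic way to do something like this?
Answer 0 (score: 2)
Not a particularly nice data structure. Assuming there are always 4 control values, the following should work for any number of groups and sensors.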
sample = "2012-01-02,12:50:32,658,2,1,2,0,0,0,0,1556,1555,62,60,2,3,0,0,0,0,1559,1557,1557,63,64,65,0.305,0.265,0.304,0.308,0.309"

def parse_line(line):
    line = line.split(',')
    sensors = []
    date = line[0]
    time = line[1]
    row = line[2]
    groups = int(line[3])
    c = 4  # index of the first field of the current group
    for i in range(groups):
        group_num = line[c]
        sensor_count = int(line[c+1])
        # 4 control values plus two raw values per sensor
        sensor_data_len = 4 + sensor_count * 2
        # skip group number, sensor count and the 4 control values
        sensor_data = line[c+2+4:c+2+sensor_data_len]
        c += 2 + sensor_data_len
        for j in range(sensor_count):
            # every sensor_count-th item belongs to the same sensor
            sensors.append([group_num, str(j+1)] + sensor_data[j::sensor_count])
    # the rest of the line holds one calculated value per sensor
    for s, v in zip(sensors, line[c:]):
        s.append(v)
    # Now have a list of lists, one per sensor, containing all the data
    for s in sensors:
        print(",".join([date, time] + s))

parse_line(sample)
Output:
2012-01-02,12:50:32,1,1,1556,62,0.305
2012-01-02,12:50:32,1,2,1555,60,0.265
2012-01-02,12:50:32,2,1,1559,63,0.304
2012-01-02,12:50:32,2,2,1557,64,0.308
2012-01-02,12:50:32,2,3,1557,65,0.309
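The same walk over the line can also return records instead of printing them, with the number of control values as a parameter. This is only a sketch of one possible variant, not the answerer's code: `SensorRow`, `take`, `normalize` and `n_control` are names introduced here, and the field layout is the one assumed in the answer above.

```python
from collections import namedtuple
from itertools import islice

SensorRow = namedtuple("SensorRow", "date time group sensor raw1 raw2 calc")


def take(it, n):
    """Consume and return the next n items of an iterator as a list."""
    return list(islice(it, n))


def normalize(line, n_control=4):
    """Return one SensorRow per sensor for a single log line."""
    it = iter(line.strip().split(","))
    date, time, _counter, n_groups = take(it, 4)
    partial = []
    for _ in range(int(n_groups)):
        group, n_sensors = take(it, 2)
        n_sensors = int(n_sensors)
        take(it, n_control)        # skip the control values
        raw1 = take(it, n_sensors)
        raw2 = take(it, n_sensors)
        for k in range(n_sensors):
            partial.append((group, str(k + 1), raw1[k], raw2[k]))
    calc = list(it)                # the remainder: one calculated value per sensor
    if len(calc) != len(partial):
        raise ValueError("expected %d calculated values, got %d"
                         % (len(partial), len(calc)))
    return [SensorRow._make((date, time) + p + (c,))
            for p, c in zip(partial, calc)]


sample = "2012-01-02,12:50:32,658,2,1,2,0,0,0,0,1556,1555,62,60,2,3,0,0,0,0,1559,1557,1557,63,64,65,0.305,0.265,0.304,0.308,0.309"
rows = normalize(sample)
for row in rows:
    print(",".join(row))
```

Consuming a single iterator removes the index arithmetic, and the length check catches lines whose calculated tail does not match the declared sensor counts.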
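Answer 1 (score: 1)
This is a non-trivial task. Probably the most "pythonic" way is to create a class.
I took the liberty, and the time, to put together an example: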
from collections import namedtuple

class DataPack(object):

    def __init__(self, line, seperator=',', headerfields=None, groupfields=None):
        self.seperator = seperator
        self.header_fields = headerfields or ('date', 'time', 'nr', 'groups')
        self.group_fields = groupfields or ('nr', 'sensors', 'controlfields',
                                            't1values', 't2values')
        Header = namedtuple('Header', self.header_fields)
        # data_start has to be set before it is used to slice the header
        self.data_start = len(self.header_fields)
        self.header_part = line.split(self.seperator)[:self.data_start]
        self.data_part = line.split(self.seperator)[self.data_start:]
        self.header = Header(*self.header_part)
        self.groups = self._create_groups(self.data_part, self.header.groups)

    def _create_groups(self, datalst, groups):
        """nr, sensors, controlfield * 4, t1value * sensors, t2value * sensors"""
        Group = namedtuple('DataGroup', self.group_fields)
        _groups = []
        for i in range(int(groups)):
            nr = datalst[0]
            sensors = datalst[1]
            controlfields = datalst[2:6]
            t1values = datalst[6:6+int(sensors)]
            t2values = datalst[6+int(sensors):6+int(sensors)*2]
            _groups.append(Group(nr, sensors, controlfields, t1values, t2values))
            # drop the fields of this group before parsing the next one
            datalst = datalst[6+int(sensors)*2:]
        return _groups

    def __str__(self):
        _return = []
        for group in self.groups:
            for sensor in range(int(group.sensors)):
                _return.append('%s, ' % self.header.date.replace('-', '/'))
                _return.append('%s, ' % self.header.time)
                _return.append('%s, ' % group.nr)
                _return.append('%s, ' % (int(sensor) + 1,))
                _return.append('(%s, ' % group.t1values[int(sensor)])
                _return.append('%s)\n' % group.t2values[int(sensor)])
        return ''.join(_return)

if __name__ == '__main__':
    line = """2012-01-02,12:50:32,658,2,1,2,0,0,0,0,1556,1555,62,60,2,3,0,0,0,0,1559,1557,1557,63,64,65,0.305,0.265,0.304,0.308,0.309"""
    data = DataPack(line)
    print(' '.join(data.header))
    for i in data.groups:
        print(i)
    print(data)
    print('cfield 0:2 ', data.groups[0].controlfields[2])
    print('t2value 1:2 ', data.groups[1].t2values[2])
For bigger changes to the input data, you will have to subclass and override the _create_groups and __str__ methods.
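As a rough illustration of that subclass-and-override idea, here is a minimal sketch. It uses a stripped-down stand-in class rather than the DataPack above, so that it runs on its own. `MiniPack`, `NoControlPack`, and the shortened input line without control fields are all hypothetical and only serve the example.

```python
class MiniPack(object):
    """Stripped-down stand-in for DataPack (hypothetical, illustration only)."""

    def __init__(self, line, separator=","):
        fields = line.split(separator)
        self.date, self.time = fields[0], fields[1]
        self.groups = self._create_groups(fields[4:], int(fields[3]))

    def _create_groups(self, data, n_groups):
        """Default layout: nr, sensor count, 4 control fields, t1 values, t2 values."""
        groups = []
        for _ in range(n_groups):
            nr, n = data[0], int(data[1])
            groups.append((nr, data[6:6 + n], data[6 + n:6 + 2 * n]))
            data = data[6 + 2 * n:]
        return groups

    def __str__(self):
        return "\n".join(
            "%s, %s, %s, %d, (%s, %s)" % (self.date, self.time, nr, k + 1, t1[k], t2[k])
            for nr, t1, t2 in self.groups
            for k in range(len(t1))
        )


class NoControlPack(MiniPack):
    """Hypothetical input variant whose groups carry no control fields."""

    def _create_groups(self, data, n_groups):
        groups = []
        for _ in range(n_groups):
            nr, n = data[0], int(data[1])
            groups.append((nr, data[2:2 + n], data[2 + n:2 + 2 * n]))
            data = data[2 + 2 * n:]
        return groups


# Made-up line: one group, two sensors, control fields left out
short_line = "2012-01-02,12:50:32,658,1,1,2,1556,1555,62,60,0.305,0.265"
report = str(NoControlPack(short_line))
print(report)
```

Only the group parsing changes in the subclass. The constructor and the string formatting are inherited unchanged.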