I'm working with a Pandas (version 0.17.1) DataFrame that looks like this:
time type module msg_type content
36636 2016-08-25 17:59:50.051 INFO MOD_1_NAME STATUS Received Status Monitoring from MODULE_1 'Property A' = some_value_1
36637 2016-08-25 17:59:50.051 INFO MOD_1_NAME STATUS Received Status Monitoring from MODULE_1 'Property B' = some_value_2
36638 2016-08-25 17:59:50.051 INFO MOD_1_NAME STATUS Received Status Monitoring from MODULE_1 'Property C' = some_value_3
36639 2016-08-25 17:59:50.051 INFO MOD_1_NAME STATUS Received Status Monitoring from MODULE_1 'Property D' = some_value_4
36715 2016-08-25 17:59:50.964 INFO MOD_2_NAME STATUS Received Status Monitoring from MODULE_2 'Parameter 1' = some_value_a
36716 2016-08-25 17:59:50.964 INFO MOD_2_NAME STATUS Received Status Monitoring from MODULE_2 'Parameter 2' = some_value_b
36717 2016-08-25 17:59:50.964 INFO MOD_2_NAME STATUS Received Status Monitoring from MODULE_2 'Parameter 3' = some_value_c
36718 2016-08-25 17:59:50.964 INFO MOD_2_NAME STATUS Received Status Monitoring from MODULE_2 'Parameter 4' = some_value_d
36719 2016-08-25 17:59:50.964 INFO MOD_2_NAME STATUS Received Status Monitoring from MODULE_2 'Parameter 5' = some_value_e
36720 2016-08-25 17:59:50.964 INFO MOD_2_NAME STATUS Received Status Monitoring from MODULE_2 'Parameter 6' = some_value_f
36721 2016-08-25 17:59:50.964 INFO MOD_2_NAME STATUS Received Status Monitoring from MODULE_2 'Parameter 7' = some_value_g
36722 2016-08-25 17:59:50.964 INFO MOD_2_NAME STATUS Received Status Monitoring from MODULE_2 'Parameter 8' = some_value_h
36723 2016-08-25 17:59:50.964 INFO MOD_2_NAME STATUS Received Status Monitoring from MODULE_2 'Parameter 9' = some_value_i
36724 2016-08-25 17:59:50.964 INFO MOD_2_NAME STATUS Received Status Monitoring from MODULE_2 'Parameter 10' = some_value_j
36725 2016-08-25 17:59:50.964 ERROR MOD_2_NAME STATUS Didn't receive Status Monitoring 'Parameter 11' from MODULE_2!
36726 2016-08-25 17:59:50.964 INFO MOD_2_NAME STATUS Received Status Monitoring from MODULE_2 'Parameter 12' = some_value_k
36727 2016-08-25 17:59:50.964 INFO MOD_2_NAME STATUS Received Status Monitoring from MODULE_2 'Parameter 13' = some_value_l
36785 2016-08-25 18:59:50.051 INFO MOD_1_NAME STATUS Received Status Monitoring from MODULE_1 'Property A' = some_value_1
36786 2016-08-25 18:59:50.051 INFO MOD_1_NAME STATUS Received Status Monitoring from MODULE_1 'Property B' = some_value_2
36787 2016-08-25 18:59:50.051 INFO MOD_1_NAME STATUS Received Status Monitoring from MODULE_1 'Property C' = some_value_3
36788 2016-08-25 18:59:50.051 INFO MOD_1_NAME STATUS Received Status Monitoring from MODULE_1 'Property D' = some_value_4
36827 2016-08-25 19:01:50.964 INFO MOD_2_NAME STATUS Received Status Monitoring from MODULE_2 'Parameter 1' = some_value_a
36828 2016-08-25 19:01:50.964 INFO MOD_2_NAME STATUS Received Status Monitoring from MODULE_2 'Parameter 2' = some_value_b
36829 2016-08-25 19:01:50.964 INFO MOD_2_NAME STATUS Received Status Monitoring from MODULE_2 'Parameter 3' = some_value_c
36830 2016-08-25 19:01:50.964 INFO MOD_2_NAME STATUS Received Status Monitoring from MODULE_2 'Parameter 4' = some_value_d
36831 2016-08-25 19:01:50.964 INFO MOD_2_NAME STATUS Received Status Monitoring from MODULE_2 'Parameter 5' = some_value_e
36832 2016-08-25 19:01:50.964 INFO MOD_2_NAME STATUS Received Status Monitoring from MODULE_2 'Parameter 6' = some_value_f
36833 2016-08-25 19:01:50.964 INFO MOD_2_NAME STATUS Received Status Monitoring from MODULE_2 'Parameter 7' = some_value_g
36834 2016-08-25 19:01:50.964 INFO MOD_2_NAME STATUS Received Status Monitoring from MODULE_2 'Parameter 8' = some_value_h
36835 2016-08-25 19:01:50.964 INFO MOD_2_NAME STATUS Received Status Monitoring from MODULE_2 'Parameter 9' = some_value_i
36836 2016-08-25 19:01:50.964 INFO MOD_2_NAME STATUS Received Status Monitoring from MODULE_2 'Parameter 10' = some_value_j
36837 2016-08-25 19:01:50.964 ERROR MOD_2_NAME STATUS Didn't receive Status Monitoring 'Parameter 11' from MODULE_2!
36838 2016-08-25 19:01:50.964 INFO MOD_2_NAME STATUS Received Status Monitoring from MODULE_2 'Parameter 12' = some_value_k
36839 2016-08-25 19:01:50.964 INFO MOD_2_NAME STATUS Received Status Monitoring from MODULE_2 'Parameter 13' = some_value_l
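(The frame has already been trimmed to remove rows of no interest, which is why there are gaps in the index column.)
As you can see, several parameters are read from a device at the same moment, and each reading is a separate row. I would like to "reduce" or "compress" this so that there is only one row per reading event. I would also like the content column to become a dictionary so that I can easily look up specific items of interest. So the result would look something like this: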
time type module msg_type content
36636 2016-08-25 17:59:50.051 INFO MOD_1_NAME STATUS {'Property A' = 'some_value_1', 'Property B' = 'some_value_2', 'Property C' = 'some_value_3', 'Property D' = 'some_value_4'}
36715 2016-08-25 17:59:50.964 INFO MOD_2_NAME STATUS {'Parameter 1' = 'some_value_a', 'Parameter 2' = 'some_value_b', 'Parameter 3' = 'some_value_c', 'Parameter 4' = 'some_value_d', 'Parameter 5' = 'some_value_e', 'Parameter 6' = 'some_value_f', 'Parameter 7' = 'some_value_g','Parameter 8' = some_value_h, 'Parameter 9' = 'some_value_i', 'Parameter 10' = 'some_value_j', 'Parameter 11' = '', 'Parameter 12' = 'some_value_k', 'Parameter 13' = 'some_value_l'}
36785 2016-08-25 18:59:50.051 INFO MOD_1_NAME STATUS {'Property A' = 'some_value_1', 'Property B' = 'some_value_2', 'Property C' = 'some_value_3', 'Property D' = 'some_value_4'}
36827 2016-08-25 19:01:50.964 INFO MOD_2_NAME STATUS {'Parameter 1' = 'some_value_a', 'Parameter 2' = 'some_value_b', 'Parameter 3' = 'some_value_c', 'Parameter 4' = 'some_value_d', 'Parameter 5' = 'some_value_e', 'Parameter 6' = 'some_value_f', 'Parameter 7' = 'some_value_g','Parameter 8' = some_value_h, 'Parameter 9' = 'some_value_i', 'Parameter 10' = 'some_value_j', 'Parameter 11' = '', 'Parameter 12' = 'some_value_k', 'Parameter 13' = 'some_value_l'}
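So basically I want all rows that share the same values in the time and module columns to be "merged", with the content column parsed into a dictionary. (There may also be some "missing" or "empty" readings.) I don't want to filter or drop data, just simplify and summarize it.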
I'm guessing I need some combination of groupby(), transform() and apply(), but I'm not sure where to start.
Part of my difficulty is that I can't inspect the result of groupby() to see whether it is doing what I want.
g1 = df.groupby(['module', 'time'])
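g1 does not show up in the Spyder variable explorer, and print shows nothing. I can't access the index attribute or call info() on g1. But I doubt whether groupby() is even the right tool here... I don't want to eliminate anything.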
I have been searching for an example but keep getting what seem to be false positives. Any help getting started would be appreciated.
Answer 0 (score: 2)
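# Move the grouping columns into the index, then split each content string
# into a property name and a value using named regex groups.
pv = df.set_index(['time', 'type', 'module', 'msg_type']) \
       .content.str.extract(r"'(?P<prop>.+)' = (?P<val>.+)", expand=True)
# Group on index levels 0 (time) and 2 (module) and build one dict per group.
pv.groupby(level=[0, 2]).apply(lambda g: g.set_index('prop').val.to_dict())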
2016-08-25 17:59:50.051,MOD_1_NAME,"{'Property A': 'some_value_1', 'Property C': 'some_value_3', 'Property B': 'some_value_2', 'Property D': 'some_value_4'}"
2016-08-25 17:59:50.964,MOD_2_NAME,"{'Parameter 6': 'some_value_f', 'Parameter 7': 'some_value_g', 'Parameter 4': 'some_value_d', 'Parameter 5': 'some_value_e', 'Parameter 2': 'some_value_b', 'Parameter 3': 'some_value_c', 'Parameter 1': 'some_value_a', 'Parameter 8': 'some_value_h', 'Parameter 9': 'some_value_i', 'Parameter 10': 'some_value_j', 'Parameter 12': 'some_value_k', 'Parameter 13': 'some_value_l'}"
2016-08-25 18:59:50.051,MOD_1_NAME,"{'Property A': 'some_value_1', 'Property C': 'some_value_3', 'Property B': 'some_value_2', 'Property D': 'some_value_4'}"
2016-08-25 19:01:50.964,MOD_2_NAME,"{'Parameter 6': 'some_value_f', 'Parameter 7': 'some_value_g', 'Parameter 4': 'some_value_d', 'Parameter 5': 'some_value_e', 'Parameter 2': 'some_value_b', 'Parameter 3': 'some_value_c', 'Parameter 1': 'some_value_a', 'Parameter 8': 'some_value_h', 'Parameter 9': 'some_value_i', 'Parameter 10': 'some_value_j', 'Parameter 12': 'some_value_k', 'Parameter 13': 'some_value_l'}"
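Note that 'Parameter 11' is absent from the dictionaries above, because the "Didn't receive" row has no "= value" part for the pattern to match. The question asks that missing readings be kept, so here is a minimal sketch of one way to do that. It is not the answerer's code: the three-row frame, the relaxed pattern and the row-building loop are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical three-row frame shaped like the log in the question.
df = pd.DataFrame({
    "time": ["2016-08-25 17:59:50.964"] * 3,
    "module": ["MOD_2_NAME"] * 3,
    "content": [
        "Received Status Monitoring from MODULE_2 'Parameter 10' = some_value_j",
        "Didn't receive Status Monitoring 'Parameter 11' from MODULE_2!",
        "Received Status Monitoring from MODULE_2 'Parameter 12' = some_value_k",
    ],
})

# Quoted name, optionally followed by " = value". The lookbehind requires
# whitespace before the opening quote, which skips the apostrophe in "Didn't".
pattern = r"(?<=\s)'(?P<prop>[^']+)'(?: = (?P<val>\S+))?"

pv = df.set_index(["time", "module"])["content"].str.extract(pattern, expand=True)
pv["val"] = pv["val"].fillna("")  # a missing reading becomes an empty string

# One output row per (time, module) pair, with the readings collected in a dict.
rows = []
for (time, module), grp in pv.groupby(level=["time", "module"]):
    rows.append({"time": time, "module": module,
                 "content": dict(zip(grp["prop"], grp["val"]))})
result = pd.DataFrame(rows)
print(result.loc[0, "content"])
```

Building the rows in a plain loop avoids relying on how groupby().apply() wraps dict return values, which has differed between pandas versions.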
Answer 1 (score: 1)
In [234]: import re

In [235]: def create_data_dict(rows):
     ...:     return {k: v for k, v in re.findall(r"'([^']*)' = ([^ ]*)", ' '.join(rows.content.astype(str)))}
     ...:

In [236]: df[df['type'] != 'ERROR'].groupby(['time', 'module', 'msg_type']).apply(create_data_dict).to_frame(name = 'content').reset_index()
Out[236]:
time module msg_type content
0 2016-08-25 17:59:50.051 MOD_1_NAME STATUS {u'Property A': u'some_value_1', u'Property C': u'some_value_3', u'Property B': u'some_value_2', u'Property D': u'some_value_4'}
1 2016-08-25 17:59:50.964 MOD_2_NAME STATUS {u'Parameter 6': u'some_value_f', u'Parameter 7': u'some_value_g', u'Parameter 4': u'some_value_d', u'Parameter 5': u'some_value_e', u'Parameter 2': u'some_value_b', u'Parameter 3': u'some_value_c', u'Parameter 1': u'some_value_a', u'Parameter 8': u'some_value_h', u'Parameter 9': u'some_value_i', u'Parameter 10': u'some_value_j', u'Parameter 12': u'some_value_k', u'Parameter 13': u'some_value_l'}
2 2016-08-25 18:59:50.051 MOD_1_NAME STATUS {u'Property A': u'some_value_1', u'Property C': u'some_value_3', u'Property B': u'some_value_2', u'Property D': u'some_value_4'}
3 2016-08-25 19:01:50.964 MOD_2_NAME STATUS {u'Parameter 6': u'some_value_f', u'Parameter 7': u'some_value_g', u'Parameter 4': u'some_value_d', u'Parameter 5': u'some_value_e', u'Parameter 2': u'some_value_b', u'Parameter 3': u'some_value_c', u'Parameter 1': u'some_value_a', u'Parameter 8': u'some_value_h', u'Parameter 9': u'some_value_i', u'Parameter 10': u'some_value_j', u'Parameter 12': u'some_value_k', u'Parameter 13': u'some_value_l'}
Answer 2 (score: 1)
To learn about groups in pandas, have a look at http://pandas.pydata.org/pandas-docs/stable/groupby.html#groupby-object-attributes. Another way to get a feel for the groups is simply to print them:
grouped = df.groupby(['A', 'B'])
print(grouped.first())  # prints the first row of each group
# print each (name, group) tuple from grouped
for name, grp in grouped:
    print(name)
    print(grp)
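Beyond printing, a GroupBy object exposes a few attributes and methods that make it easier to check what the grouping did. The sketch below uses a small made-up frame (the values are placeholders, not the real log) and groups on the same two columns as g1 in the question.

```python
import pandas as pd

# Hypothetical miniature of the log frame; the values are made up.
df = pd.DataFrame({
    "time": ["17:59:50.051", "17:59:50.051", "17:59:50.964"],
    "module": ["MOD_1_NAME", "MOD_1_NAME", "MOD_2_NAME"],
    "content": ["'Property A' = v1", "'Property B' = v2", "'Parameter 1' = va"],
})

g1 = df.groupby(["module", "time"])

n_groups = g1.ngroups      # number of distinct (module, time) pairs
sizes = g1.size()          # Series: how many rows fall into each group
keys = sorted(g1.groups)   # the group keys, as (module, time) tuples
one_group = g1.get_group(("MOD_1_NAME", "17:59:50.051"))  # one group as a DataFrame

print(n_groups)
print(sizes)
print(keys)
print(one_group)
```

Grouping by itself removes nothing: every original row still belongs to exactly one group, so sizes.sum() equals len(df).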
I put together a specific solution for you based on some assumptions I made (see the notes below):
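import re
from collections import OrderedDict

import pandas as pd

df = pd.read_csv('/Users/shawnheide/Desktop/test.csv')


def custom_agg(contents):
    # Build one ordered {name: value} dict from all content strings in a group.
    this_dict = OrderedDict()
    for content in contents:
        # The name is assumed to look like 'Property X' or 'Parameter N'.
        match = re.findall(r"Property \w+|Parameter \d+", content)
        if not match:
            continue  # no recognizable name on this row, so nothing to store
        key = match[0]
        # The value is assumed to start with 'some_value_'; a missing value becomes ''.
        match = re.findall(r"some_value_\w+", content)
        value = match[0] if match else ''
        this_dict[key] = value
    return this_dict


grps = df.groupby(['time', 'module'], as_index=False)
df_grp = grps.agg({'content': custom_agg})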
Output:
time module content
0 2016-08-25 17:59:50.051 MOD_1_NAME {'Property A': 'some_value_1', 'Property B': 'some_value_2', 'Property C': 'some_value_3', 'Property D': 'some_value_4'}
1 2016-08-25 17:59:50.964 MOD_2_NAME {'Parameter 1': 'some_value_a', 'Parameter 2': 'some_value_b', 'Parameter 3': 'some_value_c', 'Parameter 4': 'some_value_d', 'Parameter 5': 'some_value_e', 'Parameter 6': 'some_value_f', 'Parameter 7': 'some_value_g', 'Parameter 8': 'some_value_h', 'Parameter 9': 'some_value_i', 'Parameter 10': 'some_value_j', 'Parameter 11': '', 'Parameter 12': 'some_value_k', 'Parameter 13': 'some_value_l'}
2 2016-08-25 18:59:50.051 MOD_1_NAME {'Property A': 'some_value_1', 'Property B': 'some_value_2', 'Property C': 'some_value_3', 'Property D': 'some_value_4'}
3 2016-08-25 19:01:50.964 MOD_2_NAME {'Parameter 1': 'some_value_a', 'Parameter 2': 'some_value_b', 'Parameter 3': 'some_value_c', 'Parameter 4': 'some_value_d', 'Parameter 5': 'some_value_e', 'Parameter 6': 'some_value_f', 'Parameter 7': 'some_value_g', 'Parameter 8': 'some_value_h', 'Parameter 9': 'some_value_i', 'Parameter 10': 'some_value_j', 'Parameter 11': '', 'Parameter 12': 'some_value_k', 'Parameter 13': 'some_value_l'}
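Things to consider:
First, you should post your data in a format others can read (e.g. csv, tsv), which makes it much easier for people to import it and help with your problem.
Second, your proposed result contains the index and the msg_type column. Given that you aren't grouping on those columns, that doesn't really make sense, but it is just something to think about.
Finally, to get an ordered dictionary you need OrderedDict from the collections module, because Python dicts do not maintain order (fingers crossed that this feature arrives in 3.6).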
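On the ordering point: since Python 3.7 the language guarantees that a plain dict preserves insertion order, so on a current interpreter OrderedDict is only needed when the code must also run on older versions. A small sketch with made-up (name, value) pairs, which also shows the kind of lookup the question asks for:

```python
from collections import OrderedDict

# Hypothetical (name, value) pairs in the order the rows appear in one group.
pairs = [
    ("Parameter 10", "some_value_j"),
    ("Parameter 11", ""),             # missing reading stored as an empty string
    ("Parameter 12", "some_value_k"),
]

plain = dict(pairs)           # insertion-ordered on Python 3.7+
ordered = OrderedDict(pairs)  # ordered on every Python version

same_order = list(plain) == list(ordered)
missing = plain.get("Parameter 11")        # '' means no value was received
absent = plain.get("Parameter 99", "n/a")  # default for a name that never appeared
print(same_order, repr(missing), absent)   # True '' n/a
```

Using .get() with a default distinguishes a parameter that was reported without a value from one that was never reported at all.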