Pandas data reduction and merging

Time: 2016-09-10 00:26:23

Tags: python pandas reduction

I am working with a Pandas (version 0.17.1) DataFrame that looks like this:

                         time   type   module     msg_type         content
36636 2016-08-25 17:59:50.051   INFO  MOD_1_NAME  STATUS  Received Status Monitoring from MODULE_1 'Property A' = some_value_1
36637 2016-08-25 17:59:50.051   INFO  MOD_1_NAME  STATUS  Received Status Monitoring from MODULE_1 'Property B' = some_value_2
36638 2016-08-25 17:59:50.051   INFO  MOD_1_NAME  STATUS  Received Status Monitoring from MODULE_1 'Property C' = some_value_3
36639 2016-08-25 17:59:50.051   INFO  MOD_1_NAME  STATUS  Received Status Monitoring from MODULE_1 'Property D' = some_value_4
36715 2016-08-25 17:59:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 1' = some_value_a
36716 2016-08-25 17:59:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 2' = some_value_b
36717 2016-08-25 17:59:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 3' = some_value_c
36718 2016-08-25 17:59:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 4' = some_value_d
36719 2016-08-25 17:59:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 5' = some_value_e
36720 2016-08-25 17:59:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 6' = some_value_f
36721 2016-08-25 17:59:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 7' = some_value_g
36722 2016-08-25 17:59:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 8' = some_value_h
36723 2016-08-25 17:59:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 9' = some_value_i
36724 2016-08-25 17:59:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 10' = some_value_j
36725 2016-08-25 17:59:50.964  ERROR   MOD_2_NAME  STATUS  Didn't receive Status Monitoring 'Parameter 11' from MODULE_2!
36726 2016-08-25 17:59:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 12' = some_value_k
36727 2016-08-25 17:59:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 13' = some_value_l
36785 2016-08-25 18:59:50.051   INFO  MOD_1_NAME  STATUS  Received Status Monitoring from MODULE_1 'Property A' = some_value_1
36786 2016-08-25 18:59:50.051   INFO  MOD_1_NAME  STATUS  Received Status Monitoring from MODULE_1 'Property B' = some_value_2
36787 2016-08-25 18:59:50.051   INFO  MOD_1_NAME  STATUS  Received Status Monitoring from MODULE_1 'Property C' = some_value_3
36788 2016-08-25 18:59:50.051   INFO  MOD_1_NAME  STATUS  Received Status Monitoring from MODULE_1 'Property D' = some_value_4
36827 2016-08-25 19:01:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 1' = some_value_a
36828 2016-08-25 19:01:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 2' = some_value_b
36829 2016-08-25 19:01:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 3' = some_value_c
36830 2016-08-25 19:01:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 4' = some_value_d
36831 2016-08-25 19:01:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 5' = some_value_e
36832 2016-08-25 19:01:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 6' = some_value_f
36833 2016-08-25 19:01:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 7' = some_value_g
36834 2016-08-25 19:01:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 8' = some_value_h
36835 2016-08-25 19:01:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 9' = some_value_i
36836 2016-08-25 19:01:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 10' = some_value_j
36837 2016-08-25 19:01:50.964  ERROR   MOD_2_NAME  STATUS  Didn't receive Status Monitoring 'Parameter 11' from MODULE_2!
36838 2016-08-25 19:01:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 12' = some_value_k
36839 2016-08-25 19:01:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 13' = some_value_l

(The frame has already been filtered down to remove uninteresting rows. That is why numbers are missing from the index column.)

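As you can see, multiple parameters are read from a device at the same time, and each reading is a separate row. I would like to "reduce" or "compress" the frame so that there is only one row per reading. I would also like the content column to become a dictionary so that I can easily look up specific items of interest. The result would look something like this: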

                         time   type   module     msg_type         content
36636 2016-08-25 17:59:50.051   INFO  MOD_1_NAME  STATUS  {'Property A' = 'some_value_1', 'Property B' = 'some_value_2', 'Property C' = 'some_value_3', 'Property D' = 'some_value_4'}
36715 2016-08-25 17:59:50.964   INFO   MOD_2_NAME  STATUS  {'Parameter 1' = 'some_value_a', 'Parameter 2' = 'some_value_b', 'Parameter 3' = 'some_value_c', 'Parameter 4' = 'some_value_d', 'Parameter 5' = 'some_value_e', 'Parameter 6' = 'some_value_f', 'Parameter 7' = 'some_value_g','Parameter 8' = some_value_h, 'Parameter 9' = 'some_value_i', 'Parameter 10' = 'some_value_j', 'Parameter 11' = '', 'Parameter 12' = 'some_value_k', 'Parameter 13' = 'some_value_l'}
36785 2016-08-25 18:59:50.051   INFO  MOD_1_NAME  STATUS  {'Property A' = 'some_value_1', 'Property B' = 'some_value_2', 'Property C' = 'some_value_3', 'Property D' = 'some_value_4'}
36827 2016-08-25 19:01:50.964   INFO   MOD_2_NAME  STATUS  {'Parameter 1' = 'some_value_a', 'Parameter 2' = 'some_value_b', 'Parameter 3' = 'some_value_c', 'Parameter 4' = 'some_value_d', 'Parameter 5' = 'some_value_e', 'Parameter 6' = 'some_value_f', 'Parameter 7' = 'some_value_g','Parameter 8' = some_value_h, 'Parameter 9' = 'some_value_i', 'Parameter 10' = 'some_value_j', 'Parameter 11' = '', 'Parameter 12' = 'some_value_k', 'Parameter 13' = 'some_value_l'}

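So basically I want all rows that have the same values in the time and module columns to be "merged", with the content column parsed into a dictionary. (There may also be some "missing" or "empty" readings.) I don't want to filter or remove data, just simplify and summarize it.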

I'm guessing I need some combination of groupby(), transform(), and apply(), but I'm not sure where to start.

Part of my difficulty is that I can't inspect the result of groupby() to see whether it is doing what I want.

g1 = df.groupby(['module', 'time'])

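g1 does not show up in the Spyder variable explorer, and print doesn't show anything. I can't access the index attribute or call info() on g1. But I doubt whether groupby() is even appropriate here... I don't want to eliminate anything.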

I have been searching for an example, but I keep getting what seem to be false hits. Any help getting started would be appreciated.

3 Answers:

Answer 0: (score: 2)

pv = df.set_index(['time', 'type', 'module', 'msg_type']) \
       .content.str.extract(r"'(?P<prop>.+)' = (?P<val>.+)", expand=True)

pv.groupby(level=[0, 2]).apply(lambda df: df.set_index('prop').val.to_dict())
2016-08-25 17:59:50.051,MOD_1_NAME,"{'Property A': 'some_value_1', 'Property C': 'some_value_3', 'Property B': 'some_value_2', 'Property D': 'some_value_4'}"
2016-08-25 17:59:50.964,MOD_2_NAME,"{'Parameter 6': 'some_value_f', 'Parameter 7': 'some_value_g', 'Parameter 4': 'some_value_d', 'Parameter 5': 'some_value_e', 'Parameter 2': 'some_value_b', 'Parameter 3': 'some_value_c', 'Parameter 1': 'some_value_a', 'Parameter 8': 'some_value_h', 'Parameter 9': 'some_value_i', 'Parameter 10': 'some_value_j', 'Parameter 12': 'some_value_k', 'Parameter 13': 'some_value_l'}"
2016-08-25 18:59:50.051,MOD_1_NAME,"{'Property A': 'some_value_1', 'Property C': 'some_value_3', 'Property B': 'some_value_2', 'Property D': 'some_value_4'}"
2016-08-25 19:01:50.964,MOD_2_NAME,"{'Parameter 6': 'some_value_f', 'Parameter 7': 'some_value_g', 'Parameter 4': 'some_value_d', 'Parameter 5': 'some_value_e', 'Parameter 2': 'some_value_b', 'Parameter 3': 'some_value_c', 'Parameter 1': 'some_value_a', 'Parameter 8': 'some_value_h', 'Parameter 9': 'some_value_i', 'Parameter 10': 'some_value_j', 'Parameter 12': 'some_value_k', 'Parameter 13': 'some_value_l'}"
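Note that 'Parameter 11' is absent from the dictionaries above. A likely explanation is that the ERROR message contains no " = " part, so the pattern does not match that row. Here is a minimal sketch of just the extraction step on two sample messages (the `messages` and `parsed` names are made up for illustration):

```python
import pandas as pd

# Two sample messages: one successful reading and one failed reading.
messages = pd.Series([
    "Received Status Monitoring from MODULE_1 'Property A' = some_value_1",
    "Didn't receive Status Monitoring 'Parameter 11' from MODULE_2!",
])

# Named groups become the column names of the resulting DataFrame.
parsed = messages.str.extract(r"'(?P<prop>.+)' = (?P<val>.+)", expand=True)

# Rows where the pattern does not match are filled with NaN.
print(parsed)
```

If the missing readings should be kept, the non-matching rows would have to be handled separately before building the dictionaries.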

Answer 1: (score: 1)

Define a function, then use groupby() followed by apply():

In [234]: import re

In [235]: def create_data_dict(rows):
     ...:     # Join the group's messages and collect every 'name' = value pair
     ...:     return {k:v for k,v in re.findall(r"'([^']*)' = ([^ ]*)", ' '.join(rows.content.astype(str)))}
     ...: 

In [236]: df[df['type'] != 'ERROR'].groupby(['time', 'module', 'msg_type']).apply(create_data_dict).to_frame(name = 'content').reset_index()
Out[236]: 
                      time      module msg_type                                                                                                                                                                                                                                                                                                                                                                                                          content
0  2016-08-25 17:59:50.051  MOD_1_NAME   STATUS                                                                                                                                                                                                                                                                                 {u'Property A': u'some_value_1', u'Property C': u'some_value_3', u'Property B': u'some_value_2', u'Property D': u'some_value_4'}
1  2016-08-25 17:59:50.964  MOD_2_NAME   STATUS  {u'Parameter 6': u'some_value_f', u'Parameter 7': u'some_value_g', u'Parameter 4': u'some_value_d', u'Parameter 5': u'some_value_e', u'Parameter 2': u'some_value_b', u'Parameter 3': u'some_value_c', u'Parameter 1': u'some_value_a', u'Parameter 8': u'some_value_h', u'Parameter 9': u'some_value_i', u'Parameter 10': u'some_value_j', u'Parameter 12': u'some_value_k', u'Parameter 13': u'some_value_l'}
2  2016-08-25 18:59:50.051  MOD_1_NAME   STATUS                                                                                                                                                                                                                                                                                 {u'Property A': u'some_value_1', u'Property C': u'some_value_3', u'Property B': u'some_value_2', u'Property D': u'some_value_4'}
3  2016-08-25 19:01:50.964  MOD_2_NAME   STATUS  {u'Parameter 6': u'some_value_f', u'Parameter 7': u'some_value_g', u'Parameter 4': u'some_value_d', u'Parameter 5': u'some_value_e', u'Parameter 2': u'some_value_b', u'Parameter 3': u'some_value_c', u'Parameter 1': u'some_value_a', u'Parameter 8': u'some_value_h', u'Parameter 9': u'some_value_i', u'Parameter 10': u'some_value_j', u'Parameter 12': u'some_value_k', u'Parameter 13': u'some_value_l'}
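The call above filters out the ERROR rows, so 'Parameter 11' does not appear in the result, whereas the question's desired output lists it with an empty value. One possible variation is sketched below. It uses only the re module, and the names `RECEIVED`, `MISSING` and `parse_readings` are made up here. It also assumes the two message formats shown in the question are the only ones that occur.

```python
import re

# Matches "... 'Name' = value" in a successful reading.
RECEIVED = re.compile(r"'([^']+)' = (\S+)")
# Matches "... Status Monitoring 'Name' ..." in a failed reading.
# Anchoring on the preceding words avoids the apostrophe in "Didn't".
MISSING = re.compile(r"Status Monitoring '([^']+)'")

def parse_readings(messages):
    """Collect all readings into one dict; missing readings map to ''."""
    readings = {}
    for message in messages:
        found = RECEIVED.search(message)
        if found:
            readings[found.group(1)] = found.group(2)
            continue
        found = MISSING.search(message)
        if found:
            readings[found.group(1)] = ''
    return readings

messages = [
    "Received Status Monitoring from MODULE_2 'Parameter 10' = some_value_j",
    "Didn't receive Status Monitoring 'Parameter 11' from MODULE_2!",
    "Received Status Monitoring from MODULE_2 'Parameter 12' = some_value_k",
]
result = parse_readings(messages)
print(result)
```

A function like this could be called on each group's content values without first filtering on the type column.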

Answer 2: (score: 1)

To learn about groups in pandas, you should take a look at http://pandas.pydata.org/pandas-docs/stable/groupby.html#groupby-object-attributes. Another way to understand the groups is simply to print them:

grouped = df.groupby(['A', 'B'])
print(grouped.first())  # prints the first row of each group

# print each (name, group) tuple from grouped
for name, grp in grouped:
    print(name)
    print(grp)
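Besides printing, a GroupBy object exposes a few attributes and methods that make it easier to check what the grouping did. The sketch below uses a tiny made-up frame (shortened time strings and messages, not the real data). It also shows that grouping by itself does not drop any rows.

```python
import pandas as pd

# A small stand-in for the frame in the question (values are illustrative).
df = pd.DataFrame({
    'time': ['17:59:50.051', '17:59:50.051', '17:59:50.964'],
    'module': ['MOD_1_NAME', 'MOD_1_NAME', 'MOD_2_NAME'],
    'content': [
        "'Property A' = some_value_1",
        "'Property B' = some_value_2",
        "'Parameter 1' = some_value_a",
    ],
})

g1 = df.groupby(['module', 'time'])

n_groups = g1.ngroups   # number of distinct (module, time) pairs
sizes = g1.size()       # number of rows in each group, as a Series
one_group = g1.get_group(('MOD_1_NAME', '17:59:50.051'))  # rows of one group

print(n_groups)
print(sizes)
print(one_group)
```

The group sizes add up to the length of the original frame, so every row still belongs to exactly one group.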

I worked out a specific solution for you based on some assumptions I made (see the notes below):

import re
from collections import OrderedDict

import pandas as pd

df = pd.read_csv('/Users/shawnheide/Desktop/test.csv')

def custom_agg(contents):
    # Build one ordered {name: value} dict from all messages in a group
    this_dict = OrderedDict()
    for content in contents:
        match = re.findall(r"Property \w+|Parameter \d+", content)
        if match:
            key = match[0]
            match = re.findall(r"some_value_\w+", content)
            # A missing reading (the ERROR row) is stored as an empty string
            value = match[0] if match else ''
            # Only store an entry when a key was actually found in this row
            this_dict[key] = value
    return this_dict

grps = df.groupby(['time', 'module', ], as_index=False)
df_grp = grps.agg({'content': custom_agg})

Output:

time    module  content
0   2016-08-25 17:59:50.051 MOD_1_NAME  {'Property A': 'some_value_1', 'Property B': 'some_value_2', 'Property C': 'some_value_3', 'Property D': 'some_value_4'}
1   2016-08-25 17:59:50.964 MOD_2_NAME  {'Parameter 1': 'some_value_a', 'Parameter 2': 'some_value_b', 'Parameter 3': 'some_value_c', 'Parameter 4': 'some_value_d', 'Parameter 5': 'some_value_e', 'Parameter 6': 'some_value_f', 'Parameter 7': 'some_value_g', 'Parameter 8': 'some_value_h', 'Parameter 9': 'some_value_i', 'Parameter 10': 'some_value_j', 'Parameter 11': '', 'Parameter 12': 'some_value_k', 'Parameter 13': 'some_value_l'}
2   2016-08-25 18:59:50.051 MOD_1_NAME  {'Property A': 'some_value_1', 'Property B': 'some_value_2', 'Property C': 'some_value_3', 'Property D': 'some_value_4'}
3   2016-08-25 19:01:50.964 MOD_2_NAME  {'Parameter 1': 'some_value_a', 'Parameter 2': 'some_value_b', 'Parameter 3': 'some_value_c', 'Parameter 4': 'some_value_d', 'Parameter 5': 'some_value_e', 'Parameter 6': 'some_value_f', 'Parameter 7': 'some_value_g', 'Parameter 8': 'some_value_h', 'Parameter 9': 'some_value_i', 'Parameter 10': 'some_value_j', 'Parameter 11': '', 'Parameter 12': 'some_value_k', 'Parameter 13': 'some_value_l'}

Things to consider:

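First, you should post your data in a format that others can read (e.g. csv, tsv, etc.). That makes it easier for others to import it and help you with your problem.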

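The second issue is that your proposed result includes the index and the type and msg_type columns. Given that you are not grouping on those columns, this doesn't really make sense, but it is just something to consider.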
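If those extra columns are wanted anyway, one option is to take them from the first row of each group. The sketch below makes that assumption explicit: the index label, type and msg_type come from the first row, even when a later row in the group is an ERROR. The frame is a shortened stand-in for the real data, and `records`, `first_labels` and `summary` are names chosen for illustration.

```python
import re
import pandas as pd

# Shortened stand-in for the frame in the question.
df = pd.DataFrame(
    {
        'time': ['17:59:50.051', '17:59:50.051', '17:59:50.964'],
        'type': ['INFO', 'INFO', 'INFO'],
        'module': ['MOD_1_NAME', 'MOD_1_NAME', 'MOD_2_NAME'],
        'msg_type': ['STATUS', 'STATUS', 'STATUS'],
        'content': [
            "Received Status Monitoring from MODULE_1 'Property A' = some_value_1",
            "Received Status Monitoring from MODULE_1 'Property B' = some_value_2",
            "Received Status Monitoring from MODULE_2 'Parameter 1' = some_value_a",
        ],
    },
    index=[36636, 36637, 36715],
)

records = []
first_labels = []
# sort=False keeps the groups in order of first appearance.
for (time, module), grp in df.groupby(['time', 'module'], sort=False):
    first = grp.iloc[0]  # assumption: the first row represents the group
    records.append({
        'time': time,
        'type': first['type'],
        'module': module,
        'msg_type': first['msg_type'],
        'content': dict(re.findall(r"'([^']+)' = (\S+)", ' '.join(grp['content']))),
    })
    first_labels.append(grp.index[0])

# One row per (time, module) pair, labelled with the group's first index value.
summary = pd.DataFrame(records, index=first_labels)
print(summary)
```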

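Finally, to get an ordered dictionary you need to use the OrderedDict class from the collections module, because Python dicts do not maintain order (fingers crossed that this feature will arrive in 3.6).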
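As a small illustration of that last point, an OrderedDict iterates over its keys in the order they were inserted, whichever Python version is used. (Plain dicts also preserve insertion order as a language guarantee from Python 3.7 onward, so this matters mostly for older interpreters.) The sample names and values below are taken from the question. The insertion order is deliberately not sorted.

```python
from collections import OrderedDict

readings = OrderedDict()
# Insert in a deliberately unsorted order.
for name, value in [
    ('Parameter 2', 'some_value_b'),
    ('Parameter 10', 'some_value_j'),
    ('Parameter 1', 'some_value_a'),
]:
    readings[name] = value

# Keys come back in insertion order, not sorted order.
keys = list(readings)
print(keys)
```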