Replacement for dataframe.iterrows()

Asked: 2018-12-13 11:05:41

Tags: python pandas numpy

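I am working on a script that migrates data from MongoDB to ClickHouse. Because nested structures are not implemented well enough in ClickHouse, I iterate over the nested structure and turn it into a flat representation, in which every element of the nested structure becomes a separate row in the ClickHouse database.

What I am trying to do is iterate over a list of dictionaries and extract the target values. The structure looks like this: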

[
 {
  'Comment': None,
  'Details': None,
  'FunnelId': 'MegafonCompany',
  'IsHot': False,
  'IsReadonly': False,
  'Name': 'Новый',
  'SetAt': datetime.datetime(2018, 4, 20, 10, 39, 55, 475000),
  'SetById': 'ekaterina.karpenko',
  'SetByName': 'Екатерина Карпенко',
  'Stage': {
            'Label': 'Новые',
            'Order': 0,
            '_id': 'newStage'
           },
  'Tags': None,
  'Type': 'Unknown',
  'Weight': 120,
  '_id': 'new'
 },
 {
  'Comment': None,
  'Details': {
              'Name': 'взят в работу', 
              '_id': 1
             },
  'FunnelId': 'MegafonCompany',
  'IsHot': False,
  'IsReadonly': False,
  'Name': 'В работе',
  'SetAt': datetime.datetime(2018, 4, 20, 10, 40, 4, 841000),
  'SetById': 'ekaterina.karpenko',
  'SetByName': 'Екатерина Карпенко',
  'Stage': {
            'Label': 'Приглашение на интервью',
            'Order': 1,
            '_id': 'recruiterStage'
           },
  'Tags': None,
  'Type': 'InProgress',
  'Weight': 80,
  '_id': 'phoneInterview'
 }
]

I have a function that does this on a dataframe object using the data.iterrows() method:

import pdb
import time

import numpy as np
import pandas as pd

# stc, dp, n_status_change and path_for_status are project helpers that are not shown in the post.


def to_flat(data, coldict, field_last_upd):
    m_status_history = stc.special_mongo_names['status_history_cand']
    n_statuse_change = coldict['n_statuse_change']['name']

    data[n_statuse_change] = n_status_change(dp.force_take_series(data, m_status_history))
    flat_cols = [x for x in coldict.values() if x['coltype'] == stc.COLTYPE_FLAT]
    old_cols_names = [x['name'] for x in coldict.values() if x['coltype'] == stc.COLTYPE_PREPARATION]
    t_time = time.time()
    t_len = 0
    new_rows = list()

    # The row loop is missing from the post as extracted; per the description it is data.iterrows().
    for _, row in data.iterrows():
        # One output row per status change of the current document
        for j in range(row[n_statuse_change]):
            t_new_value_row = np.empty(shape=[0, 0])
            for col in flat_cols:
                if col['colsubtype'] == stc.COLSUBTYPE_FLATPATH:
                    new_value = dp.under_value_line(
                        row,
                        path_for_status(j, row[n_statuse_change] - 1, col['path'])
                    )
                    # Additionally process the date
                    if col['name'] == coldict['status_set_at']['name']:
                        new_value = dp.iso_date_to_datetime(new_value)

                    if col['name'] == coldict['status_set_at_mil']['name']:
                        new_value = dp.iso_date_to_miliseconds(new_value)

                    if col['name'] == coldict['status_stage_order']['name']:
                        try:
                            new_value = int(new_value)
                        except (TypeError, ValueError):
                            pass  # keep the original value if it is not convertible
                else:
                    if col['name'] == coldict['status_index']['name']:
                        new_value = j

                t_new_value_row = np.append(t_new_value_row, dp.some_to_null(new_value))
            new_rows.append(np.append(row[old_cols_names].values, t_new_value_row))

    pdb.set_trace()  # debugging breakpoint left in the posted code
    res = pd.DataFrame(new_rows, columns=[
        x['name'] for x in coldict.values()
        if x['coltype'] == stc.COLTYPE_FLAT or x['coltype'] == stc.COLTYPE_PREPARATION
    ])

    return res

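It takes the values from the list of dictionaries, uses numpy arrays to prepare them to meet ClickHouse's requirements, and then appends them all together to produce a new dataframe with the target values and their column names.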
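One thing worth checking in a function like this is the repeated np.append. NumPy documents that np.append does not work in place: every call allocates a new array and copies the existing contents into it. Below is a minimal sketch of the two patterns, using made-up numeric values and two hypothetical helper names (row_with_np_append, row_with_list). It illustrates the difference in approach and is not a measurement of the original function.

```python
import numpy as np


def row_with_np_append(values):
    """Pattern from the question: grow an array one element at a time."""
    out = np.empty(shape=[0, 0])
    for v in values:
        # np.append flattens its inputs and returns a new copy on each call
        out = np.append(out, v)
    return out


def row_with_list(values):
    """Alternative: accumulate in a plain list, with no per-element array copy."""
    out = []
    for v in values:
        out.append(v)
    return out


sample = [0, 120, 80]  # made-up numeric cell values for one output row
slow_row = row_with_np_append(sample)
fast_row = row_with_list(sample)
print(slow_row.tolist() == fast_row)
```

Whether this is the dominant cost would need profiling. iterrows() also builds a Series object for every row, which may matter just as much.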

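I have noticed that when the nested structure is large enough, the function becomes much slower. I found an article that compares different ways of iterating in Python.

It claims that iterating with the .apply() method is much faster, and that vectorization is faster still. But the examples given are quite trivial and rely on applying the same function to all values. Is it possible to iterate over a pandas object in a faster way while using a variety of functions for different types of data?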
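One way to apply different functions to different fields without iterrows() is to keep a mapping from output column to a (path, converter) pair and build the rows with a plain comprehension over the raw documents. The sketch below assumes each document carries its statuses in a list under a hypothetical key "StatusHistory"; the column names, the dig helper and the converters are also made up for illustration and are not the helpers from the question.

```python
import datetime

import pandas as pd

# Hypothetical input: one document per candidate, each with a list of status dicts.
docs = [
    {"_id": "cand1", "StatusHistory": [
        {"Name": "new", "SetAt": datetime.datetime(2018, 4, 20, 10, 39, 55),
         "Stage": {"Order": "0"}},
        {"Name": "in_progress", "SetAt": datetime.datetime(2018, 4, 20, 10, 40, 4),
         "Stage": {"Order": "1"}},
    ]},
    {"_id": "cand2", "StatusHistory": []},  # no statuses, so no output rows
]


def to_int_or_same(value):
    """Convert to int when possible, otherwise return the value unchanged."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return value


def to_utc_millis(value):
    """Milliseconds since the epoch, treating naive datetimes as UTC (an assumption)."""
    if value is None:
        return None
    return int(value.replace(tzinfo=datetime.timezone.utc).timestamp() * 1000)


def dig(obj, path):
    """Follow a tuple of keys through nested dicts; return None if a level is missing."""
    for key in path:
        if not isinstance(obj, dict):
            return None
        obj = obj.get(key)
    return obj


# Output column -> (path inside one status dict, converter for that column)
FLAT_COLS = {
    "status_name": (("Name",), lambda v: v),
    "status_set_at_ms": (("SetAt",), to_utc_millis),
    "status_stage_order": (("Stage", "Order"), to_int_or_same),
}

# One flat row per (document, status) pair
rows = [
    {"_id": doc["_id"], "status_index": i,
     **{col: conv(dig(status, path)) for col, (path, conv) in FLAT_COLS.items()}}
    for doc in docs
    for i, status in enumerate(doc["StatusHistory"])
]
flat = pd.DataFrame(rows)
print(flat)
```

The DataFrame is built once at the end from a list of dicts, so no pandas object is created per row during the loop.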

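1 answer:

Answer 0 (score: 0)

I think your first step should be to convert your data into a pandas dataframe, which makes it much easier to work with. I couldn't work out the exact functions you want to run, but perhaps my example helps.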

import datetime
import pandas as pd

data_dict_array = [
 {
  'Comment': None,
  'Details': None,
  'FunnelId': 'MegafonCompany',
  'IsHot': False,
  'IsReadonly': False,
  'Name': 'Новый',
  'SetAt': datetime.datetime(2018, 4, 20, 10, 39, 55, 475000),
  'SetById': 'ekaterina.karpenko',
  'SetByName': 'Екатерина Карпенко',
  'Stage': {
            'Label': 'Новые',
            'Order': 0,
            '_id': 'newStage'
           },
  'Tags': None,
  'Type': 'Unknown',
  'Weight': 120,
  '_id': 'new'
 },
 {
  'Comment': None,
  'Details': {
              'Name': 'взят в работу', 
              '_id': 1
             },
  'FunnelId': 'MegafonCompany',
  'IsHot': False,
  'IsReadonly': False,
  'Name': 'В работе',
  'SetAt': datetime.datetime(2018, 4, 20, 10, 40, 4, 841000),
  'SetById': 'ekaterina.karpenko',
  'SetByName': 'Екатерина Карпенко',
  'Stage': {
            'Label': 'Приглашение на интервью',
            'Order': 1,
            '_id': 'recruiterStage'
           },
  'Tags': None,
  'Type': 'InProgress',
  'Weight': 80,
  '_id': 'phoneInterview'
 }
]

# Convert the data into something pandas can read;
# in particular, flatten the nested 'Stage' dict into top-level keys.
# Note: pop() modifies the dicts in data_dict_array in place.
for data_dict in data_dict_array:
    d_temp = data_dict.pop("Stage")
    data_dict["Stage_Label"] = d_temp["Label"]
    data_dict["Stage_Order"] = d_temp["Order"]
    data_dict["Stage_id"] = d_temp["_id"]

df = pd.DataFrame(data_dict_array)

# Say we want to set Comment to "cool" where Name is 'В работе'.
# In .loc[], the first argument filters the rows and the second picks the column.
df.loc[df['Name'] == 'В работе', 'Comment'] = "cool"
print(df)
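As a possible alternative to flattening each nested dict by hand, pandas ships pd.json_normalize (available at the top level since pandas 1.0), which expands nested dicts into prefixed columns and leaves the input untouched. This is a sketch on shortened copies of the two records above. The variable names records and flat_df are made up, and the sep="_" choice is only there to get names like Stage_Label.

```python
import pandas as pd

# Shortened copies of the two status records from the question.
records = [
    {"Name": "Новый", "Details": None,
     "Stage": {"Label": "Новые", "Order": 0, "_id": "newStage"},
     "_id": "new"},
    {"Name": "В работе", "Details": {"Name": "взят в работу", "_id": 1},
     "Stage": {"Label": "Приглашение на интервью", "Order": 1, "_id": "recruiterStage"},
     "_id": "phoneInterview"},
]

# Nested dicts become columns named <parent><sep><child>; non-dict values are kept as they are.
flat_df = pd.json_normalize(records, sep="_")
print(flat_df.columns.tolist())
```

Because 'Details' is None in one record and a dict in the other, the result has both a plain Details column and expanded Details_* columns, so some cleanup may still be needed before loading into ClickHouse.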