Exploding multiple columns of a Python DataFrame

Time: 2020-05-10 12:39:09

Tags: python-3.x dataframe

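I need to explode a DataFrame. I used this code from MaxU, but it fails with an error I can't fix: a ValueError. The data is loaded from a JSON file, and the JSON is well formed. My DataFrame structure: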

  CAMP_ID          DUMP_DATE          action         code                                               name price_amount rev_cat reward_amount    reward_names                                  reward_type reward_validity validity
0  158034  20200415 17:30:16            USSD  *21291*189#  Mar20_ACLM_D_USSD_Tk189_8GB_30D_MdUsrNCCN_23-1...          189      Dt        [8192]  [Airtel Bonus]         [Debit Revenue Info, ADCS Data Pack]            [30]       28
1  158132  20200415 17:30:16  Store Recharge          NaN                 3.5GB 28Days @TK209 Bkash Recharge          209     NaN            []              []                       [No Reward, No Reward]              []        -
2  158056  20200415 17:30:17         Monitor          NaN                                       default-name            9      DT            []    [Robi Bonus]  [ADCS Hourly Data pack, Debit Revenue Info]              []        1
3  158041  20200415 17:30:16  Store Recharge          NaN                    50p(+tax)/sec, 7day @ 21tk load           21     NaN            []              []                       [No Reward, No Reward]              []        -
4  158090  20200415 17:30:16  Store Recharge          NaN                       50MB bonus(3days) @22tk load           22     NaN            []              []                       [No Reward, No Reward]              []        -

Code:

#!/usr/local/bin/python3.6

import json

import numpy as np
import pandas as pd

def explode(df, lst_cols, fill_value='', preserve_index=False):
    # make sure `lst_cols` is list-alike
    if (lst_cols is not None
        and len(lst_cols) > 0
        and not isinstance(lst_cols, (list, tuple, np.ndarray, pd.Series))):
        lst_cols = [lst_cols]
    # all columns except `lst_cols`
    idx_cols = df.columns.difference(lst_cols)
    # calculate lengths of lists
    lens = df[lst_cols[0]].str.len()
    # preserve original index values
    idx = np.repeat(df.index.values, lens)
    # create "exploded" DF
    res = (pd.DataFrame({
                col:np.repeat(df[col].values, lens)
                for col in idx_cols},
                index=idx).assign(**{col:np.concatenate(df.loc[lens>0, col].values) for col in lst_cols}))
    # append those rows that have empty lists
    if (lens == 0).any():
        # at least one list in cells is empty
        res = (res.append(df.loc[lens==0, idx_cols], sort=False)
                  .fillna(fill_value))
    # revert the original index order
    res = res.sort_index()
    # reset index if requested
    if not preserve_index:
        res = res.reset_index(drop=True)
    return res


if __name__=="__main__":
    with open('/edatamsc/sa_msc/data/icms_campaign_dump_reward_list.json','r') as rfh:
        data = json.load(rfh)
        df = pd.DataFrame(data)
        lst = ['reward_type','reward_amount','reward_names','reward_validity']
        df=explode(df,lst,fill_value='',preserve_index=True)
        print(df)
        with open('csv_data.psv','w') as wfh:
            df.to_csv(wfh,sep='|', encoding='utf-8')

Error log

   Traceback (most recent call last):
    File "./make_csv_data.py", line 44, in <module>
        df=explode(df,lst,fill_value='',preserve_index=True)
      File "./make_csv_data.py", line 25, in explode
        index=idx).assign(**{col:np.concatenate(df.loc[lens>0, col].values) for col in lst_cols}))
      File "/usr/local/lib/python3.6/site-packages/pandas/core/frame.py", line 3307, in assign
        data[k] = com._apply_if_callable(v, data)
      File "/usr/local/lib/python3.6/site-packages/pandas/core/frame.py", line 3116, in __setitem__
        self._set_item(key, value)
      File "/usr/local/lib/python3.6/site-packages/pandas/core/frame.py", line 3191, in _set_item
        value = self._sanitize_column(key, value)
      File "/usr/local/lib/python3.6/site-packages/pandas/core/frame.py", line 3388, in _sanitize_column
        value = _sanitize_index(value, self.index, copy=False)
      File "/usr/local/lib/python3.6/site-packages/pandas/core/series.py", line 3998, in _sanitize_index
        raise ValueError('Length of values does not match length of ' 'index')
    ValueError: Length of values does not match length of index
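
One possible reading of the traceback: `explode` sizes the new index from the list lengths of `lst_cols[0]` (`reward_type`) only, while the printed frame shows other list columns with a different length in the same row (in row 0, `reward_type` has two items and `reward_amount` has one). A minimal sketch to check for this, using a hypothetical two-row sample that mirrors rows 0 and 1 above rather than the real JSON data:

```python
import pandas as pd

# Hypothetical sample modelled on rows 0 and 1 of the printed DataFrame;
# the element types in the real JSON data are an assumption.
df = pd.DataFrame({
    "CAMP_ID": [158034, 158132],
    "reward_type": [["Debit Revenue Info", "ADCS Data Pack"],
                    ["No Reward", "No Reward"]],
    "reward_amount": [[8192], []],
    "reward_names": [["Airtel Bonus"], []],
    "reward_validity": [[30], []],
})
lst = ["reward_type", "reward_amount", "reward_names", "reward_validity"]

# Per-row list length for every list column
lens = df[lst].apply(lambda s: s.map(len))

# Rows where the list columns do not all have the same length
mismatched = lens[lens.nunique(axis=1) > 1]
print(mismatched)

# Total number of values each column would contribute after concatenation
print(lens.sum())
```

If `mismatched` is non-empty on the real data, the concatenated values of those columns cannot match the length of the index built from `reward_type`, which would be consistent with the ValueError.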

What could be causing this problem?
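
If unequal list lengths within a row turn out to be the cause, one option is to pad the shorter lists before exploding. The sketch below is not MaxU's function: `explode_ragged` is a hypothetical helper, and padding with `fill_value` is an assumption about the desired output that may not suit the real data.

```python
from itertools import zip_longest

import pandas as pd


def explode_ragged(df, lst_cols, fill_value=""):
    """Hypothetical helper: explode list columns whose per-row lengths differ."""
    other_cols = [c for c in df.columns if c not in lst_cols]
    index, records = [], []
    for idx, row in df.iterrows():
        # Treat non-list cells (e.g. NaN) as empty lists
        lists = [row[c] if isinstance(row[c], (list, tuple)) else []
                 for c in lst_cols]
        # zip_longest pads the shorter lists with fill_value;
        # a row whose lists are all empty still yields one output row
        combos = list(zip_longest(*lists, fillvalue=fill_value))
        if not combos:
            combos = [(fill_value,) * len(lst_cols)]
        for combo in combos:
            rec = {c: row[c] for c in other_cols}
            rec.update(zip(lst_cols, combo))
            index.append(idx)
            records.append(rec)
    # Original index values are repeated, one per exploded row
    return pd.DataFrame(records, index=index, columns=list(df.columns))


# Hypothetical sample modelled on rows 0 and 1 of the printed DataFrame
sample = pd.DataFrame({
    "CAMP_ID": [158034, 158132],
    "reward_type": [["Debit Revenue Info", "ADCS Data Pack"],
                    ["No Reward", "No Reward"]],
    "reward_amount": [[8192], []],
})
out = explode_ragged(sample, ["reward_type", "reward_amount"])
print(out)
```

Iterating row by row is slower than the vectorised `np.repeat` approach, but it does not require every list column to share the same length per row.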

0 answers:

No answers