Pass a list of row numbers to 2 dataframes, tag the rows, and add them to a new dataframe

Time: 2018-05-24 20:38:54

Tags: python pandas dataframe

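This is all in Python pandas.

I have a list of integers called different_rows that holds the row number of every row that differs between two dataframes. In this case, one dataframe holds data from Netezza and the other holds data from Oracle (prepped_net_df, prepped_ora_df).

I am trying to pass each row number to the original dataframes to pull out that row of data. I want to tag the row so I know which dataframe it came from (i.e. Netezza or Oracle), and then add that row (a Series) to a new dataframe. Each int in different_rows needs to be passed to both the Netezza and the Oracle dataframe.

The code below works, but the problem is that it runs very slowly.

I have two questions.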

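  1. What is the best way to test which line is causing the slowdown?
  2. Is there a way to optimize this code? It works fine for 10-1000 rows, but sometimes my dataframes have tens of thousands of rows.

Thanks for your time. Any help is appreciated.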

    import time

    import pandas as pd

    net_dict = {'ACCTG_DATE': ['2012-01-01 00:00:00', '2012-01-02 00:00:00', '2012-01-03 00:00:00', '2012-01-04 00:00:00' ], 'JRNL_ID_NO': ['00349-CAS','00350-CAS','00351-CAS','00352-CAS' ], 'JRNL_SEQ_NO': [43970,43971,43972,43973], 'ACCT_CODE': [8500016,8500017,8500018,8500019], 'BAL_BOOK_CODE': [8591,8592,8593,8594], 'PROD_CODE': ['12F7', '12F8', '12F9', '12G0'], 'SUSPENSE_SEQ_NO': [0, 1, 2, 3 ], 'TRAN_AMT': [8900.29, 8901.29, 8902.29, 8903.29], 'CENTER_CODE': ['', '', '', ''], 'BASIS_TYPE': ['C', 'C', 'C', 'C'], 'UPDATE_TSTP':['2011-12-31 00:00:00', '2012-01-01 00:00:00', '2012-01-02 00:00:00', '2012-01-03 00:00:00']}
    ora_dict = {'ACCTG_DATE': ['2012-01-01 00:00:00', '2012-01-02 00:00:00', '2012-01-04 00:00:00', '2012-01-04 00:00:00' ], 'JRNL_ID_NO': ['00349-CAS','00350-CAS','00351-CAS','00353-CAS' ], 'JRNL_SEQ_NO': [43970,43971,43972,43973], 'ACCT_CODE': [8500016,8500017,8500018,8500019], 'BAL_BOOK_CODE': [8591,8592,8593,8594], 'PROD_CODE': ['12F7', '12F8', '12F9', '12G0'], 'SUSPENSE_SEQ_NO': [0, 1, 2, 3 ], 'TRAN_AMT': [8900.29, 8901.29, 8903, 8903.29], 'CENTER_CODE': ['', '', '', ''], 'BASIS_TYPE': ['C', 'C', 'C', 'C'], 'UPDATE_TSTP':['2011-12-31 00:00:00', '2012-01-01 00:00:00', '2012-01-02 00:00:00', '2012-01-03 00:00:00']}
    
    different_rows = [2, 3]
    
    prepped_net_df = pd.DataFrame(data=net_dict)
    prepped_ora_df = pd.DataFrame(data=ora_dict)
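    # infer_objects() returns a new DataFrame, so assign the result back
    prepped_net_df = prepped_net_df.infer_objects()
    prepped_ora_df = prepped_ora_df.infer_objects()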
    
    row_compare_df = pd.DataFrame()
    
    if different_rows is not None:
        start = time.perf_counter()  # time.clock() was removed in Python 3.8
        for val in different_rows:
            print('processed: val - ', val)
            # explicit copy avoids a possible SettingWithCopyWarning when adding 'Source'
            net_series = prepped_net_df.iloc[val].copy()
            net_series.loc['Source'] = "Netezza"
            # DataFrame.append was removed in pandas 2.0; this loop needs pandas < 2.0
            row_compare_df = row_compare_df.append(net_series)
            ora_series = prepped_ora_df.iloc[val].copy()
            ora_series.loc['Source'] = "Oracle"
            row_compare_df = row_compare_df.append(ora_series)
        elapsed = time.perf_counter() - start
        print("Cell has run completely. It took " + str(round(elapsed, 2)) + " seconds")
    else:
        print("There were no rows reported with differences")
    

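1 Answer:

Answer 0: (score: 1)

You don't need to loop over the list. All you need to do is pass the whole list to iloc to pull those rows from each of your dfs, then concatenate them as shown below: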

    net = prepped_net_df.iloc[different_rows].assign(Source='Netezza')
    ora = prepped_ora_df.iloc[different_rows].assign(Source='Oracle')
    row_compare_df = pd.concat([net, ora], ignore_index=True)
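
One difference from the loop in the question is worth noting: the concat above returns all Netezza rows followed by all Oracle rows, whereas the loop interleaved them. The following is a minimal sketch of one way to get the interleaved order back. It assumes both frames share the same default integer index; the cut-down frames and the helper name `build_row_compare` are made up for illustration and are not part of the original answer.

```python
import pandas as pd

# Cut-down stand-ins for prepped_net_df / prepped_ora_df (illustrative only)
net_df = pd.DataFrame({
    "JRNL_ID_NO": ["00349-CAS", "00350-CAS", "00351-CAS", "00352-CAS"],
    "TRAN_AMT": [8900.29, 8901.29, 8902.29, 8903.29],
})
ora_df = pd.DataFrame({
    "JRNL_ID_NO": ["00349-CAS", "00350-CAS", "00351-CAS", "00353-CAS"],
    "TRAN_AMT": [8900.29, 8901.29, 8903.00, 8903.29],
})
different_rows = [2, 3]


def build_row_compare(net, ora, rows):
    """Hypothetical helper: select rows by position, tag the source, interleave."""
    net_part = net.iloc[rows].assign(Source="Netezza")
    ora_part = ora.iloc[rows].assign(Source="Oracle")
    # Keep the original row labels so matching rows can be paired up
    combined = pd.concat([net_part, ora_part])
    # mergesort is stable, so within each label Netezza stays ahead of Oracle
    return combined.sort_index(kind="mergesort").reset_index(drop=True)


row_compare_df = build_row_compare(net_df, ora_df, different_rows)
print(row_compare_df)
```

On the first question, the standard-library `cProfile` module, or simply wrapping each statement in `time.perf_counter()` calls, can show where the time goes. In the original loop the likely culprit is the repeated `append`, since each call copies the whole accumulated dataframe.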