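This is all in Python pandas.
I have a list of integers called different_rows that contains the row number of every row that differs between two DataFrames. In this case, one DataFrame holds data from Netezza and the other holds data from Oracle (prepped_net_df, prepped_ora_df).
I am trying to pass those row numbers to the original DataFrames to pull out the corresponding rows. I want to tag each row with a label showing which DataFrame it came from (i.e. Netezza or Oracle), and then add that row (a Series) to a new DataFrame. Each int from different_rows needs to be passed to both the Netezza and the Oracle DataFrame.
The following code works, but the problem is that it runs very slowly.
I have two questions.
Thank you for your time. Any help is appreciated.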
import time

import pandas as pd

net_dict = {'ACCTG_DATE': ['2012-01-01 00:00:00', '2012-01-02 00:00:00', '2012-01-03 00:00:00', '2012-01-04 00:00:00' ], 'JRNL_ID_NO': ['00349-CAS','00350-CAS','00351-CAS','00352-CAS' ], 'JRNL_SEQ_NO': [43970,43971,43972,43973], 'ACCT_CODE': [8500016,8500017,8500018,8500019], 'BAL_BOOK_CODE': [8591,8592,8593,8594], 'PROD_CODE': ['12F7', '12F8', '12F9', '12G0'], 'SUSPENSE_SEQ_NO': [0, 1, 2, 3 ], 'TRAN_AMT': [8900.29, 8901.29, 8902.29, 8903.29], 'CENTER_CODE': ['', '', '', ''], 'BASIS_TYPE': ['C', 'C', 'C', 'C'], 'UPDATE_TSTP':['2011-12-31 00:00:00', '2012-01-01 00:00:00', '2012-01-02 00:00:00', '2012-01-03 00:00:00']}
ora_dict = {'ACCTG_DATE': ['2012-01-01 00:00:00', '2012-01-02 00:00:00', '2012-01-04 00:00:00', '2012-01-04 00:00:00' ], 'JRNL_ID_NO': ['00349-CAS','00350-CAS','00351-CAS','00353-CAS' ], 'JRNL_SEQ_NO': [43970,43971,43972,43973], 'ACCT_CODE': [8500016,8500017,8500018,8500019], 'BAL_BOOK_CODE': [8591,8592,8593,8594], 'PROD_CODE': ['12F7', '12F8', '12F9', '12G0'], 'SUSPENSE_SEQ_NO': [0, 1, 2, 3 ], 'TRAN_AMT': [8900.29, 8901.29, 8903, 8903.29], 'CENTER_CODE': ['', '', '', ''], 'BASIS_TYPE': ['C', 'C', 'C', 'C'], 'UPDATE_TSTP':['2011-12-31 00:00:00', '2012-01-01 00:00:00', '2012-01-02 00:00:00', '2012-01-03 00:00:00']}
different_rows = [2, 3]
prepped_net_df = pd.DataFrame(data=net_dict)
prepped_ora_df = pd.DataFrame(data=ora_dict)
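# infer_objects() returns a new DataFrame, so the result has to be assigned back
prepped_net_df = prepped_net_df.infer_objects()
prepped_ora_df = prepped_ora_df.infer_objects()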
row_compare_df = pd.DataFrame()
if different_rows is not None:
    start = time.perf_counter()  # time.clock() was removed in Python 3.8
    for val in different_rows:
        print('processed: val - ', val)
        # Copy the row so that adding 'Source' does not touch the original frame
        net_series = prepped_net_df.iloc[val].copy()
        net_series.loc['Source'] = "Netezza"
        # DataFrame.append() was removed in pandas 2.0; pd.concat() replaces it here
        row_compare_df = pd.concat([row_compare_df, net_series.to_frame().T])
        ora_series = prepped_ora_df.iloc[val].copy()
        ora_series.loc['Source'] = "Oracle"
        row_compare_df = pd.concat([row_compare_df, ora_series.to_frame().T])
    end = time.perf_counter() - start
    print("Cell has run completely. It took " + str(round(end, 2)) + " seconds")
else:
    print("There were no rows reported with differences")
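Answer 0 (score: 1)
You don't need to loop over the list. Just pass the whole list to iloc to get the rows from each of your DataFrames, then concatenate the results as shown below: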
net = prepped_net_df.iloc[different_rows].assign(Source='Netezza')
ora = prepped_ora_df.iloc[different_rows].assign(Source='Oracle')
row_compare_df = pd.concat([net, ora], ignore_index=True)
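One difference from the original loop: the concat above puts all Netezza rows first and all Oracle rows after them, whereas the loop placed each Netezza row directly above its Oracle counterpart. If that pairing matters, a possible sketch is to carry the row number along and sort on it. The frames below are cut-down stand-ins for prepped_net_df and prepped_ora_df, and the names tag_rows and ROW_NO are made up for this illustration only.

```python
import pandas as pd

# Hypothetical miniature frames standing in for prepped_net_df / prepped_ora_df
net_df = pd.DataFrame({
    "JRNL_ID_NO": ["00349-CAS", "00350-CAS", "00351-CAS", "00352-CAS"],
    "TRAN_AMT": [8900.29, 8901.29, 8902.29, 8903.29],
})
ora_df = pd.DataFrame({
    "JRNL_ID_NO": ["00349-CAS", "00350-CAS", "00351-CAS", "00353-CAS"],
    "TRAN_AMT": [8900.29, 8901.29, 8903.00, 8903.29],
})
different_rows = [2, 3]


def tag_rows(df, rows, source):
    """Select rows by position, record that position, and label the source."""
    # ROW_NO is an illustrative helper column, not part of the original data
    return df.iloc[rows].assign(ROW_NO=rows, Source=source)


net = tag_rows(net_df, different_rows, "Netezza")
ora = tag_rows(ora_df, different_rows, "Oracle")

# A stable sort keeps Netezza ahead of Oracle within each row number,
# because that is the order in which the two frames were concatenated.
row_compare_df = (
    pd.concat([net, ora], ignore_index=True)
    .sort_values("ROW_NO", kind="stable")
    .reset_index(drop=True)
)
print(row_compare_df)
```

Everything here is still done in a handful of vectorized calls, so it should avoid the per-row cost of growing a DataFrame inside a loop. Drop the ROW_NO column afterwards if you don't need it in the output.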