How to slice a single DataFrame and create multiple pandas DataFrames

Time: 2020-11-12 04:37:52

Tags: python python-3.x pandas dataframe

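I am reading an Excel file with pandas. I want to create multiple DataFrames from the original DataFrame. Each DataFrame should be named after its header in row 1. Also, how can I skip the one blank column between each transaction?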

Expected result:

transaction_1:
name id available capacity completed all

transaction_2:
name id available capacity completed all

transaction_3:
name id available capacity completed all


1 Answer:

Answer 0: (score: 1)

You can try the following (works with pd.__version__ == 1.1.1):

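import pandas as pd

df = (pd.read_excel(
          # Two header rows become MultiIndex columns; the first two columns become the index.
          "capacity.xlsx", sheet_name="Sprint Details", header=[0, 1], index_col=[0, 1]
       )
      # Drop the blank spacer columns between the transactions.
      .dropna(axis=1, how="all")
      # Restore the index names and clear the stray column-level names.
      .rename_axis(index=["name", "id"], columns=[None, None]))

# Select each sub-table by its top-level header, then turn the index back into columns.
transaction_1 = df["transaction_1"].reset_index()
transaction_2 = df["transaction_2"].reset_index()
transaction_3 = df["transaction_3"].reset_index()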

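Essentially, we need to read the table as a DataFrame with a MultiIndex. The first two rows are our column names (header=[0, 1]). The first two columns are the index shared by each "sub-table" (index_col=[0, 1]).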

Because there is a blank column between the tables, we end up with columns that are entirely NaN, so we drop them with .dropna(axis=1, how="all").
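As a small illustration of that step, here is a sketch on an in-memory frame rather than the real workbook. The column names and values are made up for the example. It shows that how="all" removes only columns that are entirely NaN and keeps columns that are merely missing some values.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the sheet: two data columns separated by a blank one.
raw = pd.DataFrame({
    "available": [5, 8],
    "spacer": [np.nan, np.nan],  # the empty separator column
    "completed": [3, np.nan],    # only partially empty, so it must survive
})

# axis=1 works on columns; how="all" drops a column only if every value is NaN.
cleaned = raw.dropna(axis=1, how="all")
print(list(cleaned.columns))
```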

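Because pandas does not expect index names and column names to sit in the same row, it incorrectly parses the index column names ["name", "id"] as names of the second level of the column index. To fix this, we can manually assign the correct index names and remove the column index names at the same time, via rename_axis(index=["name", "id"], columns=[None, None]).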
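The effect of rename_axis can be seen on a toy frame. This is only a sketch: the frame below is built by hand, and the misplaced names on the column levels are an assumption that imitates the mis-parse, not the exact output read_excel produces on every pandas version.

```python
import pandas as pd

# Hand-built frame that imitates the mis-parse (an assumption for illustration):
# the index has no names, while the column levels carry the stray labels.
index = pd.MultiIndex.from_tuples([("alice", 1), ("bob", 2)])
columns = pd.MultiIndex.from_product(
    [["transaction_1", "transaction_2"], ["available", "capacity"]],
    names=["name", "id"],
)
broken = pd.DataFrame([[5, 10, 6, 12], [7, 14, 8, 16]], index=index, columns=columns)

# Name the index levels and clear the column-level names in one call.
fixed = broken.rename_axis(index=["name", "id"], columns=[None, None])
print(fixed.index.names, fixed.columns.names)
```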

Now that we have a well-formed table with MultiIndex columns, we can simply slice out each table and call .reset_index() on it, so that "name" and "id" become regular columns in every table.
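To see the slicing step without the Excel file, here is a minimal sketch on a hand-built frame with MultiIndex columns. The names and numbers are invented for the example. Selecting a top-level key drops that level, and reset_index() moves "name" and "id" into ordinary columns.

```python
import pandas as pd

# Hypothetical data shaped like the cleaned sheet.
index = pd.MultiIndex.from_tuples([("alice", 1), ("bob", 2)], names=["name", "id"])
columns = pd.MultiIndex.from_product(
    [["transaction_1", "transaction_2"], ["available", "capacity"]]
)
df = pd.DataFrame([[5, 10, 6, 12], [7, 14, 8, 16]], index=index, columns=columns)

# Selecting by the top-level header returns only that sub-table's columns.
transaction_1 = df["transaction_1"].reset_index()
print(transaction_1)
```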


Edit: It appears there are parsing differences between pandas versions.

Option 1. If you can modify the Excel sheet directly to include another row (to better separate the column names from the index names), this will give the most reliable results.

The following code then works:

df = (pd.read_excel(
          "capacity.xlsx", sheet_name="Sprint Details", header=[0, 1], index_col=[0, 1]
       )
      .dropna(axis=1, how="all"))

transaction_1 = df["transaction_1"].reset_index()
transaction_2 = df["transaction_2"].reset_index()
transaction_3 = df["transaction_3"].reset_index()

Option 2.

If you cannot modify the Excel file, then unfortunately we need a more involved approach.

df = pd.read_excel("capacity.xlsx", header=[0,1]).dropna(axis=1, how="all")
index = pd.MultiIndex.from_frame(df.iloc[:, :2].droplevel(0, axis=1))

df = df.iloc[:, 2:].set_axis(index)

transaction_1 = df["transaction_1"].reset_index()
transaction_2 = df["transaction_2"].reset_index()
transaction_3 = df["transaction_3"].reset_index()
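If the number of transactions is not known in advance, one possible variation is to collect the sub-tables in a dictionary keyed by the row-1 header instead of writing one variable per table. This is a sketch on a hand-built frame, not part of the approach above; the variable name tables and the sample data are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frame shaped like df after cleaning (MultiIndex columns and index).
index = pd.MultiIndex.from_tuples([("alice", 1), ("bob", 2)], names=["name", "id"])
columns = pd.MultiIndex.from_product(
    [["transaction_1", "transaction_2"], ["available", "capacity"]]
)
df = pd.DataFrame([[5, 10, 6, 12], [7, 14, 8, 16]], index=index, columns=columns)

# One DataFrame per distinct top-level header, in the order the headers appear.
tables = {
    name: df[name].reset_index()
    for name in df.columns.get_level_values(0).unique()
}
print(list(tables))
```

Each table is then available as tables["transaction_1"], tables["transaction_2"], and so on.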