Question

目前，我有以下情况。

Excel Data Frame =            SQL Data Frame =
________                          ________ _______ ___________ _________
|sector|                          |sector| | hour| | value_cs| value_ps|
--------                          -------- ------- ----------- ---------
 AXYZ                              AXYZ      0        78.90      87.10
 BYYT                              RACH      0        87.12      13.90
 IOPL                              IOPL      0        93.10      13.87
 XFTR                              AXYZ      1        27.90      12.87
 MANU                              IOPL      1        23.09      90.09
                                   FRES      2        34.09      12.34
                                   YYYT      2        12.43      32.98
                                   REWT      3        98.09      99.99

我有一个Excel文件和一组SQL结果，我想比较Excel文件中扇区列的每个值与SQL结果中Sector列的所有值，结果，如果这两列的值之间存在匹配，则将SQL结果中的小时， value_cs 和 value_ps 列添加到新的列中数据帧。 注意： SQL结果的数据与Excel文件的数据大小不同。

期望的结果

 New data frame 1 for value cs
  ________ ____    ___    ___    ___    ___    ___    ___        ____                     
  |sector|  |0|    |1|    |2|    |3|    |4|    |5|    |6|   .... |23|
  -------- ----    ---    ----   ---    ---    ---    ----       ----                            
   AXYZ    78.90   27.90  78.89  54.90  98.23  85.0   45.90      68.23
   BYYT    18.94   67.10  65.69  76.32  76.56  56.03  56.23      87.65
   IOPL    93.10   23.09  34.29  97.34  34.34  14.54  34.91      23.21
   ...      ...

 New data frame 2 for value ps
  ________ ____    ___    ___    ___    ___    ___    ___        ____                     
  |sector|  |0|    |1|    |2|    |3|    |4|    |5|    |6|   .... |23|
  -------- ----    ---    ----   ---    ---    ---    ----       ----                            
   AXYZ    87.10   12.87  49.89  84.90  76.23  15.01  12.90      68.23
   BYYT    28.43   27.11  54.69  57.12  19.56  45.12  45.23      47.15
   IOPL    13.87   90.09  24.19  47.34  18.34  21.54  67.11      13.61
   ...      ...

我遵循的方法是将SQL结果转换为数据框以及Excel文件中的数据，但我不知道如何在没有for循环的情况下执行比较，而只使用Pandas（for循环）将花费太多时间来执行计算。）

import pandas as pd
import pypyodbc
from datetime import date

def get_and_compare():

    start_date = date.today()

    retrieve_values = "[DEV].[CS].[QA_Export] @start_date='{start_date:%Y-%m-%d}'".format(start_date=start_date)

    # Connect to the database
    db_connection = pypyodbc.connect(driver="{SQL Server}", server="xxx.xxx.xxx.xxx", uid="xxx",
                                         pwd="xxx", Trusted_Connection="No")

    # Get the sql result into dataframe
    data_frame_sql = pd.read_sql(retrieve_values,db_connection)


    #declare new data frames
    new_df_one = pd.DataFrame(columns=['sector', 'value cs', 'hour 0', 'hour 1', 'hour 2', 'hour 3', 'hour 4',
                                   'hour 5', 'hour 6', 'hour 7', 'hour 8', 'hour 9', 'hour 10', 'hour 11',
                                   'hour 12', 'hour 13', 'hour 14', 'hour 15', 'hour 16', 'hour 17', 'hour 18',
                                   'hour 19', 'hour 20', 'hour 21', 'hour 22', 'hour 23'])

    new_df_two = pd.DataFrame(columns=['sector', 'value ps', 'hour 0', 'hour 1', 'hour 2', 'hour 3', 'hour 4',
                                   'hour 5', 'hour 6', 'hour 7', 'hour 8', 'hour 9', 'hour 10', 'hour 11',
                                   'hour 12', 'hour 13', 'hour 14', 'hour 15', 'hour 16', 'hour 17', 'hour 18',
                                   'hour 19', 'hour 20', 'hour 21', 'hour 22', 'hour 23'])


    # Read the Excel file
    current_wb = pd.ExcelFile \
    ("C:\\U\\dev\\testing\\Main values to compare.xlsx")

    # Get the specific sheet to compare
    working_values = current_wb.parse("Main values")

    #Get the column from Excel
    sector_from_excel  = working_values['sector']

    #Comparison to perform
    #.... unknown part

所有建议，评论将不胜感激，以帮助我完成这部分代码。

Answer 1

试试这个：

def get_and_compare():

    start_date = date.today()

    retrieve_values = "[DEV].[CS].[QA_Export] @start_date='{start_date:%Y-%m-%d}'".format(start_date=start_date)

    # Connect to the database
    db_connection = pypyodbc.connect(driver="{SQL Server}", server="xxx.xxx.xxx.xxx", uid="xxx",
                                         pwd="xxx", Trusted_Connection="No")

    # Get the sql result into dataframe
    data_frame_sql = pd.read_sql(retrieve_values,db_connection)


    # Read the Excel file
    current_wb = pd.ExcelFile \
    ("C:\\U\\dev\\testing\\Main values to compare.xlsx")

    # Get the specific sheet to compare
    working_values = current_wb.parse("Main values")

    #Get the column from Excel
    sector_from_excel  = working_values['sector']

    # perform inner join between DataFrames
    # note: this requires that "sector" is a column (and not an index)
    # in both DataFrames, and that it is also named as "sector" in each
    merged_df = data_frame_sql.merge(sector_from_excel, how="inner", on="sector")

    # use "pivot" to reshape data from wide to long
    # first with value_cs
    cs_value_df = merged_df.pivot(index="sector", columns="hour", values="value_cs")

    # and then with value_ps
    ps_value_df = merged_df.pivot(index="sector", columns="hour", values="value_ps")

    # I'd suggest returning both DataFrames in a single object;
    # in this case I'm using a dict
    return {"value cs": cs_value_df, "value ps": ps_value_df}

对于它的价值，我建议将此功能拆分为多个功能，可能一个用于生成SQL查询，一个用于读取Excel文件，另一个用于执行Pandas操作。将这么多动作写入一个函数并不是一个好习惯 - 如果有必要的话，调试会很繁琐。

如何将一个值与Python中的多个值与Pandas库进行比较？

1 个答案: