Question

I am using Google OR-Tools, and one of its examples provides a data structure. I would like to build this data structure from an Excel sheet.

This is the given data structure:

         SELECT  
            pr.product_id
            ,pr.price set_price
            ,pr.start_date -- the timestamp of price change
            ,date(max(pr.start_date))
            ,@'max_change' := max(pr.start_date)

        FROM prices pr

        where product_id in ( -- get the id's of specific products
                            ....                
            )

        group by pr.product_id

        having date(pr.start_date) = date(@max_change)

What I want to do is build jobs = [[[(3, 0), (1, 1), (5, 2)], [(2, 0), (4, 1), (6, 2)], [(2, 0), (3, 1), (1, 2)]], [[(2, 0), (3, 1), (4, 2)], [(1, 0), (5, 1), (4, 2)], [(2, 0), (1, 1), (4, 2)]], [[(2, 0), (1, 1), (4, 2)], [(2, 0), (3, 1), (4, 2)], [(3, 0), (1, 1), (5, 2)]]] from an Excel sheet containing data like the following:

jobs
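
For reference, the nested list can be read in three levels: jobs, then tasks within each job, then (value, machine index) pairs within each task. The sketch below only walks the literal from the question. The `flatten_jobs` helper is hypothetical, and reading the first tuple element as a per-machine value such as a processing time is an assumption.

```python
# The literal from the question: jobs -> tasks -> (value, machine_index) pairs.
jobs = [
    [[(3, 0), (1, 1), (5, 2)], [(2, 0), (4, 1), (6, 2)], [(2, 0), (3, 1), (1, 2)]],
    [[(2, 0), (3, 1), (4, 2)], [(1, 0), (5, 1), (4, 2)], [(2, 0), (1, 1), (4, 2)]],
    [[(2, 0), (1, 1), (4, 2)], [(2, 0), (3, 1), (4, 2)], [(3, 0), (1, 1), (5, 2)]],
]


def flatten_jobs(jobs):
    """Turn the nested list into flat (job, task, machine, value) records."""
    records = []
    for job_id, job in enumerate(jobs):
        for task_id, task in enumerate(job):
            for value, machine in task:
                records.append((job_id, task_id, machine, value))
    return records


records = flatten_jobs(jobs)
for record in records[:3]:
    print(record)
```

Seen this way, each spreadsheet row would correspond to one task, with one column per machine.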

Answer 1

You should reorganize all of your data and group it by Job. For example:

yarn_conf = SparkConf().setAppName(_app_name) \
                    .setMaster("yarn") \
                    .set("spark.executor.memory", "4g") \
                    .set("spark.hadoop.fs.defaultFS", "hdfs://{}:8020".format(_fs_host)) \
                    .set("spark.hadoop.yarn.resourcemanager.hostname", _rm_host)\
                    .set("spark.hadoop.yarn.resourcemanager.address", "{}:8050".format(_rm_host))

Warning: the result is not exactly the same as what you wrote. I think you got some of the data wrong. Let me know if I am mistaken.
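
The regrouping idea can be sketched in plain Python. This is only a minimal illustration, assuming each spreadsheet row has already been read into a tuple of (Job, M1, M2, M3). The sample rows and the `group_rows_by_job` helper are hypothetical, not taken from the original sheet.

```python
from collections import defaultdict

# Hypothetical rows as they might come out of the sheet: (Job, M1, M2, M3).
rows = [
    (1, 3, 1, 5),
    (1, 2, 4, 6),
    (2, 2, 3, 4),
    (2, 1, 5, 4),
]


def group_rows_by_job(rows):
    """Group rows by job and pair each value with its machine index."""
    grouped = defaultdict(list)
    for job, *values in rows:
        # enumerate supplies the machine index (0 for M1, 1 for M2, ...).
        task = [(value, machine) for machine, value in enumerate(values)]
        grouped[job].append(task)
    # Sort by job key so the outer list has a stable order.
    return [grouped[job] for job in sorted(grouped)]


jobs = group_rows_by_job(rows)
print(jobs)
```

Each row becomes one inner list of (value, machine index) pairs, and rows that share a Job value end up in the same outer entry.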

Answer 2

Here is another answer, using the pandas groupby API:

import pandas as pd

df = pd.read_excel('bb.xlsx')

result = [[[ (row['M1'],0), (row['M2'],1), (row['M3'],2) ] for idx, row in grpdf.iterrows()] for grpname, grpdf in df.groupby('Job')]
print(result)
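
To try the same approach without the Excel file, the sketch below builds a small DataFrame in memory. It assumes the sheet has a `Job` column plus one column per machine (`M1`, `M2`, `M3`), as the code above implies. The sample values and the `machine_columns` list are hypothetical. Wrapping each value in `int(...)` is an optional extra that keeps NumPy integer types out of the printed result.

```python
import pandas as pd

# Hypothetical stand-in for pd.read_excel('bb.xlsx'): one row per task,
# with the Job label and one column per machine (M1, M2, M3).
df = pd.DataFrame({
    "Job": [1, 1, 2, 2],
    "M1": [3, 2, 2, 1],
    "M2": [1, 4, 3, 5],
    "M3": [5, 6, 4, 4],
})

machine_columns = ["M1", "M2", "M3"]

result = [
    [
        # int(...) converts NumPy integers to plain Python ints.
        [(int(row[col]), machine) for machine, col in enumerate(machine_columns)]
        for _, row in group.iterrows()
    ]
    for _, group in df.groupby("Job")
]
print(result)
```

Listing the machine columns once means that adding a fourth machine only requires extending `machine_columns`.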

How do I create a data frame from an Excel sheet?
