PySpark: How do I compute the min and max of each field using PySpark?

Time: 2018-11-20 09:30:14

Tags: python-3.x apache-spark pyspark apache-spark-sql pyspark-sql

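I am trying to find the min and max of each field returned by a SQL statement and write the results to a CSV file. My attempt so far is shown in the code further down. Could you please help? I have already written this in Python, but I am now trying to convert it to PySpark so that it runs directly on the Hadoop cluster.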


from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import max, min, mean, stddev

sc = SparkContext()
hive_context = HiveContext(sc)
# bank = hive_context.table("cip_utilities.file_upload_temp")
data = hive_context.sql("select * from cip_utilities.cdm_variables_dict")
# Register the table's schema as a temp table so it can be queried with SQL
hive_context.sql("describe cip_utilities.cdm_variables_dict").registerTempTable("schema_def")
temp_data = hive_context.sql("select * from schema_def")
temp_data.show()
# Keep only the names of the non-string columns
data1 = hive_context.sql("select col_name from schema_def where data_type<>'string'")
colum_names_as_python_list_of_rows = data1.collect()
# data1.show()
for line in colum_names_as_python_list_of_rows:
    # print value in MyCol1 for each row
    # --- Here I need to calculate min, max, mean etc. for the particular field supplied by the for loop
    pass

1 answer:

Answer 0: (score: 4)

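You can find the minimum and maximum values with several different functions. One way to get these details for DataFrame columns is to use the agg function.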
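A minimal sketch of this kind of aggregation, assuming a DataFrame `df` and a list of numeric column names. Instead of calling `agg` with `pyspark.sql.functions.min` and `max`, it builds SQL aggregate expressions and passes them to `selectExpr`, which Spark evaluates as one global aggregation over the whole DataFrame. The helper names `stat_exprs` and `column_stats` are hypothetical, not part of PySpark.

```python
def stat_exprs(columns, funcs=("min", "max", "mean", "stddev")):
    """Build SQL aggregate expressions such as 'min(`age`) AS `age_min`'."""
    return [
        f"{func}(`{col}`) AS `{col}_{func}`"
        for col in columns
        for func in funcs
    ]


def column_stats(df, columns):
    """Return a one-row DataFrame holding every statistic for every column.

    `df` is assumed to be a pyspark.sql.DataFrame; all expressions are
    evaluated in a single pass rather than one query per column.
    """
    return df.selectExpr(*stat_exprs(columns))
```

With the names from the question, the column list could be taken as `[row.col_name for row in data1.collect()]`, and the result of `column_stats(data, numeric_cols)` could be written out with `.write.csv(path, header=True)`, where `path` is whatever output location you choose.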


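However, you can also explore the describe and summary (available since version 2.3) functions to get basic statistics for the individual columns of a DataFrame.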
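As a rough sketch of that alternative, assuming `df` is a PySpark DataFrame and `columns` lists the numeric fields: the hypothetical helper `basic_stats` below uses `summary` when the DataFrame provides it and falls back to `describe` otherwise.

```python
def basic_stats(df, columns, stats=("count", "mean", "stddev", "min", "max")):
    """Return one row per statistic and one column per input column.

    `df` is assumed to be a pyspark.sql.DataFrame.
    """
    subset = df.select(*columns)
    if hasattr(subset, "summary"):
        # DataFrame.summary (Spark 2.3+) lets the caller choose the statistics
        return subset.summary(*stats)
    # DataFrame.describe always returns count, mean, stddev, min and max
    return subset.describe()
```

Both methods return the statistics as strings, with the statistic name in a leading `summary` column, so cast the values if you need numbers downstream.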

Hope this helps.

Regards,

Neeraj