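I am trying to find the min and max of every field returned by a SQL statement and write the results to a CSV file. My attempt so far is below. Could you please help? I have already written this in Python, but I am now trying to convert it to PySpark so that it can run directly on the Hadoop cluster.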
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import max, min, mean, stddev

sc = SparkContext()
hive_context = HiveContext(sc)

# bank = hive_context.table("cip_utilities.file_upload_temp")
data = hive_context.sql("select * from cip_utilities.cdm_variables_dict")

# Register the DESCRIBE output as a temporary table so it can be queried
hive_context.sql("describe cip_utilities.cdm_variables_dict").registerTempTable("schema_def")
temp_data = hive_context.sql("select * from schema_def")
temp_data.show()

# Names of all non-string columns
data1 = hive_context.sql("select col_name from schema_def where data_type<>'string'")
colum_names_as_python_list_of_rows = data1.collect()
# data1.show()

for line in colum_names_as_python_list_of_rows:
    # print value in MyCol1 for each row
    # TODO: here I need to calculate min, max, mean, etc. for the
    # particular field supplied by the for loop
    pass
Answer 0 (score: 4)
There are several functions you can use to find the minimum and maximum. One way to get these details for DataFrame columns is the agg function.
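However, you can also explore the describe and summary (available since version 2.3) functions to get basic statistics for the individual columns of a DataFrame.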
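As a rough illustration of the summary route, the sketch below assumes Spark 2.3 or later and uses hypothetical helper names (`summary_stats`, `pivot_summary`). `DataFrame.summary` returns one row per statistic, with a `summary` label column and all values as strings, so the second helper pivots that into one entry per column.

```python
def pivot_summary(records, columns):
    """Turn [{'summary': 'min', 'a': '1', ...}, ...] into {'a': {'min': '1', ...}, ...}."""
    return {c: {rec["summary"]: rec[c] for rec in records} for c in columns}


def summary_stats(df, columns, stats=("min", "max", "mean", "stddev")):
    """Collect the requested statistics via DataFrame.summary (Spark 2.3+)."""
    summary_df = df.select(*columns).summary(*stats)
    # The result is tiny (one row per statistic), so collecting is safe.
    records = [row.asDict() for row in summary_df.collect()]
    return pivot_summary(records, columns)
```

On older Spark versions, `df.describe(*columns)` gives a similarly shaped result with a fixed set of statistics (count, mean, stddev, min, max). If the output should stay on the cluster rather than on the driver, the summary DataFrame itself can be saved with `summary_df.coalesce(1).write.csv(path, header=True)`.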
Hope this helps.
Regards,
Neeraj