How to change column names with sqlContext.sql in PySpark

Asked: 2017-04-03 00:16:31

Tags: csv dataframe pyspark

I run a query with sqlContext.sql. The result is a DataFrame, but the column names do not reflect what I am trying to express:

test=("SELECT SUBJECT_ID ,DRUG, COUNT(*), SUM(DOSE_VAL_RX) AS Dosage\
from sel_meds_pats_icustays \
GROUP BY SUBJECT_ID , DRUG ")
test_df = sqlContext.sql(test).withColumnRenamed('_c1', 'count')

The resulting DataFrame displays like this, even though I tried to rename the column:

 [Row(SUBJECT_ID=6, DRUG=u'Syringe (IV Room)', count(1)=3, sum(CAST(DOSE_VAL_RX AS DOUBLE))=3.0), Row(SUBJECT_ID=13, DRUG=u'Potassium Chloride', count(1)=2, sum(CAST(DOSE_VAL_RX AS DOUBLE))=60.0), Row(SUBJECT_ID=36, DRUG=u'Cisatracurium Besylate', count(1)=1, sum(CAST(DOSE_VAL_RX AS DOUBLE))=100.0), Row(SUBJECT_ID=36, DRUG=u'Heparin Flush CVL  (100 units/ml)', count(1)=1, sum(CAST(DOSE_VAL_RX AS DOUBLE))=1.0), Row(SUBJECT_ID=36, DRUG=u'Lansoprazole Oral Disintegrating Tab', count(1)=1, sum(CAST(DOSE_VAL_RX AS DOUBLE))=30.0)]

I also tried using an alias in the SELECT statement, and it did not show up either.

Also, how do I save this result to a CSV file?

1 answer:

Answer 0 (score: 0)

You can use any of the three approaches below:

1) sqlContext.sql("SELECT SUBJECT_ID, DRUG, COUNT(*) total_subject, SUM(DOSE_VAL_RX) dosage\
                     from sel_meds_pats_icustays \
                     GROUP BY SUBJECT_ID , DRUG ").show()


2) # Rename the auto-generated columns afterwards. Their names match the
   # Row output shown in the question: 'count(1)' and
   # 'sum(CAST(DOSE_VAL_RX AS DOUBLE))', not '_c0'/'_c1'.
   sqlContext.sql("SELECT SUBJECT_ID, DRUG, COUNT(*), SUM(DOSE_VAL_RX) \
                     from sel_meds_pats_icustays \
                     GROUP BY SUBJECT_ID , DRUG ") \
       .withColumnRenamed("count(1)", "total_subjects") \
       .withColumnRenamed("sum(CAST(DOSE_VAL_RX AS DOUBLE))", "dosage").show()

from pyspark.sql.functions import col

3) # Select with explicit aliases; the generated names must match the
   # ones Spark actually produced (see the Row output in the question).
   sqlContext.sql("SELECT SUBJECT_ID, DRUG, COUNT(*), SUM(DOSE_VAL_RX) \
                     from sel_meds_pats_icustays \
                     GROUP BY SUBJECT_ID , DRUG ") \
       .select("SUBJECT_ID", "DRUG",
               col("count(1)").alias("total_subjects"),
               col("`sum(CAST(DOSE_VAL_RX AS DOUBLE))`").alias("dosage")).show()

Saving to CSV:

# On Spark 1.x, first install the com.databricks.spark.csv package
# (on Spark 2.0+ the CSV data source is built in).
# Note: save() writes a directory named testing.csv containing part files.
df.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save("testing.csv")