Failed to access HDFS from PySpark

Date: 2017-09-21 12:38:06

Tags: ubuntu hadoop apache-spark pyspark hdfs

I have installed Hadoop 2.7.3 and PySpark 2.2.0 on Ubuntu 17.04.

Hadoop and PySpark each seem to work fine on their own. However, I have not managed to read a file from HDFS in PySpark. When I try to fetch a file from HDFS, I get the following error:

https://imgur.com/j6Dy2u7
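For context, a minimal read attempt of the kind described above might look like the sketch below. The namenode address is an assumption (the actual code is only visible in the screenshot); Hadoop 2.x single-node setups commonly configure fs.defaultFS as hdfs://localhost:9000 in core-site.xml, and the file path is taken from the answer further down.

    from pyspark import SparkContext

    sc = SparkContext(appName="HdfsReadTest")

    # Assumed namenode host/port and file path -- adjust to your setup.
    # With a fully qualified URI, Spark contacts the namenode directly.
    rdd = sc.textFile("hdfs://localhost:9000/test/PySpark.txt")
    print(rdd.count())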

I read in another post that the environment variable HADOOP_CONF_DIR needs to be set in order to access HDFS. I did that too (see the next screenshot), but then I got a different error and PySpark no longer worked at all.

https://imgur.com/AMpJ6TB

If I remove the environment variable, everything is the same as before.
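For reference, HADOOP_CONF_DIR points Spark at the directory containing core-site.xml and hdfs-site.xml. It is usually exported in the shell (e.g. in ~/.bashrc) before launching PySpark, but as a minimal sketch it can also be set from Python before the SparkContext is created, since the launched JVM inherits the driver's environment. The path below is an assumption; adjust it to your Hadoop installation.

    import os

    # Hypothetical location of the Hadoop config files (core-site.xml,
    # hdfs-site.xml); must be set before the SparkContext starts the JVM.
    os.environ["HADOOP_CONF_DIR"] = "/usr/local/hadoop/etc/hadoop"

    from pyspark import SparkContext
    sc = SparkContext(appName="HdfsConfDirTest")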

How can I solve the problem of opening files from HDFS in PySpark? I have spent a long time on this and would greatly appreciate any help!

1 Answer:

Answer 0: (score: 0)

Although this answer is a bit late, you should use hdfs:///test/PySpark.txt (note the three slashes).
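As a minimal sketch of this suggestion: with HADOOP_CONF_DIR (or fs.defaultFS) configured, the scheme-only URI hdfs:/// lets Hadoop fill in the namenode from core-site.xml, so the path needs no host:port. The app name below is arbitrary.

    from pyspark import SparkContext

    sc = SparkContext(appName="HdfsThreeSlashes")

    # "hdfs:///" (three slashes, no host:port) resolves the namenode from
    # fs.defaultFS in core-site.xml rather than hard-coding it in the URI.
    rdd = sc.textFile("hdfs:///test/PySpark.txt")
    print(rdd.take(5))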