I installed Hadoop 2.7.3 and PySpark 2.2.0 on Ubuntu 17.04.
Hadoop and PySpark each appear to work fine on their own. However, I have not managed to read files from HDFS in PySpark. When I try to fetch a file from HDFS, I get the following error:
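The read being attempted looks roughly like the sketch below; the exact code and the error text are not shown in the post, and the file path is taken from the answer further down, so treat all of it as an assumption rather than a reproduction.

from pyspark import SparkContext

sc = SparkContext(appName="hdfs-read")

# Hypothetical version of the failing call: read a text file stored on HDFS.
lines = sc.textFile("/test/PySpark.txt")  # path assumed from the answer below
print(lines.first())                      # the action is where the error actually surfaces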
I read in another post that the environment variable HADOOP_CONF_DIR has to be set in order to access HDFS. I did that as well (see the next screenshot), but then I got a different error and PySpark stopped working altogether.
If I remove the environment variable, everything behaves as before.
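For reference, one common way to make that variable visible to a driver started from plain Python is to set it in the environment before the SparkContext is created; the configuration path below is only a guess, not something taken from the post.

import os

# Assumed location of the Hadoop client configuration (core-site.xml, hdfs-site.xml);
# point this at the etc/hadoop directory of your Hadoop 2.7.3 installation.
os.environ["HADOOP_CONF_DIR"] = "/usr/local/hadoop/etc/hadoop"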
How can I fix the problem of opening files from HDFS in PySpark? I have spent a long time on this and would greatly appreciate any help!
Answer 0 (score: 0)
Even though this answer comes a bit late, you should use hdfs:///test/PySpark.txt (note the three slashes).
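A minimal sketch of what that looks like in practice; the SparkContext boilerplate and the explicit NameNode address in the second variant are assumptions (localhost:9000 is just the common pseudo-distributed default), not part of the original answer.

from pyspark import SparkContext

sc = SparkContext(appName="hdfs-read-fixed")

# Three slashes: an absolute path on the default filesystem (fs.defaultFS from
# core-site.xml), i.e. HDFS rather than the local filesystem.
rdd = sc.textFile("hdfs:///test/PySpark.txt")
print(rdd.first())

# Equivalent form that names the NameNode explicitly; adjust host and port to match
# fs.defaultFS in your core-site.xml.
rdd = sc.textFile("hdfs://localhost:9000/test/PySpark.txt")
print(rdd.count())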