I have data stored on HDFS in Parquet format, and I want to process it with Spark.
Platform:
Ubuntu 16.04
Spark 2.1.3
Hadoop 2.6.5
Here is a listing of the directory that holds the data:
hdfs dfs -ls /databases/crawl_data_stats/scraped_metadata
Found 6 items
drwxr-xr-x - root supergroup 0 2019-06-13 12:30 /databases/crawl_data_stats/scraped_metadata/.metadata
drwxr-xr-x - root supergroup 0 2019-06-13 12:32 /databases/crawl_data_stats/scraped_metadata/.signals
-rw-r--r-- 1 root supergroup 87819081 2019-06-13 12:32 /databases/crawl_data_stats/scraped_metadata/3695c4ed-e140-4a01-aa27-bd29b5fb7be5.parquet
-rw-r--r-- 1 root supergroup 92307005 2019-06-13 12:31 /databases/crawl_data_stats/scraped_metadata/4fc7732b-2a7b-4a56-a034-16bc0393c0b9.parquet
-rw-r--r-- 1 root supergroup 69329182 2019-06-13 12:31 /databases/crawl_data_stats/scraped_metadata/a69db553-1ac7-469d-b55c-ff4133f4b8dc.parquet
-rw-r--r-- 1 root supergroup 90382508 2019-06-13 12:32 /databases/crawl_data_stats/scraped_metadata/d7ca247f-7832-4b0b-88b6-f940dcfe9df4.parquet
I tried to read the Parquet files like this:
temp = spark.read.parquet("hdfs://localhost:9000/databases/crawl_data_stats/scraped_metadata")
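For context, I'm running this in the pyspark shell. As a standalone script the same read would look roughly like the sketch below, with the stock SparkSession builder and an arbitrary app name:

from pyspark.sql import SparkSession

# Plain session; nothing custom beyond the (arbitrary) app name.
spark = SparkSession.builder.appName("read_scraped_metadata").getOrCreate()

# Point Spark at the directory; it picks up the *.parquet part files inside.
temp = spark.read.parquet(
    "hdfs://localhost:9000/databases/crawl_data_stats/scraped_metadata"
)
temp.printSchema()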
This is part of the error I get:
19/06/17 11:00:14 WARN DFSClient: DFS chooseDataNode: got # 1 IOException, will wait for 1300.3553468732534 msec.
19/06/17 11:00:16 WARN DFSClient: DFS chooseDataNode: got # 2 IOException, will wait for 4782.536317936689 msec.
19/06/17 11:00:21 WARN DFSClient: DFS chooseDataNode: got # 3 IOException, will wait for 6014.259159548854 msec.
19/06/17 11:00:27 WARN DFSClient: Could not obtain block: BP-317098980-127.0.0.1-1560408421762:blk_1073741878_1054 file=/databases/crawl_data_stats/scraped_metadata/3695c4ed-e140-4a01-aa27-bd29b5fb7be5.parquet No live nodes contain current block Block locations: Dead nodes: . Throwing a BlockMissingException
19/06/17 11:00:27 WARN DFSClient: Could not obtain block: BP-317098980-127.0.0.1-1560408421762:blk_1073741878_1054 file=/databases/crawl_data_stats/scraped_metadata/3695c4ed-e140-4a01-aa27-bd29b5fb7be5.parquet No live nodes contain current block Block locations: Dead nodes: . Throwing a BlockMissingException
19/06/17 11:00:27 WARN DFSClient: DFS Read
org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-317098980-127.0.0.1-1560408421762:blk_1073741878_1054 file=/databases/crawl_data_stats/scraped_metadata/3695c4ed-e140-4a01-aa27-bd29b5fb7be5.parquet
...
19/06/17 11:00:27 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/spark/spark-2.1.3-bin-hadoop2.6/python/pyspark/sql/readwriter.py", line 274, in parquet
return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
File "/usr/local/spark/spark-2.1.3-bin-hadoop2.6/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/local/spark/spark-2.1.3-bin-hadoop2.6/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/local/spark/spark-2.1.3-bin-hadoop2.6/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o27.parquet.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-317098980-127.0.0.1-1560408421762:blk_1073741878_1054 file=/databases/crawl_data_stats/scraped_metadata/3695c4ed-e140-4a01-aa27-bd29b5fb7be5.parquet
...
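Given the BlockMissingException, I suspect the problem is with HDFS itself rather than with Spark. Would checking block health with fsck be the right next step, e.g.:

hdfs fsck /databases/crawl_data_stats/scraped_metadata -files -blocks -locations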
Any help is appreciated.