Running with SparkContext = local
from pyspark import SparkContext, SparkConf, SQLContext
sc = SparkContext('local', 'pyspark')
sqlContext = SQLContext(sc)
path = "/root/users.parquet"
sqlContext.read.parquet(path).printSchema()
Output:
root
 |-- name: string (nullable = false)
 |-- favorite_color: string (nullable = true)
 |-- favorite_numbers: array (nullable = false)
 |    |-- element: integer (containsNull = false)
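For reference, in this local run the driver reads /root/users.parquet straight off its own filesystem, so a quick sanity check in the same session works as expected (a minimal sketch, reusing the sc, sqlContext, and path defined above):

# Sanity check in the same local session: load the file and inspect it.
df = sqlContext.read.parquet(path)
df.show()          # prints the rows of users.parquet
print(df.count())  # tiny file, so counting is cheap here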
Running with SparkContext = master [Master with 4 Slaves]
from pyspark import SparkContext, SparkConf, SQLContext
appName = "SparkClusterEvalPOC"
master = "spark://<masterHostName>:7077"
conf = SparkConf().setAppName(appName).setMaster(master)
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
path = "/root/users.parquet"
sqlContext.read.parquet(path).printSchema()
Output:
16/02/01 09:16:30 WARN TaskSetManager: Lost task 111.0 in stage 0.0 (TID 111, 10.16.34.110): java.io.IOException: Could not read footer: java.io.IOException: Could not read footer for file FileStatus{path=file:/root/users.parquet; isDirectory=false; length=615; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
at org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:247)
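(For context: the lost task ran on worker 10.16.34.110, and the FileStatus path file:/root/users.parquet suggests that worker tried to read the file from its own local filesystem, where it presumably does not exist. A minimal sketch of the same read against storage visible to every node, assuming the file has been uploaded to HDFS; <namenode> is a hypothetical placeholder, not from the original setup:

# Sketch only: assumes users.parquet was first copied into HDFS, e.g.
#   hdfs dfs -put /root/users.parquet /data/users.parquet
# "<namenode>" is a hypothetical placeholder for the HDFS namenode host.
path = "hdfs://<namenode>:8020/data/users.parquet"
sqlContext.read.parquet(path).printSchema()
)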
Any help would be appreciated.