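I have a CSV file that I want to read into an RDD or DataFrame. This works so far, but when I collect the data and convert it to a pandas DataFrame to display it as a table, the result is "malformed".
Here is how I read the CSV file: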
import csv
import os
NUMERIC_DATA_FILE = os.path.join(DATA_DIR, "train_numeric.csv")
numeric_rdd = sc.textFile(NUMERIC_DATA_FILE)
numeric_rdd = numeric_rdd.mapPartitions(lambda x: csv.reader(x, delimiter=","))
numeric_df = sqlContext.createDataFrame(numeric_rdd)
numeric_df.registerTempTable("numeric")
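The result looks like this:
Is there a simple way to correctly use the first row of the CSV data as the column names and the first column as the index?
A further problem comes up when I try to select a column from the DataFrame: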
numeric_df.select("SELECT Id FROM numeric")
This gives me:
AnalysisException: u"cannot resolve 'SELECT Id FROM numeric' given input columns _799, _640, _963, _70, _364, _143, _167,
_156, _553, _835, _780, _235, ...
Answer 0 (score: 0)
Your PySpark DataFrame has no schema assigned. You should replace your code with the code below:
import csv
import os
from pyspark.sql.types import StructType, StructField, FloatType
NUMERIC_DATA_FILE = sc.textFile(os.path.join(DATA_DIR, "train_numeric.csv"))
# Extract the header line
header = NUMERIC_DATA_FILE.first()
# Assuming that all the columns are numeric, create a StructField for each column name.
# The header is a single string, so split it on the delimiter first.
fields = [StructField(field_name, FloatType(), True) for field_name in header.split(",")]
Now we can build our schema:
schema = StructType(fields)
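One detail worth checking when building the fields: the value returned by first() on a text-file RDD is a single string, so looping over it directly visits characters rather than column names. The sketch below uses a made-up header line (the names col_a and col_b are hypothetical, only Id comes from the question) to show the difference in plain Python.

```python
# Hypothetical header line standing in for the string returned by RDD.first().
header = "Id,col_a,col_b"

# Iterating over the string itself yields one character at a time.
chars = [c for c in header]

# Splitting on the delimiter yields one entry per column.
column_names = header.split(",")

print(chars[:3])      # ['I', 'd', ',']
print(column_names)   # ['Id', 'col_a', 'col_b']
```

Only the split version gives one schema field per column.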
# We have to remove the header from the text-file RDD.
# Filtering on the exact header line is safer than matching any line that contains "Id".
dataNoHeader = NUMERIC_DATA_FILE.filter(lambda line: line != header)
# Parse each line and convert the string values to floats so they match FloatType (empty cells become None)
numeric_temp_rdd = dataNoHeader.mapPartitions(lambda lines: ([float(v) if v else None for v in row] for row in csv.reader(lines, delimiter=",")))
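As far as I know, createDataFrame() checks values against the schema by default, so a FloatType column expects Python floats rather than the strings that csv.reader produces. Here is a minimal sketch of a per-partition parser that does the conversion. The function name parse_partition and the sample lines are made up for illustration, and it assumes an empty cell should become a null value.

```python
import csv


def parse_partition(lines):
    """Parse an iterable of CSV lines into lists of floats (None for empty cells)."""
    for row in csv.reader(lines, delimiter=","):
        yield [float(cell) if cell != "" else None for cell in row]


# Hypothetical sample lines, including empty cells.
sample_lines = ["1,0.5,", "2,,3.25"]
parsed = list(parse_partition(sample_lines))
print(parsed)  # [[1.0, 0.5, None], [2.0, None, 3.25]]
```

Because it takes an iterator of lines and yields rows, a function like this can be passed straight to mapPartitions in place of the lambda.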
Pass the schema as an argument to the createDataFrame() function:
numeric_df = sqlContext.createDataFrame(numeric_temp_rdd,schema)
numeric_df.registerTempTable("numeric")
Now, if you want to convert this DataFrame to a pandas DataFrame, use the toPandas() function:
pandas_df = numeric_df.limit(5).toPandas()
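# select() takes column names, not a SQL string
numeric_df.select("Id")
# To run a SQL query against the registered temp table, use sqlContext.sql() instead
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
sqlContext.sql('SELECT Id from numeric')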
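If you are on Spark 2.0 or later (an assumption, since the code above uses the older sqlContext API), the built-in CSV reader can take the column names from the first line for you. The following is only a sketch: the helper names load_numeric_csv and first_rows_as_pandas are made up, and it assumes the file has an Id column that you want as the pandas index.

```python
def load_numeric_csv(spark, path):
    """Read a CSV file whose first line is a header into a Spark DataFrame."""
    # header=True uses the first line as column names;
    # inferSchema=True asks Spark to guess the column types from the data.
    return spark.read.csv(path, header=True, inferSchema=True)


def first_rows_as_pandas(df, n=5, index_col="Id"):
    """Collect the first n rows and use index_col as the pandas index."""
    return df.limit(n).toPandas().set_index(index_col)


# Usage (requires pyspark and pandas to be installed):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.getOrCreate()
#   numeric_df = load_numeric_csv(spark, "train_numeric.csv")
#   pandas_df = first_rows_as_pandas(numeric_df)
```

Keep in mind that inferSchema needs an extra pass over the file, so for a very large CSV an explicit schema like the one above may still be the better choice.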