I want to read data from a CSV file with the following format:
HIERARCHYELEMENTID, REFERENCETIMESTAMP, VALUE
LOUTHNMA,"2014-12-03 00:00:00.0",0.004433333289
LOUTHNMA,"2014-12-03 00:15:00.0",0.004022222182
LOUTHNMA,"2014-12-03 00:30:00.0",0.0037666666289999998
LOUTHNMA,"2014-12-03 00:45:00.0",0.003522222187
LOUTHNMA,"2014-12-03 01:00:00.0",0.0033333332999999996
I read this file with the following PySpark function:
from pyspark.sql.functions import unix_timestamp

# Define a specific function to load flow data with schema
def load_flow_data(sqlContext, filename, timeFormat):
    # Columns we're interested in
    flow_columns = ['DMAID', 'TimeStamp', 'Value']
    df = load_data(sqlContext, filename, flow_schema, flow_columns)
    # Convert the timestamp column from string to timestamp
    col = unix_timestamp(df['TimeStamp'], timeFormat).cast("timestamp")
    df = df.withColumn('realTimeStamp', col)
    return df
with the following schema and helper function:
from pyspark.sql.types import StructType, StructField, StringType, FloatType

flow_schema = StructType([
    StructField('DMAID', StringType(), True),
    StructField('TimeStamp', StringType(), True),
    StructField('Value', FloatType(), True)
])

def load_data(sqlContext, filename, schema=None, columns=None):
    # If no schema is specified, then infer the schema automatically
    if schema is None:
        df = sqlContext.read.format('com.databricks.spark.csv'). \
            option('header', 'true').option('inferschema', 'true'). \
            option('mode', 'DROPMALFORMED'). \
            load(filename)
    else:
        df = sqlContext.read.format('com.databricks.spark.csv') \
            .options(header='true').load(filename, schema=schema)
    # If no columns are specified, then select all columns
    if columns is None:
        columns = schema.names
    df = df.select(columns)
    return df
I load the data from the CSV file with these commands:
timeFormat = "yyyy-MM-dd HH:mm:SS"
df_flow_DMA = load_flow_data(sqlContext, flow_file, timeFormat)
I then convert this DataFrame to pandas for visualization.
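The conversion step itself is nothing special, roughly the following (the pdf_flow_DMA name is just for illustration, and the plotting code is omitted):

pdf_flow_DMA = df_flow_DMA.toPandas()  # pandas DataFrame used only for plotting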
However, I have found that the line col = unix_timestamp(df['TimeStamp'], timeFormat).cast("timestamp") is mapping distinct date & time strings from the CSV file (in the 'TimeStamp' field) to the same 'realTimeStamp' value, as shown in the attached screenshot.
I suspect the problem is related to the datetime format string I pass to load_flow_data; I have tried several variations, but nothing seems to work.
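In case it helps, this is a minimal way to test the format string in isolation, using two of the timestamp strings from the sample above (just a sketch; test_df is a throwaway name, and it assumes the same sqlContext and timeFormat as in the code above):

from pyspark.sql.functions import unix_timestamp

# Build a tiny DataFrame with two of the raw timestamp strings
test_df = sqlContext.createDataFrame(
    [('2014-12-03 00:00:00.0',), ('2014-12-03 00:15:00.0',)],
    ['TimeStamp'])

# Apply the same conversion as in load_flow_data and print both columns
test_df.withColumn('realTimeStamp',
                   unix_timestamp(test_df['TimeStamp'], timeFormat).cast('timestamp')) \
       .show(10, False)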
Can someone point out what is wrong with my code? I am using Python 2.7 and Spark 1.6.
Cheers