Duplicated timestamps when reading data from a CSV file with PySpark

Date: 2016-08-11 09:35:04

Tags: python csv datetime timestamp pyspark

I want to read data from a CSV file with the following format:

HIERARCHYELEMENTID, REFERENCETIMESTAMP, VALUE
LOUTHNMA,"2014-12-03 00:00:00.0",0.004433333289
LOUTHNMA,"2014-12-03 00:15:00.0",0.004022222182
LOUTHNMA,"2014-12-03 00:30:00.0",0.0037666666289999998
LOUTHNMA,"2014-12-03 00:45:00.0",0.003522222187
LOUTHNMA,"2014-12-03 01:00:00.0",0.0033333332999999996   

I read from this file using the following PySpark function:

# Define a specific function to load flow data with a schema
from pyspark.sql.functions import unix_timestamp

def load_flow_data(sqlContext, filename, timeFormat):
    # Columns we're interested in
    flow_columns = ['DMAID', 'TimeStamp', 'Value']
    df = load_data(sqlContext, filename, flow_schema, flow_columns)

    # Convert the timestamp column from string to timestamp type
    col = unix_timestamp(df['TimeStamp'], timeFormat).cast("timestamp")
    df = df.withColumn('realTimeStamp', col)

    return df

with the following schema and helper function:

from pyspark.sql.types import StructType, StructField, StringType, FloatType

flow_schema = StructType([
    StructField('DMAID', StringType(), True),
    StructField('TimeStamp', StringType(), True),
    StructField('Value', FloatType(), True)
])

def load_data(sqlContext, filename, schema=None, columns=None):
    # If no schema is specified, then infer the schema automatically
    if schema is None:
        df = sqlContext.read.format('com.databricks.spark.csv'). \
            option('header', 'true').option('inferschema', 'true'). \
            option('mode', 'DROPMALFORMED'). \
            load(filename)
    else:
        df = sqlContext.read.format('com.databricks.spark.csv'). \
            options(header='true').load(filename, schema=schema)

    # If no columns are specified, then select all columns
    if columns is None:
        columns = df.columns
    df = df.select(columns)

    return df

I load the data from the CSV file with these commands:

timeFormat = "yyyy-MM-dd HH:mm:SS"

df_flow_DMA = load_flow_data(sqlContext, flow_file, timeFormat)
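
To illustrate what I am seeing, here is a minimal check that prints the raw strings next to the parsed values (assuming flow_file points at the sample data shown above):

# Compare the original strings with the parsed timestamps side by side
df_flow_DMA.select('TimeStamp', 'realTimeStamp').show(10, truncate=False)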

Then I convert this dataframe to Pandas for visualization.
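
The conversion step itself is nothing special; roughly this (a sketch, with the plotting details omitted):

# Collect the Spark dataframe to the driver as a Pandas dataframe
pdf = df_flow_DMA.toPandas()
# Plot Value over time (requires matplotlib)
pdf.plot(x='realTimeStamp', y='Value')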

However, I found that the line col = unix_timestamp(df['TimeStamp'], timeFormat).cast("timestamp") maps different date & time strings from the CSV file (found in the 'TimeStamp' field) to the same 'realTimeStamp' value, as shown in the attached screenshot.

(screenshot: rows with distinct 'TimeStamp' strings showing the same 'realTimeStamp' value)

I suspect the problem is related to the date/time format string that I pass to load_flow_data; I have tried several variations, but none of them seems to work.
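
For instance, I tried variations along these lines (the exact strings below are only illustrative examples of that kind of change):

# A few format-string variations of the kind tried (illustrative; none helped)
timeFormat = "yyyy-MM-dd HH:mm:ss"
timeFormat = "yyyy-MM-dd HH:mm:ss.S"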

Can someone point out what is wrong with my code? I am using Python 2.7 and Spark 1.6.

Cheers

0 Answers:

No answers yet.