Parsing millisecond timestamps with from_json in Spark 2

Date: 2017-06-01 17:57:56

Tags: json parsing apache-spark

Has anyone parsed millisecond timestamps with from_json in Spark 2+? How is it done?

So Spark changed TimestampType to parse epoch numeric values as seconds rather than milliseconds in v2.

My input is a Hive table that has a JSON-formatted string in one column, which I am trying to parse like this:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{BooleanType, DataTypes, LongType, StringType, StructField, StructType, TimestampType}

val spark = SparkSession
  .builder
  .appName("Problematic Timestamps")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._
val schema = StructType(
  StructField("categoryId", LongType) ::
  StructField("cleared", BooleanType) ::
  StructField("dataVersion", LongType) ::
  StructField("details", DataTypes.createArrayType(StringType)) ::
  …
  StructField("timestamp", TimestampType) ::
  StructField("version", StringType) :: Nil
)
val item_parsed =
    spark.sql("select * FROM source.jsonStrInOrc")
    .select('itemid, 'locale,
            from_json('internalitem, schema)
                as 'internalitem,
            'version, 'createdat, 'modifiedat)
val item_flattened = item_parsed
    .select('itemid, 'locale,
            $"internalitem.*",
            'version as 'outer_version, 'createdat, 'modifiedat)

This parses rows where the column contains content like:

{"timestamp":1494790299549,"cleared":false,"version":"V1","dataVersion":2,"categoryId":2641,"details":[],...}

That way I can get a timestamp field, but the value 1494790299549 parses to 49338-01-08 00:39:09.0, where I would much rather see 2017-05-14 19:31:39.549.
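As a quick sanity check (my addition, not part of the original question), the two interpretations of the raw value can be compared on the plain JVM without Spark, using java.time:

```scala
import java.time.Instant
import java.time.ZoneOffset

val raw = 1494790299549L

// Interpreted as epoch *milliseconds*: the date the question wants.
val asMillis = Instant.ofEpochMilli(raw)
println(asMillis)  // 2017-05-14T19:31:39.549Z

// Interpreted as epoch *seconds* (Spark 2's TimestampType behavior):
// a date roughly 47,000 years in the future.
val asSeconds = Instant.ofEpochSecond(raw)
println(asSeconds.atOffset(ZoneOffset.UTC).getYear)  // 49338
```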

Now I could set the schema for the timestamp to long, then divide the value by 1000 and cast it to a timestamp, but then I would be getting 2017-05-14 19:31:39.549 from a manual conversion rather than from from_json itself. I cannot figure out how to either:

  • Tell TimestampType to parse epoch values as millisecond timestamps (perhaps by somehow subclassing it for use in the schema), or
  • Use java.sql.Timestamp in the schema and convert it into a timestamp that preserves the milliseconds.
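For what it is worth, `java.sql.Timestamp` already has a constructor that takes epoch milliseconds directly, which is exactly what the UDF in the appendix relies on. A minimal, Spark-free sketch:

```scala
import java.sql.Timestamp

// The Long constructor interprets its argument as milliseconds since the
// epoch, so the .549 fraction is kept as 549,000,000 nanoseconds.
val ts = new Timestamp(1494790299549L)
println(ts.getTime)   // 1494790299549
println(ts.getNanos)  // 549000000
```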

UDF appendix

I found that attempting to do the division in the select did not look clean to me, even though it is a perfectly valid approach. I opted for a UDF that exploits the fact that the value is actually specified in epoch milliseconds:

import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, from_json, udf}
import org.apache.spark.sql.types.{BooleanType, DataTypes, IntegerType, LongType, StringType, StructField, StructType, TimestampType}

val tsmillis = udf { t: Long => new Timestamp(t) }

val spark = SparkSession
  .builder
  .appName("Problematic Timestamps")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._
val schema = StructType(
  StructField("categoryId", LongType) ::
  StructField("cleared", BooleanType) ::
  StructField("dataVersion", LongType) ::
  StructField("details", DataTypes.createArrayType(StringType)) ::
  …
  StructField("timestamp", LongType) ::
  StructField("version", StringType) :: Nil
)
val item_parsed =
    spark.sql("select * FROM source.jsonStrInOrc")
    .select('itemid, 'locale,
            from_json('internalitem, schema)
                as 'internalitem,
            'version, 'createdat, 'modifiedat)
val item_flattened = item_parsed
    .select('itemid, 'locale,
            $"internalitem.categoryId",
            $"internalitem.cleared",
            $"internalitem.dataVersion",
            $"internalitem.details",
            tsmillis($"internalitem.timestamp"),
            $"internalitem.version",
            'version as 'outer_version, 'createdat, 'modifiedat)

withColumn

See how that reads compared with doing it in the select. I think it would be worth a performance test to see whether dividing and casting with withColumn is faster than the udf.

1 Answer:

Answer 0 (score: 3)

"Now I could set the schema for the timestamp to long, then divide the value by 1000"

Actually this is exactly what you need, just keep the correct types. Suppose you have only the Long timestamp field:

val df = spark.range(0, 1).select(lit(1494790299549L).alias("timestamp"))
// df: org.apache.spark.sql.DataFrame = [timestamp: bigint]

If you divide it by 1000:

val inSeconds = df.withColumn("timestamp_seconds", $"timestamp" / 1000)
// org.apache.spark.sql.DataFrame = [timestamp: bigint, timestamp_seconds: double]

you will get the timestamp in seconds as a double (note that this is SQL division behavior, not Scala's, so the fractional part is preserved).
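The SQL-versus-Scala distinction matters: in plain Scala, dividing a Long by the Int 1000 truncates, while Spark SQL's `/` promotes to double and keeps the fraction. A small sketch of the difference outside Spark, using the value from the question:

```scala
val millis = 1494790299549L

// Plain Scala integer division: the millisecond fraction is lost.
val truncated = millis / 1000
println(truncated)  // 1494790299

// Dividing by a Double mirrors Spark SQL's `/`: fraction preserved.
val seconds = millis / 1000.0
println(seconds)  // ~1.494790299549E9
```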

All that is left is a cast:

inSeconds.select($"timestamp_seconds".cast("timestamp")).show(false)
// +-----------------------+
// |timestamp_seconds      |
// +-----------------------+
// |2017-05-14 21:31:39.549|
// +-----------------------+
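One caveat about this output (my note, not the answerer's): `show` renders timestamps in the JVM's default timezone, which is why the value above reads 21:31 even though 1494790299549 ms is 19:31:39.549 UTC; the answerer was presumably in a UTC+2 zone. The same instant rendered in two zones on the plain JVM:

```scala
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter

val instant = Instant.ofEpochMilli(1494790299549L)
val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS")

// Same instant, two wall clocks (Europe/Paris was UTC+2 in May 2017).
println(fmt.format(instant.atZone(ZoneId.of("UTC"))))           // 2017-05-14 19:31:39.549
println(fmt.format(instant.atZone(ZoneId.of("Europe/Paris"))))  // 2017-05-14 21:31:39.549
```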