Has anyone parsed a millisecond epoch timestamp with from_json in Spark 2+? How is it done?

Spark 2 changed TimestampType to parse epoch numeric values as seconds rather than milliseconds.

My input is a Hive table that has a JSON-formatted string in one of its columns, which I am trying to parse like this:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val spark = SparkSession
  .builder
  .appName("Problematic Timestamps")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

val schema = StructType(
  StructField("categoryId", LongType) ::
  StructField("cleared", BooleanType) ::
  StructField("dataVersion", LongType) ::
  StructField("details", DataTypes.createArrayType(StringType)) ::
  …
  StructField("timestamp", TimestampType) ::
  StructField("version", StringType) :: Nil
)

val item_parsed = spark.sql("SELECT * FROM source.jsonStrInOrc")
  .select('itemid, 'locale,
    from_json('internalitem, schema) as 'internalitem,
    'version, 'createdat, 'modifiedat)

val item_flattened = item_parsed
  .select('itemid, 'locale,
    $"internalitem.*",
    'version as 'outer_version, 'createdat, 'modifiedat)
This parses rows whose column contains content like:

{"timestamp": 1494790299549, "cleared": false, "version": "V1", "dataVersion": 2, "categoryId": 2641, "details": [], …}
From the timestamp value 1494790299549 this gives me 49338-01-08 00:39:09.0, where I would rather get 2017-05-14 19:31:39.549, and would even settle for 2017-05-14 19:31:39.000.
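The mismatch can be reproduced without Spark at all. A minimal sketch using the plain JVM time API (illustrative only, not part of the pipeline above), interpreting the same epoch number both ways:

```scala
import java.time.Instant

// 1494790299549 read as epoch *milliseconds*: the date I actually want.
val asMillis = Instant.ofEpochMilli(1494790299549L)
println(asMillis)  // 2017-05-14T19:31:39.549Z

// The same number read as epoch *seconds*, which is how Spark 2's
// TimestampType treats a bare numeric value: the far-future year 49338.
val asSeconds = Instant.ofEpochSecond(1494790299549L)
println(asSeconds)
```

So the schema is not wrong per se; the value is simply a thousand times too large for the unit Spark assumes.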
Now I could set the schema type for the timestamp to long, then divide the value by 1000 and cast to a timestamp, but then I would have 2017-05-14 19:31:39.0 rather than 2017-05-14 19:31:39.549. I could not figure out how to either:

parse the millisecond timestamp with from_json (perhaps by subclassing TimestampType in some way for use in the schema), or
use a LongType in the schema and cast it to a timestamp that preserves the milliseconds.

Update: doing the division in the select and then casting is a perfectly valid method, but it did not look clean to me. I went with a UDF that uses java.sql.Timestamp:

import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, from_json, udf}
import org.apache.spark.sql.types.{BooleanType, DataTypes, IntegerType, LongType,
  StringType, StructField, StructType, TimestampType}

val tsmillis = udf { t: Long => new Timestamp(t) }

val spark = SparkSession
  .builder
  .appName("Problematic Timestamps")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

val schema = StructType(
  StructField("categoryId", LongType) ::
  StructField("cleared", BooleanType) ::
  StructField("dataVersion", LongType) ::
  StructField("details", DataTypes.createArrayType(StringType)) ::
  …
  StructField("timestamp", LongType) ::
  StructField("version", StringType) :: Nil
)

val item_parsed = spark.sql("SELECT * FROM source.jsonStrInOrc")
  .select('itemid, 'locale,
    from_json('internalitem, schema) as 'internalitem,
    'version, 'createdat, 'modifiedat)

val item_flattened = item_parsed
  .select('itemid, 'locale,
    $"internalitem.categoryId", $"internalitem.cleared",
    $"internalitem.dataVersion", $"internalitem.details",
    tsmillis($"internalitem.timestamp"),
    $"internalitem.version",
    'version as 'outer_version, 'createdat, 'modifiedat)
The java.sql.Timestamp(long) constructor is in fact specified in epoch milliseconds, so the milliseconds are preserved. I still want to see whether withColumn would be cleaner than doing this in the select.
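The property the UDF relies on can be verified without Spark; a small check (pinning the JVM timezone only so the printed string is deterministic):

```scala
import java.sql.Timestamp
import java.util.TimeZone

// Pin the default zone: Timestamp.toString renders in the JVM's local time.
TimeZone.setDefault(TimeZone.getTimeZone("UTC"))

// The constructor takes epoch *milliseconds*, so the .549 is kept.
val ts = new Timestamp(1494790299549L)
println(ts)  // 2017-05-14 19:31:39.549
```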
I think it would be worth performance testing to see whether the division-and-cast approach is faster than the udf.
Answer (score: 3)
"Now I could set the schema type for the timestamp to long, then divide the value by 1000"

Actually this is exactly what you need; just keep the correct types. Say you have only a Long timestamp field:
import org.apache.spark.sql.functions.lit

val df = spark.range(0, 1).select(lit(1494790299549L).alias("timestamp"))
// df: org.apache.spark.sql.DataFrame = [timestamp: bigint]
If you divide it by 1000:
val inSeconds = df.withColumn("timestamp_seconds", $"timestamp" / 1000)
// org.apache.spark.sql.DataFrame = [timestamp: bigint, timestamp_seconds: double]
you get the timestamp in seconds as a double (note this is SQL behavior, not Scala's). All that is left is a cast:
inSeconds.select($"timestamp_seconds".cast("timestamp")).show(false)
// +-----------------------+
// |timestamp_seconds |
// +-----------------------+
// |2017-05-14 21:31:39.549|
// +-----------------------+
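The "SQL, not Scala" caveat is the whole trick: Spark SQL's `/` on integers yields a double, so the fractional milliseconds survive the division, whereas plain Scala Long division would truncate them. A Spark-free sketch of the same arithmetic (variable names are illustrative):

```scala
import java.time.Instant

val millis = 1494790299549L

// Plain Scala Long division truncates, silently dropping the millis...
val truncated = millis / 1000    // 1494790299
// ...whereas Spark SQL's `/` returns a double, so the fraction survives.
val seconds = millis / 1000.0    // 1494790299.549

// Round-tripping the double back through milliseconds keeps the .549,
// which is what casting the double to timestamp does under the hood.
val restored = Instant.ofEpochMilli(math.round(seconds * 1000))
println(restored)  // 2017-05-14T19:31:39.549Z
```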