我需要在AWS Glue Classifie中定义一个grok模式,以便在文件的datestamp
列上捕获datetime
毫秒(由AWS Glue Crawler转换为string
。我使用了AWS Glue中预定义的DATESTAMP_EVENTLOG
,并尝试将毫秒添加到模式中。
分类:datetime
Grok模式:%{DATESTAMP_EVENTLOG:string}
自定义模式:
MILLISECONDS (\d){3,7}
DATESTAMP_EVENTLOG %{YEAR}-%{MONTHNUM}-%{MONTHDAY}T%{HOUR}:%{MINUTE}:%{SECOND}.%{MILLISECONDS}
答案 0 :(得分:1)
我也无法弄清楚如何使用分类器,但我最终通过将自定义转换写入映射脚本(python)将时间戳从字符串转换为datetime。
在我的工作代码下面。 col2是一个将链接器指定为字符串的列,在这里我将它转换为python datetime。
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from datetime import datetime
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "s3_events", table_name = "events", transformation_ctx = "datasource0")
def convert_dates(rec):
rec["col2"] = datetime.strptime(rec["col2"], "%d.%m.%Y")
return rec
custommapping1 = Map.apply(frame = datasource0, f = convert_dates, transformation_ctx = "custommapping1")
applymapping1 = ApplyMapping.apply(frame = custommapping1, mappings = [("col0", "string", "col0", "string"), ("col1", "string", "col1", "string"), ("col2", "date", "col2", "date")], transformation_ctx = "applymapping1")
selectfields2 = SelectFields.apply(frame = applymapping1, paths = ["col2", "col0", "col1"], transformation_ctx = "selectfields2")
resolvechoice3 = ResolveChoice.apply(frame = selectfields2, choice = "MATCH_CATALOG", database = "mydb", table_name = "mytable", transformation_ctx = "resolvechoice3")
resolvechoice4 = ResolveChoice.apply(frame = resolvechoice3, choice = "make_cols", transformation_ctx = "resolvechoice4")
datasink5 = glueContext.write_dynamic_frame.from_catalog(frame = resolvechoice4, database = "mydb", table_name = "mytable", transformation_ctx = "datasink5")
job.commit()
答案 1 :(得分:1)
对分类器的误解是,除了用于内置文件格式(例如JSON,CSV等)之外,它们还用于指定文件格式。而不是用于指定单个数据类型的解析格式。
@lilline用户建议更改数据类型的最佳方法是使用ApplyMapping函数。
创建Glue作业时,您可以选择以下选项:由AWS Glue生成的建议脚本
然后,当从“胶水目录”中选择表格作为源时,可以更改数据类型,列名等。
输出代码可能类似于以下内容:
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("paymentid", "string", "paymentid", "string"), ("updateddateutc", "string", "updateddateutc", "timestamp"), ...], transformation_ctx = "applymapping1")
有效地将更新的dateutc字符串强制转换为时间戳。
要创建分类器,您需要指定文件中的每一列。
Classifier type: Grok
Classification: Name Grok
pattern: %{MY_TIMESTAMP}
Custom patterns MY_TIMESTAMP (%{USERNAME:test}[,]%{YEAR:year}[-]%{MONTHNUM:mm}[-]%{MONTHDAY:dd} %{TIME:time})