Filtering a DynamicFrame with AWS Glue or PySpark

Asked: 2018-05-09 15:26:58

Tags: python python-2.7 amazon-web-services pyspark aws-glue

I have a table named 'mytable' in my AWS Glue Data Catalog. The table lives in an on-premises Oracle database, reached through the connection 'mydb'.

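I want to filter the resulting DynamicFrame down to the rows whose X_DATETIME_INSERT column (a timestamp) is later than a given time (in this case '2018-05-07 04:00:00'). After that, I am trying to count the rows to confirm the count is low (the table has about 40,000 rows, but only a few should match the filter).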

Here is my current code:

import boto3
from datetime import datetime
import logging
import os
import pg8000
import pytz
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from base64 import b64decode
from pyspark.context import SparkContext
from pyspark.sql.functions import lit
## @params: [TempDir, JOB_NAME]
args = getResolvedOptions(sys.argv, ['TempDir','JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "mydb", table_name = "mytable", transformation_ctx = "datasource0")

# Try Glue native filtering    
filtered_df = Filter.apply(frame = datasource0, f = lambda x: x["X_DATETIME_INSERT"] > '2018-05-07 04:00:00')
filtered_df.count()

This code runs for 20 minutes and then times out. I have tried other variations:

df = datasource0.toDF()
df.where(df.X_DATETIME_INSERT > '2018-05-07 04:00:00').collect()

and

df.filter(df["X_DATETIME_INSERT"].gt(lit("'2018-05-07 04:00:00'")))

which also failed. What am I doing wrong? I am experienced with Python, but new to Glue and PySpark.
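For reference, here is a minimal sketch of two directions that might be worth trying. It is not a confirmed fix. It rests on several assumptions: that Glue hands timestamp columns to the Filter lambda as Python datetime objects (so comparing against a string is unreliable), that the extra single quotes inside lit("'2018-05-07 04:00:00'") prevent the string from being read as a timestamp, and that most of the 20 minutes is spent pulling all rows over JDBC before any filtering happens. The helper names (parse_cutoff, newer_than, oracle_pushdown_query, read_filtered) and every connection value are hypothetical placeholders, not part of Glue or Spark.

```python
from datetime import datetime

CUTOFF_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_cutoff(text):
    """Parse 'YYYY-MM-DD HH:MM:SS' into a datetime (raises ValueError on bad input)."""
    return datetime.strptime(text, CUTOFF_FORMAT)


def newer_than(cutoff, column="X_DATETIME_INSERT"):
    """Build a record predicate that compares datetime to datetime, never to a string.

    Assumption: the record exposes the column as a datetime (or None).
    """
    def predicate(record):
        value = record[column]
        return value is not None and value > cutoff
    return predicate


def oracle_pushdown_query(table, column, cutoff):
    """Build a subquery so that Oracle applies the filter before rows leave the database."""
    literal = cutoff.strftime(CUTOFF_FORMAT)
    return "(SELECT * FROM {0} WHERE {1} > TIMESTAMP '{2}') filtered".format(
        table, column, literal)


def read_filtered(spark, jdbc_url, user, password, table, column, cutoff_text):
    """Read only the matching rows through Spark's JDBC source.

    jdbc_url, user and password are placeholders for the real connection details.
    """
    query = oracle_pushdown_query(table, column, parse_cutoff(cutoff_text))
    return (spark.read.format("jdbc")
            .option("url", jdbc_url)
            .option("dbtable", query)  # a parenthesized subquery is accepted here
            .option("user", user)
            .option("password", password)
            .option("driver", "oracle.jdbc.OracleDriver")
            .load())
```

With these helpers, the Glue-native attempt would become Filter.apply(frame = datasource0, f = newer_than(parse_cutoff('2018-05-07 04:00:00'))), and the JDBC variant would be read_filtered(spark, ...).count(). Calling count() rather than collect() avoids shipping the matching rows back to the driver when only the number is needed.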

0 answers:

No answers