I am trying to read Postgres data into my Spark DataFrame. Everything works fine until I load a column whose values are JSON arrays: if I retrieve one month of data, the query times out, and retrieving one day of data instead of the last 5 minutes takes about 30 minutes. On top of that, the JSON-array column arrives as a plain string, so I can't apply any UDFs to it.

How can I load more than one day of data in less time?

How can I get the column parsed as JSON (an array of strings) instead of a plain string?
import time
import pyspark
from pyspark.sql import SQLContext

conf = pyspark.SparkConf().setAll([
    ("spark.driver.extraClassPath", "/usr/local/bin/postgresql-42.2.5.jar:/usr/local/jar/gcs-connector-hadoop2-latest.jar"),
    ("spark.executor.instances", "4"),
    ("spark.executor.cores", "4"),
    ("spark.executor.memory", "10g"),
    ("spark.driver.memory", "10g"),
    ("spark.driver.maxResultSize", "10g"),
    ("spark.memory.offHeap.enabled", "true"),
    ("spark.memory.offHeap.size", "40g"),
])
sc = pyspark.SparkContext(conf=conf)
sc.getConf().getAll()
sqlContext = SQLContext(sc)
query = """
select
users.userid
,users.createdat as users_createdat
,jsoncolumn
from users
left join actions
on users.userid = actions.userid
where users.createdat between '{start_date}' and '{end_date}'
and geo->>'country'='US'
""".format(start_date=start_date, end_date = end_date)
url = 'postgresql://url'
df_join = sqlContext.read.format("jdbc")\
    .option("url", 'jdbc:%s' % url)\
    .option("user", 'username')\
    .option("password", 'pass')\
    .option("query", query)\
    .load()
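Note that this read goes through a single JDBC connection, which is likely why it scales so badly with the date range. For reference, a partitioned variant is sketched below; it assumes Spark 2.4+ (date/timestamp partition columns), and since partitionColumn cannot be combined with the "query" option, the same SQL is wrapped in a subquery and passed via "dbtable". The partition count of 16 is a placeholder to tune.

# Sketch only: split the read into parallel JDBC partitions on users_createdat,
# using the dates already bounding the query as the partition bounds.
df_join = sqlContext.read.format("jdbc")\
    .option("url", 'jdbc:%s' % url)\
    .option("user", 'username')\
    .option("password", 'pass')\
    .option("dbtable", '({q}) as subq'.format(q=query))\
    .option("partitionColumn", "users_createdat")\
    .option("lowerBound", start_date)\
    .option("upperBound", end_date)\
    .option("numPartitions", "16")\
    .load()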
start_time = time.time()
df_join.cache()
df_join.collect()  # collect() pulls every row back to the driver, so this times the full transfer
end_time = time.time()
print(end_time - start_time)
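On the second question, a JSON column that arrives as a string can be decoded with from_json once the element schema is known. A minimal sketch, assuming jsoncolumn really holds a JSON array of strings (substitute the actual element schema otherwise):

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

# Parse the string column into a real array<string>; rows that fail to
# parse become null instead of raising an error.
df_parsed = df_join.withColumn(
    "jsoncolumn_parsed",
    F.from_json(F.col("jsoncolumn"), ArrayType(StringType()))
)
df_parsed.select("userid", F.explode("jsoncolumn_parsed").alias("element")).show(5)

Once the column is a proper array type, ordinary UDFs and the built-in array functions both apply to it.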