I am fetching data from a MySQL table in PySpark, as shown below:
df = sqlContext.read.format("jdbc").option("url", "{}:{}/{}".format(domain,port,mysqldb)).option("driver", "com.mysql.jdbc.Driver").option("dbtable", "(select ifnull(max(id),0) as maxval, ifnull(min(id),0) as minval, ifnull(min(test_time),'1900-01-01 00:00:00') as mintime, ifnull(max(test_time),'1900-01-01 00:00:00') as maxtime FROM `{}`) as `{}`".format(table, table)).option("user", "{}".format(mysql_user)).option("password", "{}".format(password)).load()
df.show()
The result is below:
+------+------+-------------------+-------------------+
|maxval|minval| mintime| maxtime|
+------+------+-------------------+-------------------+
| 1721| 1|2017-03-09 22:15:49|2017-12-14 05:17:04|
+------+------+-------------------+-------------------+
Now I want to get each column and its value separately. I want:
max_val = 1721
min_val = 1
min_time = 2017-03-09 22:15:49
max_time = 2017-12-14 05:17:04
I have done it as below:
max_val = df.select('maxval').collect()[0].asDict()['maxval']
min_val = df.select('minval').collect()[0].asDict()['minval']
max_time = df.select('maxtime').collect()[0].asDict()['maxtime']
min_time = df.select('mintime').collect()[0].asDict()['mintime']
Is there a better way to do this in PySpark?
Answer 0 (score: 2)
Currently you are calling collect four times, which is not cost-effective, since each collect triggers a separate Spark job. You can use a little plain Python to do this in a single pass. Here is one approach you can try:
df = (sqlContext.read.format("jdbc")
.option("url", "{}:{}/{}".format(domain,port,mysqldb))
.option("driver", "com.mysql.jdbc.Driver")
.option("dbtable", """(
select ifnull(max(id),0) as maxval, ifnull(min(id),0) as minval,
ifnull(min(test_time),'1900-01-01 00:00:00') as mintime,
ifnull(max(test_time), '1900-01-01 00:00:00') as maxtime
FROM `{}`) as `{}`""".format(table, table))
.option("user", "{}".format(mysql_user))
.option("password", "{}".format(password)).load())
# df.first() collects the single row once; asDict() maps column name -> value
for key, value in df.first().asDict().items():
    globals()[key] = value

print(minval)
print(maxval)
print(mintime)
print(maxtime)
This way you can convert the columns into variables. Let me know if you need further help.
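
As a side note, writing into globals() works, but explicit assignments are usually easier to follow and debug. A minimal sketch of the same single-collect idea, assuming the same column aliases (maxval, minval, mintime, maxtime) from your query:

# Collect the single aggregate row once, then unpack it by column name.
row = df.first()
max_val = row['maxval']
min_val = row['minval']
min_time = row['mintime']
max_time = row['maxtime']

print(max_val, min_val, min_time, max_time)

Either way, the DataFrame is collected once instead of four times.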