So all I want to do is convert the fields
year, month, day, hour, minute
(which are integer types, as shown below) to a string type.
I have a dataframe df_src of type <class 'pyspark.sql.dataframe.DataFrame'> and this is its schema:
root
|-- src_ip: string (nullable = true)
|-- year: integer (nullable = true)
|-- month: integer (nullable = true)
|-- day: integer (nullable = true)
|-- hour: integer (nullable = true)
|-- minute: integer (nullable = true)
I had also previously declared a function:
def parse_df_to_string(year, month, day, hour=0, minute=0):
second = 0
return "{0:04d}-{1:02d}-{2:02d} {3:02d}:{4:02d}:{5:02d}".format(year, month, day, hour, minute, second)
I also ran a test, and it works like a charm:
print parse_df_to_string(2016, 10, 15, 21)
print type(parse_df_to_string(2016, 10, 15, 21))
2016-10-15 21:00:00
<type 'str'>
So I then declared a udf with the Spark API:
from pyspark.sql.functions import udf
u_parse_df_to_string = udf(parse_df_to_string)
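For what it's worth, the return type can also be passed explicitly; this is just a sketch of the same declaration, since StringType is already the default:
from pyspark.sql.types import StringType
# Equivalent declaration with an explicit return type:
u_parse_df_to_string = udf(parse_df_to_string, StringType())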
And finally this query:
df_src.select('*',
u_parse_df_to_string(df_src['year'], df_src['month'], df_src['day'], df_src['hour'], df_src['minute'])
).show()
leads to:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-126-770b587e10e6> in <module>()
25 # Could not make this part wor..
26 df_src.select('*',
---> 27 u_parse_df_to_string(df_src['year'], df_src['month'], df_src['day'], df_src['hour'], df_src['minute'])
28 ).show()
/opt/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/dataframe.pyc in show(self, n, truncate)
285 +---+-----+
286 """
--> 287 print(self._jdf.showString(n, truncate))
288
289 def __repr__(self):
/opt/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
931 answer = self.gateway_client.send_command(command)
932 return_value = get_return_value(
--> 933 answer, self.gateway_client, self.target_id, self.name)
934
935 for temp_arg in temp_args:
/opt/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.pyc in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
...
Py4JJavaError: An error occurred while calling o5074.showString.
: java.lang.UnsupportedOperationException: Cannot evaluate expression: parse_df_to_string(input[1, int, true], input[2, int, true], input[3, int, true], input[4, int, true], input[5, int, true])
at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:224)
at org.apache.spark.sql.execution.python.PythonUDF.doGenCode(PythonUDF.scala:27)
at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:104)
at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:101)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:101)
at org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext$$anonfun$generateExpressions$1.apply(CodeGenerator.scala:740)
at org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext$$anonfun$generateExpressions$1.apply(CodeGenerator.scala:740)
...
I tried a lot of things; I tried calling the method with only a single parameter, with keyword arguments... but nothing helped.
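For illustration, one of those attempts looked roughly like this (a hypothetical sketch of a single-column udf, not my exact code; show() still raised the same exception):
# Hypothetical single-argument variant -- show() failed the same way:
u_year_to_string = udf(lambda y: "{0:04d}".format(y))
df_src.select(u_year_to_string(df_src['year'])).show()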
One way that does work is to create a new dataframe with the new column, like this:
from pyspark.sql.functions import concat, col, lit

df_src_grp_hr_d = df_src.select('*', concat(
    col("year"),
    lit("-"),
    col("month"),
    lit("-"),
    col("day"),
    lit(" "),
    col("hour"),
    lit(":0")).alias('time'))
After that I can cast the column to a timestamp:
df_src_grp_hr_to_timestamp = df_src_grp_hr_d.select(
df_src_grp_hr_d['src_ip'],
df_src_grp_hr_d['year'],
df_src_grp_hr_d['month'],
df_src_grp_hr_d['day'],
df_src_grp_hr_d['hour'],
df_src_grp_hr_d['time'].cast('timestamp'))
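A quick printSchema() should confirm the cast (a sketch of the expected output; the casted column keeps the name time):
df_src_grp_hr_to_timestamp.printSchema()
# root
#  |-- src_ip: string (nullable = true)
#  |-- year: integer (nullable = true)
#  |-- month: integer (nullable = true)
#  |-- day: integer (nullable = true)
#  |-- hour: integer (nullable = true)
#  |-- time: timestamp (nullable = true)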
Answer 0 (score: 1)
Well... I think I understand the problem now... The cause is simply that my dataFrame loads a lot of data in memory, which makes the show() action fail.
I realized that what was raising the exception:
Py4JJavaError: An error occurred while calling o2108.showString.
: java.lang.UnsupportedOperationException: Cannot evaluate expression:
was actually the df.show() action.
I could confirm it by executing the code snippet from this question: Convert pyspark string to date format
from datetime import datetime
from pyspark.sql.functions import col,udf, unix_timestamp
from pyspark.sql.types import DateType
# Creation of a dummy dataframe:
df1 = sqlContext.createDataFrame([("11/25/1991","11/24/1991","11/30/1991"),
("11/25/1391","11/24/1992","11/30/1992")], schema=['first', 'second', 'third'])
# Setting an user define function:
# This function converts the string cell into a date:
func = udf(lambda x: datetime.strptime(x, '%m/%d/%Y'), DateType())  # %m is month; the original snippet's %M (minute) parses the dates incorrectly
df = df1.withColumn('test', func(col('first')))
df.show()
df.printSchema()
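The printed schema should show the new date column (a sketch of the expected output):
root
 |-- first: string (nullable = true)
 |-- second: string (nullable = true)
 |-- third: string (nullable = true)
 |-- test: date (nullable = true)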
Which works! But it still didn't work with my dataFrame df_src.
The reason is that I load a lot of data into memory from my database server (more than 8 to 9 million rows), and it looks like Spark is unable to perform show() (which displays 20 entries by default) on the result of the udf over the loaded dataFrame.
Even calling show(n=1) throws the same exception.
However, if printSchema() is called, you can see that the new column has effectively been added.
One way to check whether the new column was added is simply to call the action print dataFrame.take(10) instead.
Finally, the way to make it work was to assign the result to a new dataframe rather than calling .show() directly on the select() with the udf:

df_to_string = df_src.select('*',
    u_parse_df_to_string(df_src['year'], df_src['month'], df_src['day'], df_src['hour'], df_src['minute'])
)

then cache it:

df_to_string.cache()

Now .show() can be called without any issue:
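For example, on the cached dataframe:
df_to_string.show()
# displays the first 20 rows, including the new column produced by the udf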