Getting java.lang.UnsupportedOperationException: Cannot evaluate expression in PySpark

Asked: 2017-03-02 08:09:22

Tags: apache-spark pyspark apache-spark-sql udf pyspark-sql

In the middle of my project I ran into this UnsupportedOperationException. Here is my scenario: I created a UDF called filter and registered it as fnGetChargeInd. The function takes four parameters: a unicode timestamp (already converted to datetime type in the query), a string frequency, a string begmonth, and a string currperiod. From these it calculates chargeAmt and returns an IntegerType value. Here is my UDF code:

def filter(startdate, frequency, begmonth, testperiod):
    # note: this shadows Python's built-in filter()
    # build the current period string (YYYYMM) from the start date
    startdatestring = startdate.strftime("%Y-%m-%d")
    # print "startdatestring->", startdatestring
    startdateyearstring = startdatestring[0:4]
    startdatemonthstring = startdatestring[5:7]
    # print "startdateyearstring->", startdateyearstring
    startdateyearint = int(startdateyearstring)
    startdatemonthint = int(startdatemonthstring)
    # print "startdateyearint is->", startdateyearint
    # print "startdateyearinttype", type(startdateyearint)
    currYear = startdateyearint
    currMonth = startdatemonthint
    currperiod = startdateyearstring + startdatemonthstring
    # monthly charges are always included
    if frequency == 'M':
        return 1
    # note the operator precedence: this condition parses as
    # frequency == 'S' or (frequency == 'A' and begmonth is not None)
    if frequency == 'S' or (frequency == 'A' and begmonth is not None):
        currMonth = int(begmonth)
        print "in if statement", currMonth
    # check nextperiod calculation
    if currperiod == testperiod:
        return 1
    if currperiod > testperiod:
        return 0
    # advance the month according to the charge frequency
    if frequency == 'Q':
        currMonth = currMonth + 3
    if frequency == 'S':
        currMonth = currMonth + 1
    # roll the month over into the next year
    if currMonth > 12:
        currMonth = currMonth - 12
        currYear = currYear + 1
    return 0
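
As a quick sanity check (the sample inputs here are made up, not from the original post), the logic can be exercised in plain Python before registering it as a UDF:

import datetime

filter(datetime.datetime(2017, 3, 1), 'M', None, '201701')   # 1: monthly is always charged
filter(datetime.datetime(2017, 3, 1), 'Q', None, '201703')   # 1: currperiod equals testperiod
filter(datetime.datetime(2017, 3, 1), 'Q', None, '201701')   # 0: currperiod '201703' > testperiod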

Here is my TimestampConversion code, which converts the unicode string into a datetime:

import datetime

def StringtoTimestamp(datetext):
    if datetext is None:
        return None
    else:
        datevalue = datetime.datetime.strptime(datetext, "%b %d %Y %H:%M:%S:%f%p")
        return datevalue
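
For reference, a date string in that format (a made-up sample value) parses like this:

StringtoTimestamp("Mar 01 2017 10:30:00:000000AM")
# -> datetime.datetime(2017, 3, 1, 10, 30)

Note that because the format uses %H (24-hour clock), strptime parses the trailing AM/PM marker but ignores it.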

from pyspark.sql.types import IntegerType, TimestampType

spark.udf.register('TimestampConvert', lambda datetext: StringtoTimestamp(datetext), TimestampType())

spark.udf.register("fnGetChargeInd", lambda x, y, z, timeperiod: filter(x, y, z, timeperiod), IntegerType())

Now I ran the query that calculates chargeAmt for the table:

spark.sql("select b.ENTITYID as ENTITYID, cm.BLDGID as BldgID,cm.LEASID as LeaseID,coalesce(l.SUITID,(select EmptyDefault from EmptyDefault)) as SuiteID,(select CurrDate from CurrDate) as TxnDate,cm.INCCAT as IncomeCat,'??' as SourceCode,(Select CurrPeriod from CurrPeriod)as Period,coalesce(case when cm.DEPARTMENT ='@' then 'null' else cm.DEPARTMENT end, null) as Dept,'Lease' as ActualProjected ,fnGetChargeInd(TimestampConvert(cm.EFFDATE),cm.FRQUENCY,cm.BEGMONTH,('select CurrPeriod from CurrPeriod'))*coalesce (cm.AMOUNT,0) as  ChargeAmt,0 as OpenAmt,cm.CURRCODE as CurrencyCode,case when ('PERIOD.DATACLSD') is null then 'Open' else 'Closed' end as GLClosedStatus,'Unposted'as GLPostedStatus ,'Unpaid' as PaidStatus,cm.FRQUENCY as Frequency,0 as RetroPD from CMRECC cm join BLDG b on cm.BLDGID =b.BLDGID join LEAS l on cm.BLDGID =l.BLDGID and cm.LEASID =l.LEASID and (l.VACATE is null or l.VACATE >= ('select CurrDate from CurrDate')) and (l.EXPIR >= ('select CurrDate from CurrDate') or l.EXPIR < ('select RunDate from RunDate')) left outer join PERIOD on b.ENTITYID =  PERIOD.ENTITYID and ('select CurrPeriod from CurrPeriod')=PERIOD.PERIOD where ('select CurrDate from CurrDate')>=cm.EFFDATE  and (select CurrDate from CurrDate) <= coalesce(cm.EFFDATE,cast(date_add(( select min(cm2.EFFDATE) from CMRECC cm2 where cm2.BLDGID = cm.BLDGID and cm2.LEASID = cm.LEASID and cm2.INCCAT = cm.INCCAT and 'cm2.EFFDATE' > 'cm.EFFDATE'),-1) as timestamp)  ,case when l.EXPIR <(select RunDate from RunDate)then (Select RunDate from RunDate) else l.EXPIR end)").show()

It calculated chargeAmt perfectly. [screenshot of the result in the original post]

I saved this result in the temporary table Fact_Temp. Now, the problem: I want to query a filtered table, i.e. the data left after deleting the rows where ActualProjected = 'Lease' and ChargeAmt = 0:

spark.sql("select * from Fact_Temp except(select * from Fact_Temp where ActualProjected='Lease' and ChargeAmt='0')").show()

It gave me this exception:

java.lang.UnsupportedOperationException: Cannot evaluate expression: fnGetChargeInd(TimestampConvert(input[0, string, true]), input[1, string, true], input[2, string, true], select CurrPeriod from CurrPeriod)

What I know is that if I run the query without the ChargeAmt condition, it works well:

spark.sql("select * from Fact_Temp except(select * from Fact_Temp where ActualProjected='Lease')").show()

This gave me the expected empty table. Logically, I thought that once the calculation was done, the chargeAmt values were set in the table; I had registered the table, so the values should have been saved. So when I query the saved table, I don't see why the function is being invoked again. I have already seen this post on Stack Overflow: UnsupportedOperationException: Cannot evalute expression: .. when adding new column withColumn() and udf(), but my case is different from that one. I have tried printSchema on the dataframe and I only see the schema of this temp table.

How can I solve this problem? Any guidance would be greatly appreciated. What am I missing in my code here? Please help me. I am using PySpark 2.0. Thanks in advance, Kalyan


1 Answer:

Answer 0 (score: 4):

OK, so far what I have found is that this is a bug in Spark 2.0. The following link solved my problem: https://issues.apache.org/jira/browse/SPARK-17100

I changed from 2.0 to 2.1.0 and it worked for me.
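
For anyone who cannot upgrade, a commonly suggested workaround for this class of error (a sketch under that assumption, not something from the original answer) is to materialize the UDF output with cache() before running the except query, so the filter operates on stored ChargeAmt values instead of re-evaluating the Python UDFs at planning time:

# capture the big ChargeAmt query in a DataFrame first
fact_df = spark.sql("select ...")        # the full query shown above, elided here
fact_df.cache()                          # keep the computed rows in memory
fact_df.count()                          # an action, forces evaluation once
fact_df.createOrReplaceTempView("Fact_Temp")

# Fact_Temp is now backed by materialized rows, so ChargeAmt is a plain
# column and the except query never touches fnGetChargeInd again
spark.sql("select * from Fact_Temp except "
          "(select * from Fact_Temp where ActualProjected='Lease' and ChargeAmt=0)").show()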