How to join DataFrames and take the latest row by timestamp?

Time: 2017-11-13 15:22:25

Tags: python apache-spark pyspark

I have two PySpark DataFrames, df1 and df2:

df1 =
col1 col2
AA   11
BB   22

df2 =
timestamp  col1 col2 col3
1510586134 AA   11   3
1510586140 AA   11   2
1510586200 AA   11   5
1510586134 BB   22   3

df2 has a timestamp column, whereas df1 does not.

How can I join the DataFrames using only the latest row of df2 according to timestamp?

The result should be as follows:

col1 col2 col3
AA   11   5
BB   22   3


1 Answer:

Answer 0: (score: 1)

Hope this helps!

from pyspark.sql.functions import col, rank
from pyspark.sql.window import Window

# sample data (`sc` is the SparkContext, e.g. the one provided by the pyspark shell)
df1 = sc.parallelize([
    ['AA', 11],
    ['BB', 22]
]).toDF(('col1', 'col2'))
df2 = sc.parallelize([
    [1510586134, 'AA', 11, 3],
    [1510586140, 'AA', 11, 2],
    [1510586200, 'AA', 11, 5],
    [1510586134, 'BB', 22, 3]
]).toDF(('timestamp', 'col1', 'col2', 'col3'))

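# select the latest row of df2 per (col1, col2) according to timestamp
# note: rank() gives tied rows the same rank, so rows sharing the latest timestamp are all kept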
df2_temp = df2.withColumn('timestamp_format_col', col('timestamp').cast("timestamp"))
window = Window.partitionBy('col1','col2').\
    orderBy(col('timestamp_format_col').desc())
df2_temp = df2_temp.\
    select('*', rank().over(window).alias('rank')).\
    filter(col('rank')==1).\
    drop('rank','timestamp','timestamp_format_col')

# final result: inner join of df1 with the latest rows on the key columns
df = df1.join(df2_temp, ['col1', 'col2'], 'inner')
df.show()

The output is:

+----+----+----+
|col1|col2|col3|
+----+----+----+
|  BB|  22|   3|
|  AA|  11|   5|
+----+----+----+
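
To make the logic of the window-and-join approach easier to check, here is a minimal plain-Python sketch of the same idea on the sample data: keep the df2 row with the greatest timestamp for each (col1, col2), then inner-join with df1. It is a model of the semantics only, not PySpark code. The helper names `latest_per_key` and `join_latest` and the `keep_ties` flag are hypothetical, introduced for illustration. `keep_ties=True` is meant to mirror filtering on `rank() == 1`, which keeps every row tied for the latest timestamp; `keep_ties=False` mirrors `row_number() == 1` (also available in `pyspark.sql.functions`), which keeps exactly one row per key.

```python
# Plain-Python model of "latest df2 row per key, then inner join with df1".
# latest_per_key, join_latest and keep_ties are hypothetical names for illustration.

def latest_per_key(rows, keep_ties=True):
    """Map each (col1, col2) to its row(s) with the greatest timestamp.

    keep_ties=True  models rank() == 1: all rows tied for the latest timestamp stay.
    keep_ties=False models row_number() == 1: one row per key stays
    (here the first one seen; in Spark the choice among ties is not defined).
    """
    best = {}
    for ts, col1, col2, col3 in rows:
        key = (col1, col2)
        if key not in best or ts > best[key][0][0]:
            # Strictly newer timestamp: replace whatever was stored.
            best[key] = [(ts, col1, col2, col3)]
        elif ts == best[key][0][0] and keep_ties:
            # Same timestamp as the current latest: keep both rows.
            best[key].append((ts, col1, col2, col3))
    return best


def join_latest(df1_rows, df2_rows, keep_ties=True):
    """Inner-join df1 rows with the latest df2 rows on (col1, col2)."""
    latest = latest_per_key(df2_rows, keep_ties)
    result = []
    for col1, col2 in df1_rows:
        # Keys missing from df2 produce no output row (inner join).
        for _ts, _c1, _c2, col3 in latest.get((col1, col2), []):
            result.append((col1, col2, col3))
    return result


df1_rows = [("AA", 11), ("BB", 22)]
df2_rows = [
    (1510586134, "AA", 11, 3),
    (1510586140, "AA", 11, 2),
    (1510586200, "AA", 11, 5),
    (1510586134, "BB", 22, 3),
]

result = join_latest(df1_rows, df2_rows)
print(result)  # [('AA', 11, 5), ('BB', 22, 3)]
```

If df2 can contain several rows with the same latest timestamp for one key, the two modes differ: the rank-style variant returns one joined row per tied df2 row, while the row_number-style variant returns a single row per key.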