I have two PySpark DataFrames, df1 and df2, that I want to join.
df1 =
col1 col2
AA 11
BB 22
df2 =
timestamp col1 col2 col3
1510586134 AA 11 3
1510586140 AA 11 2
1510586200 AA 11 5
1510586134 BB 22 3
df2 has a timestamp column, while df1 does not.
How can I join the DataFrames using the latest row of df2 according to timestamp?
The result should be as follows:
col1 col2 col3
AA 11 5
BB 22 3
Answer 0 (score: 1)
Hope this helps!
from pyspark.sql.functions import col, rank
from pyspark.sql.window import Window

# sample data (assumes `sc` is an existing SparkContext, as in the pyspark shell)
df1 = sc.parallelize([
    ['AA', 11],
    ['BB', 22]
]).toDF(('col1', 'col2'))
df2 = sc.parallelize([
    [1510586134, 'AA', 11, 3],
    [1510586140, 'AA', 11, 2],
    [1510586200, 'AA', 11, 5],
    [1510586134, 'BB', 22, 3]
]).toDF(('timestamp', 'col1', 'col2', 'col3'))

# select the latest row of df2 for each (col1, col2) pair according to timestamp
# note: rank() keeps every row that ties on the latest timestamp
df2_temp = df2.withColumn('timestamp_format_col', col('timestamp').cast("timestamp"))
window = Window.partitionBy('col1', 'col2').\
    orderBy(col('timestamp_format_col').desc())
df2_temp = df2_temp.\
    select('*', rank().over(window).alias('rank')).\
    filter(col('rank') == 1).\
    drop('rank', 'timestamp', 'timestamp_format_col')

# final result
df = df1.join(df2_temp, ['col1', 'col2'], 'inner')
df.show()
The output is:
+----+----+----+
|col1|col2|col3|
+----+----+----+
|  BB|  22|   3|
|  AA|  11|   5|
+----+----+----+
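To make the selection logic explicit, here is a minimal plain-Python sketch (no Spark required) of the same idea: for each (col1, col2) key keep the row with the largest timestamp, then inner-join against the keys of df1. The function names `latest_per_key` and `join_latest` and the tuple layout are assumptions made for illustration; they are not part of PySpark or of the answer above. One difference worth noting: this sketch keeps exactly one row per key (the first one seen on a timestamp tie), which is closer to what `row_number()` would give in Spark, whereas `rank()` keeps every tied row.

```python
def latest_per_key(rows):
    """Keep, for each (col1, col2) key, the row with the largest timestamp.

    rows: iterable of (timestamp, col1, col2, col3) tuples.
    On a timestamp tie, the first row seen wins.
    """
    latest = {}
    for ts, c1, c2, c3 in rows:
        key = (c1, c2)
        if key not in latest or ts > latest[key][0]:
            latest[key] = (ts, c3)
    return latest


def join_latest(left_rows, right_rows):
    """Inner-join (col1, col2) rows with the latest col3 for the same key."""
    latest = latest_per_key(right_rows)
    return [
        (c1, c2, latest[(c1, c2)][1])
        for c1, c2 in left_rows
        if (c1, c2) in latest  # inner join: drop keys missing on the right
    ]


# Same sample data as in the question
df1_rows = [("AA", 11), ("BB", 22)]
df2_rows = [
    (1510586134, "AA", 11, 3),
    (1510586140, "AA", 11, 2),
    (1510586200, "AA", 11, 5),
    (1510586134, "BB", 22, 3),
]

result = join_latest(df1_rows, df2_rows)
print(result)  # [('AA', 11, 5), ('BB', 22, 3)]
```

If ties on timestamp can occur in your data and you need exactly one row per key in Spark, swapping `rank()` for `row_number()` in the window expression is one option; which tied row is kept is then arbitrary unless you add a further ordering column.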