I am trying to filter a Spark DataFrame based on whether the value in a column equals a list. I would like to do something like this:
filtered_df = df.where(df.a == ['list','of' , 'stuff'])
where filtered_df contains only the rows in which filtered_df.a has the value ['list','of' , 'stuff'], and a has type array (nullable = true).
Answer 0 (score: 8)
You can create a udf that compares the value of the array column with the target list and returns a boolean, then pass the result to where.
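A minimal sketch of the udf approach, under stated assumptions: the names `list_equals` and `make_list_filter` are hypothetical and not part of PySpark (only `udf` and `BooleanType` are real PySpark names), and the column is assumed to hold an array that the udf receives as a Python list.

```python
def list_equals(value, target):
    """Return True when an array cell equals the target list, element by element."""
    # A null array arrives as None inside a Python udf; treat it as "not equal".
    if value is None:
        return False
    return list(value) == list(target)


def make_list_filter(target):
    """Wrap list_equals in a Spark udf (hypothetical helper; needs pyspark installed)."""
    # Imported lazily so the comparison logic can be exercised without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import BooleanType

    return udf(lambda value: list_equals(value, target), BooleanType())


# Assumed usage with the DataFrame `df` from the question:
# filtered_df = df.where(make_list_filter(['list', 'of', 'stuff'])(df.a))
```

Keep in mind that a Python udf sends every row through the Python interpreter, so it is generally slower than an expression built from native column functions.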
Answer 1 (score: 7)
Update:
With current versions you can use an array literal:
from pyspark.sql.functions import array, lit
df.where(df.a == array(*[lit(x) for x in ['list','of' , 'stuff']]))
Original answer:
Well, a slightly hacky way, which doesn't require a Python batch job, is something like this:
from pyspark.sql.functions import col, lit, size
from functools import reduce
from operator import and_
def array_equal(c, an_array):
    same_size = size(c) == len(an_array)  # Check if the same size
    # Check if all items equal
    same_items = reduce(
        and_,
        (c.getItem(i) == an_array[i] for i in range(len(an_array)))
    )
    return and_(same_size, same_items)
A quick test:
df = sc.parallelize([
    (1, ['list','of' , 'stuff']),
    (2, ['foo', 'bar']),
    (3, ['foobar']),
    (4, ['list','of' , 'stuff', 'and', 'foo']),
    (5, ['a', 'list','of' , 'stuff']),
]).toDF(['id', 'a'])
df.where(array_equal(col('a'), ['list','of' , 'stuff'])).show()
## +---+-----------------+
## | id|                a|
## +---+-----------------+
## |  1|[list, of, stuff]|
## +---+-----------------+
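Answer 2 (score: 1)
You can achieve this by combining the array, lit and array_except functions.
The expression lit(array(lit("list"),lit("of"),lit("stuff"))) builds an array column equivalent to ["list", "of", "stuff"].
Note: the array_except function is available from Spark 2.4.0.
The code is as follows: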
# Import only the functions that are used
from pyspark.sql.functions import array, array_except, lit, size
# Create DataFrame
df = sc.parallelize([
    (1, ['list','of' , 'stuff']),
    (2, ['foo', 'bar']),
    (3, ['foobar']),
    (4, ['list','of' , 'stuff', 'and', 'foo']),
    (5, ['a', 'list','of' , 'stuff']),
]).toDF(['id', 'a'])
# Solution: keep rows where a has no element outside the target array.
# Note: this does not check element order or array length.
df1 = df.filter(size(array_except(df["a"], lit(array(lit("list"),lit("of"),lit("stuff"))))) == 0)
# Display result
df1.show()
+---+-----------------+
| id|                a|
+---+-----------------+
|  1|[list, of, stuff]|
+---+-----------------+
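One caveat about this filter: array_except returns the distinct elements of the first array that are absent from the second, so an empty result only shows that every element of a appears in the target. The plain-Python model below illustrates this; `array_except_model` and `passes_filter` are hypothetical names used for illustration, not Spark functions, and the model ignores null handling.

```python
def array_except_model(left, right):
    """Plain-Python model of array_except: distinct items of left that are not in right."""
    result = []
    for item in left:
        if item not in right and item not in result:
            result.append(item)
    return result


def passes_filter(a, target):
    """Model of the condition size(array_except(a, target)) == 0."""
    return len(array_except_model(a, target)) == 0


target = ["list", "of", "stuff"]

exact_match = passes_filter(["list", "of", "stuff"], target)  # True, as intended
extra_items = passes_filter(["a", "list", "of", "stuff"], target)  # False, as intended
subset_only = passes_filter(["list"], target)  # True, although the arrays differ
reordered = passes_filter(["of", "list", "stuff"], target)  # True, although the order differs
```

The sample data contains no row that is a subset or a reordering of the target, which is why the output above shows only row 1. If strict equality is required, the array literal comparison from Answer 1 is likely the safer choice.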
I hope this helps.