I have an RDD where each row has the following structure:
[(id, [listItem, listItem, listItem])]

I have to go through the whole RDD and count the total number of list items. I tried something like this:

theCount = 0
theRDD.foreach(lambda x: theCount = theCount + x[1].count())
return theCount

but Python does not let me assign to theCount inside the lambda. Does anyone know how to achieve this?
Answer 0 (score: 1)
Something like this?
sc.parallelize([('id', [1, 2, 3])]).map(lambda tup: (tup[0], len(tup[1]))).collect()
Output
[('id', 3)]
Spark does not send local variables out with the jobs it runs across the cluster (even when running on a single local node). That is why the syntax you have is not possible.
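If what you want is the total number of list items across the whole RDD rather than one count per id, one option is to map each row to the length of its list and sum the results; a foreach-style version can use a Spark accumulator, since updates to plain local variables never make it back to the driver. A minimal sketch, assuming theRDD holds the (id, [listItem, ...]) pairs from the question:

# Map each row to its list length, then sum the lengths on the driver
total = theRDD.map(lambda x: len(x[1])).sum()

# foreach-style alternative: use an accumulator instead of a local variable
acc = sc.accumulator(0)
theRDD.foreach(lambda x: acc.add(len(x[1])))
total = acc.value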
Answer 1 (score: 0)
Maybe something like this:
from operator import add

# Length of a plain local list
lst = [1, 3, 5, 7, 9]
print("{}".format(len(lst)))
# Count RDD elements: map each element to 1, then sum with reduce
ps_lst = sc.parallelize(lst)
print("{}".format(ps_lst.map(lambda x: 1).reduce(add)))
Output
5
5
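The same reduce pattern applies directly to the RDD from the question; a sketch assuming theRDD holds (id, [listItem, ...]) pairs:

from operator import add

# Map each (id, list) row to the length of its list, then add the lengths up
total_items = theRDD.map(lambda x: len(x[1])).reduce(add)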