I have a 500 x 26 array. Using the filter operation in PySpark, I would like to pick out the columns that are listed at row i of another array. For example, if
a[i] = [1 2 3]
then pick out columns 1, 2, and 3 along with all rows. Can this be done with the filter command? If so, can someone show an example or the syntax?
Answer 0 (score: 2)
Sounds like you need to filter columns, but not records. To do this you need to use Spark's map function, which transforms every row of your array represented as an RDD. See my example:
# generate a 13 x 10 array and create an RDD with 13 records,
# each record containing a list of 10 elements
rdd = sc.parallelize([range(10) for i in range(13)])

def make_selector(cols):
    """Use a closure to configure the select_cols function.

    :param cols: list - column indexes to select from every record
    """
    def select_cols(record):
        return [record[c] for c in cols]
    return select_cols

# quick check of the selector on a plain list
s = make_selector([1, 2])
s([0, 1, 2])
# >>> [1, 2]

rdd.map(make_selector([0, 3, 9])).take(5)
results in
[[0, 3, 9], [0, 3, 9], [0, 3, 9], [0, 3, 9], [0, 3, 9]]
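
If, as in the question, every row i should use its own column list a[i], one option is to pair each record with its list of indexes via RDD.zip. This is a minimal sketch, not from the original answer: cols_per_row is a hypothetical RDD of per-row index lists, and zip assumes both RDDs have the same length and partitioning (which two parallelize calls over equally long lists give you):

# hypothetical per-row column lists; every inner list could differ,
# as long as the RDD lines up element-for-element with rdd
cols_per_row = sc.parallelize([[0, 3, 9] for i in range(13)])

# zip produces (record, cols) pairs; select each record's own columns
rdd.zip(cols_per_row).map(lambda rc: [rc[0][c] for c in rc[1]]).take(5)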
Answer 1 (score: 0)
This is basically the same as @vvladymyrov's answer, but without the closure:
rdd = sc.parallelize([range(10) for i in range(13)])
columns = [0,3,9]
rdd.map(lambda record: [record[c] for c in columns]).take(5)
results in
[[0, 3, 9], [0, 3, 9], [0, 3, 9], [0, 3, 9], [0, 3, 9]]
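
As for the filter command the question asks about: Spark's filter keeps or drops whole records, so it cannot drop columns by itself, but a plain Python comprehension that filters elements by index inside map gives the same selection. A minimal sketch reusing the columns list from above:

# the filtering here is ordinary Python over each record's indexes,
# not Spark's RDD filter (which operates on whole records)
wanted = set(columns)
rdd.map(lambda record: [v for i, v in enumerate(record) if i in wanted]).take(5)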