I have an RDD:
a,1,2,3,4
b,4,6
c,8,9,10,11

I want to convert it into a Spark DataFrame with an index:
df:
Index  Name  Number
0      a     1,2,3,4
1      b     4,6
2      c     8,9,10,11

I tried splitting the RDD:

parts = rdd.flatMap(lambda x: x.split(","))

but the result is:
a,
1,
2,
3,...

How can I split the RDD and convert it into a DataFrame in PySpark, so that the first element becomes the first column and the remaining elements are merged into a single second column?

As mentioned in the solution:
rd = rd1.map(lambda x: x.split(",", 1)).zipWithIndex()
rd.take(3)
Output:
[(['a', '1,2,3,4'], 0),
(['b', '4,6'], 1),
(['c', '8,9,10,11'], 2)]
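
At this point each element is a ([name, rest], index) pair, so the next map receives that pair as one argument and has to unpack it. A quick check of the structure, using the rd built above:

fields, idx = rd.first()   # fields == ['a', '1,2,3,4'], idx == 0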
Next step:
rd2 = rd.map(lambda x, y: (y, x[0], x[1])).toDF(["index", "name", "number"])
rd2.collect()

But I get an error. Is this a version issue?
Answer 0 (score: 5)
The following RDD transformation will give you exactly what you asked for:
df = (rdd.map(lambda x: x.split(",", 1))           # split only at the first occurrence of ","
         .zipWithIndex()                           # add an incrementing index to each element
         .map(lambda t: (t[1], t[0][0], t[0][1]))  # flatten to (index, name, number)
         .toDF(["index", "name", "number"]))       # convert to a DataFrame
df.show()
#+-----+----+---------+
#|index|name| number|
#+-----+----+---------+
#| 0| a| 1,2,3,4|
#| 1| b| 4,6|
#| 2| c|8,9,10,11|
#+-----+----+---------+
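
Regarding the error in the question: RDD.map calls its function with each element as a single argument, so a two-parameter lambda such as lambda x, y: ... raises a TypeError regardless of the Spark version; indexing the single tuple argument, as above, avoids this. (flatMap, by contrast, flattens every piece returned by split into its own record, which is why the first attempt produced a, 1, 2, ....) A self-contained sketch of the whole pipeline, assuming a SparkSession named spark; the sample data is copied from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize(["a,1,2,3,4", "b,4,6", "c,8,9,10,11"])

df = (rdd.map(lambda x: x.split(",", 1))           # ['a', '1,2,3,4']
         .zipWithIndex()                           # (['a', '1,2,3,4'], 0)
         .map(lambda t: (t[1], t[0][0], t[0][1]))  # (0, 'a', '1,2,3,4')
         .toDF(["index", "name", "number"]))       # same shape as the answer above
df.show()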