Split a PySpark RDD into different columns and convert to a DataFrame

Date: 2018-04-04 08:26:53

Tags: python apache-spark dataframe pyspark rdd

I have an RDD:

a,1,2,3,4
b,4,6
c,8,9,10,11

I want to convert it into a Spark DataFrame with an index column:

df:

Index  Name  Number
 0      a     1,2,3,4
 1      b     4,6
 2      c     8,9,10,11

I tried splitting the RDD:

parts = rdd.flatMap(lambda x: x.split(","))

But the result is:

a,
1,
2,
3,...

How can I split the RDD and convert it to a DataFrame in PySpark so that the first element becomes the first column and the remaining elements are combined into a single column?

As mentioned in the solution:

rd = rd1.map(lambda x: x.split(",", 1)).zipWithIndex()
rd.take(3)

Output:

[(['a', '1,2,3,4'], 0),
 (['b', '4,6'], 1),
 (['c', '8,9,10,11'], 2)]

Next step:

rd2 = rd.map(lambda x, y: (y, x[0], x[1])).toDF(["index", "name", "number"])
rd2.collect()

I get an error when I run this.

Is this a version issue?
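
For context: `RDD.map` calls its function with a single argument per element, so a two-argument lambda fails under any Python version; each `(['a', '1,2,3,4'], 0)` pair arrives as one tuple and has to be unpacked inside the function. A minimal sketch that avoids the error, assuming an active SparkContext bound to `sc` and a SparkSession for `toDF` (hypothetical setup, not from the original post):

# Hypothetical setup so the sketch is self-contained
rd1 = sc.parallelize(["a,1,2,3,4", "b,4,6", "c,8,9,10,11"])

rd = rd1.map(lambda x: x.split(",", 1)).zipWithIndex()

# map() hands each ([name, number], index) pair to the lambda as ONE
# argument, so index into the tuple instead of declaring two parameters
df = rd.map(lambda t: (t[1], t[0][0], t[0][1])).toDF(["index", "name", "number"])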

1 Answer:

Answer 0 (score: 5)

Transform the RDD as below and you will be good to go.

df = (rdd.map(lambda x: x.split(",", 1))       # Split only at the first occurrence of ","
         .zipWithIndex()                       # Pair each element with an incrementing index
         .map(lambda (x, y): (y, x[0], x[1]))  # Flatten the (value, index) pair (Python 2 tuple unpacking)
         .toDF(["index", "name", "number"]))   # Convert to a DataFrame

df.show()

#+-----+----+---------+
#|index|name|   number|
#+-----+----+---------+
#|    0|   a|  1,2,3,4|
#|    1|   b|      4,6|
#|    2|   c|8,9,10,11|
#+-----+----+---------+
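
Note that `lambda (x, y)` is Python 2 tuple-parameter syntax, removed in Python 3 by PEP 3113, which is likely the version issue the question ran into. A sketch of the same chain written for Python 3, indexing into the tuple instead of unpacking it in the signature:

df = (rdd.map(lambda x: x.split(",", 1))           # Split only at the first occurrence of ","
         .zipWithIndex()                           # Pair each element with an incrementing index
         .map(lambda t: (t[1], t[0][0], t[0][1]))  # Index the (value, index) tuple explicitly
         .toDF(["index", "name", "number"]))       # Convert to a DataFrame

df.show()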