Creating dynamic tables from XPATH rows using pyspark

Date: 2018-07-28 17:11:28

Tags: dataframe pyspark

I am new to pyspark (version 1.6). I have a flat file containing XPATH-style keys and their values, one pair per line:

root/catalog~1/product~1/arrived-on~1='today'
root/catalog~1/product~1/arrived-for~1='gha'
root/catalog~1/product~1/catalog-item~1/price~1='39.95'
root/catalog~1/product~1/catalog-item~1/name~1='A'
root/catalog~1/product~1/catalog-item~2/name~1='C'
root/catalog~1/product~1/catalog-item~2/price~1='49.95'
root/catalog~1/product~1/catalog-item~3/name~1='B'
root/catalog~1/product~1/catalog-item~3/price~1='100'
root/catalog~1/product~2/arrived-on~1='yesterday'
root/catalog~1/product~2/arrived-for~1='gha'
root/catalog~1/product~2/catalog-item~1/price~1='399.95'
root/catalog~1/product~2/catalog-item~1/name~1='AA'
root/catalog~1/product~2/catalog-item~2/name~1='CC'
root/catalog~1/product~2/catalog-item~2/price~1='939.95'
root/catalog~1/product~2/catalog-item~3/name~1='AB'
root/catalog~1/product~2/catalog-item~3/price~1='239'
root/catalog~2/product~1/arrived-on~1='Monday'
root/catalog~2/product~1/arrived-for~1='raf'
root/catalog~2/product~1/catalog-item~1/price~1='70'
root/catalog~2/product~1/catalog-item~1/name~1='AAA'
root/catalog~2/product~1/catalog-item~2/name~1='CCCC'
root/catalog~2/product~1/catalog-item~2/price~1='49.95'
root/catalog~2/product~1/catalog-item~2/anytag~1='anytag'
root/catalog~2/product~1/catalog-item~3/name~1='B'
root/catalog~2/product~1/catalog-item~3/price~1='D'
root/catalog~2/product~2/arrived-on~1='today'
root/catalog~2/product~2/arrived-for~1='jaf'
root/catalog~2/product~2/catalog-item~1/price~1='39.95'
root/catalog~2/product~2/catalog-item~1/name~1='A'
root/catalog~2/product~2/catalog-item~2/name~1='C'
root/catalog~2/product~2/catalog-item~2/price~1='39.95'
root/catalog~2/product~2/catalog-item~2/extratag~1='only on this'

I would like to convert this into tables, where the element just before the = is the column name and the element before that is the table name.

In this case I would end up with 2 tables.

Table 1 - catalog table

catalog  product  arrived-on  arrived-for
1        1        today       gha
1        2        yesterday   gha
2        1        Monday      raf
2        2        today       jaf

Table 2 - catalog-item table

catalog  product  catalog-item  name  price   extratag      anytag
1        1        1             A     39.95   NULL          NULL
1        1        2             C     49.95   NULL          NULL
1        1        3             B     100     NULL          NULL
1        2        1             AA    399.95  NULL          NULL
1        2        2             CC    939.95  NULL          NULL
1        2        3             AB    239     NULL          NULL
2        1        1             AAA   70      NULL          NULL
2        1        2             CCCC  49.95   NULL          anytag
2        1        3             B     D       NULL          NULL
2        2        1             A     39.95   NULL          NULL
2        2        2             C     39.95   only on this  NULL

What I have done so far is split the xpath values into multiple columns:

import pyspark.sql.functions as f

# split each line into the xpath key and its value
myrdd = sc.textFile("<filename>").map(lambda line: line.split("="))
df = myrdd.toDF(['xpath', 'value'])

# the last path element is the column name, the one before it the table name
df = df.withColumn('columnname', f.reverse(f.split(f.reverse(df.xpath), "/")[0]))
df = df.withColumn('tablename', f.reverse(f.split(f.reverse(df.xpath), "/")[1]))

# hard-coded split of the first four hierarchy levels and their indices
df = df.withColumn('hirearchy_1', f.split(f.col('xpath'), "/")[1])
df = df.withColumn('hirearchy_2', f.split(f.col('xpath'), "/")[2])
df = df.withColumn('hirearchy_3', f.split(f.col('xpath'), "/")[3])
df = df.withColumn('hirearchy_4', f.split(f.col('xpath'), "/")[4])
df = df.withColumn('layer_2', f.split(f.col('hirearchy_2'), "~")[1])
df = df.withColumn('layer_3', f.split(f.col('hirearchy_3'), "~")[1])
df = df.withColumn('layer_4', f.split(f.col('hirearchy_4'), "~")[1])
df = df.withColumn('hirearchy_1', f.split(f.col('hirearchy_1'), "~")[0])
df = df.withColumn('hirearchy_2', f.split(f.col('hirearchy_2'), "~")[0])
df = df.withColumn('hirearchy_3', f.split(f.col('hirearchy_3'), "~")[0])
df = df.withColumn('hirearchy_4', f.split(f.col('hirearchy_4'), "~")[0])

I want to take this further and split dynamically, so that every / and ~ acts as a delimiter and each piece ends up in its own column.
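Something along these lines is what I mean by a dynamic split (just a sketch, not working end-to-end; path_depth and the level_* column names are made up for illustration):

import pyspark.sql.functions as f

path_depth = 5          # illustrative: root plus up to 4 levels in the sample data
parts = f.split(f.col('xpath'), '/')
for i in range(1, path_depth):
    # each level yields a name column ("catalog") and an index column ("1");
    # levels that do not exist for a given row simply come out as null
    df = df.withColumn('level_%d_name' % i, f.split(parts[i], '~')[0])
    df = df.withColumn('level_%d_index' % i, f.split(parts[i], '~')[1])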

I would then like to select the columnname, tablename and hierarchy columns, transpose them together with the layer values, and bring everything into the required shape. I have only listed the first few hierarchy levels here; I am looking for something generic that will add the columns dynamically, do the transpose, and give me the required tables.

Any pointers, suggestions or answers would be highly appreciated.

1 answer:

Answer 0 (score: 0)

We can do more of the cleanup while the data is still an RDD, and then do the rest of the work on the DataFrame.

# drop the "1=" so that each leaf node also splits into (name, value) on '~',
# just like the key levels
myrdd=sc.textFile("data/xpath.txt").map(lambda x: x.replace('1=',''))

def arr(x): 
    return 'catalog-item' not in x 

def cat(x): 
    return 'catalog-item' in x
# Maybe someone can write better definitions of the following
def makeUpArr(x,head):
    line = ''
    for i in head: 
        found = ''
        for k in x:
            if k[0] == i:
                found = k[1]
        line += found  + ','
    return line[:-1]
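
# Illustrative example of what makeUpArr produces: with
#   head = ['catalog', 'product', 'arrived-on', 'arrived-for'] and
#   x = [('catalog', '1'), ('product', '1'), ('arrived-on', "'today'")]
# it returns "1,1,'today'," -- one CSV line ordered by the header,
# with blanks for columns missing from this line.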

def lastCol(y):
    line = ''
    for i in range(len(y)):
        if i%2 == 0:
            line += '(' + y[i] + ',' 
        else:
            line +=  y[i] + '),' 
    return line[:-1]


rdd_arr, rdd_cat = (myrdd.filter(f) for f in (arr , cat))
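# rdd_arr now holds the product-level lines (arrived-on / arrived-for),
# rdd_cat the catalog-item lines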
import pyspark.sql.functions as F
from pyspark.sql.functions import concat, col, lit

#--------- build the product-level (arrived-*) table; its columns are derived from the data
rdd_arrHead = rdd_arr.map(lambda x : [i.split('~')[0] for i in x.split('/') if (i != 'root')]).flatMap(lambda x : x)
header_arr = [i for i in rdd_arrHead.distinct().collect()]

rdd_arrData = rdd_arr.map(lambda x : [(i.split('~')[0] ,i.split('~')[1] ) for i in x.split('/') if ( i != 'root')]).map(lambda x : makeUpArr(x,header_arr)).map(lambda x : x.split(','))
dfArr = sqlContext.createDataFrame(rdd_arrData, header_arr)
#--------- build the catalog-item table the same way
rdd_catHead = rdd_cat.map(lambda x : [i.split('~')[0] for i in x.split('/') if (i != 'root')]).flatMap(lambda x : x)
header_cat = [i for i in rdd_catHead.distinct().collect()]

rdd_catData = rdd_cat.map(lambda x : [(i.split('~')[0] ,i.split('~')[1] ) for i in x.split('/') if ( i != 'root')]).map(lambda x : makeUpArr(x,header_cat)).map(lambda x : x.split(','))
dfData = sqlContext.createDataFrame(rdd_catData, header_cat)
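
# Each parsed line becomes one sparse row (its own value filled in, blanks
# everywhere else), so grouping on the key columns and taking max() of every
# other column collapses those sparse rows into one row per entity: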

#dfArr.groupBy(['catalog', 'product']).agg(F.max(col('arrived-on')).alias('arrived-on'), F.max(col('arrived-for')).alias('arrived-for')).show()
>>> dfArr.groupBy(['catalog', 'product']).agg(*[F.max(dfArr[i]) for i in header_arr if i != 'catalog' and i != 'product'  ]).show()
+-------+-------+-----------+-----------+
|catalog|product| arrived-on|arrived-for|
+-------+-------+-----------+-----------+
|      1|      1|    'today'|      'gha'|
|      2|      2|    'today'|      'jaf'|
|      1|      2|'yesterday'|      'gha'|
|      2|      1|   'Monday'|      'raf'|
+-------+-------+-----------+-----------+

#dfData.groupBy(['catalog', 'product','catalog-item']).agg(F.max(col('name')).alias('name'),F.max(col('price')).alias('price'),F.max(col('extratag')).alias('extratag'), F.max(col('Anytag')).alias('Anytag')).show()
>>> dfData.groupBy(['catalog', 'product','catalog-item']).agg(*[F.max(dfData[i]) for i in header_cat if i != 'catalog' and i != 'product' and i != 'catalog-item'  ]).show()

+-------+-------+------------+------+--------+--------------+--------+
|catalog|product|catalog-item|  name|   price|      extratag|  Anytag|
+-------+-------+------------+------+--------+--------------+--------+
|      1|      1|           3|   'B'|   '100'|              |        |
|      2|      2|           1|   'A'| '39.95'|              |        |
|      1|      1|           1|   'A'| '39.95'|              |        |
|      2|      1|           3|   'B'|     'D'|              |        |
|      1|      2|           1|  'AA'|'399.95'|              |        |
|      1|      2|           2|  'CC'|'939.95'|              |        |
|      2|      1|           2|'CCCC'| '49.95'|              |'anytag'|
|      1|      1|           2|   'C'| '49.95'|              |        |
|      1|      2|           3|  'AB'|   '239'|              |        |
|      2|      1|           1| 'AAA'|    '70'|              |        |
|      2|      2|           2|   'C'| '39.95'|'only on this'|        |
+-------+-------+------------+------+--------+--------------+--------+
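
The values above still carry their surrounding single quotes, and missing tags show up as empty strings rather than NULL. If that matters, a small post-processing pass can clean it up; this is only a sketch on top of the answer above (regexp_replace and when are standard pyspark functions and exist in 1.6):

from pyspark.sql import functions as F

def tidy(df):
    # strip the surrounding quotes and turn empty strings into real NULLs
    for c in df.columns:
        stripped = F.regexp_replace(F.col(c), "^'|'$", '')
        df = df.withColumn(c, F.when(stripped == '', F.lit(None)).otherwise(stripped))
    return df

dfArr, dfData = tidy(dfArr), tidy(dfData)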