Use an RDD of lists as a parameter for a DataFrame filter operation

Posted: 2017-09-15 09:09:56

Tags: pyspark spark-dataframe rdd pyspark-sql apache-spark-2.0

I have the following code snippet:

from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import * 

sc = SparkContext()
spark = SparkSession.builder.appName("test").getOrCreate()

schema = StructType([                                                                           
         StructField("name", StringType(), True),
         StructField("a", StringType(), True),
         StructField("b", StringType(), True),
         StructField("c", StringType(), True),
         StructField("d", StringType(), True),
         StructField("e", StringType(), True),
         StructField("f", StringType(), True)])

arr = [("Alice", "1", "2", None, "red", None, None), \
       ("Bob", "1", None, None, None, None, "apple"), \
       ("Charlie", "2", "3", None, None, None, "orange")]

df = spark.createDataFrame(arr, schema)
df.show()

#+-------+---+----+----+----+----+------+
#|   name|  a|   b|   c|   d|   e|     f|
#+-------+---+----+----+----+----+------+
#|  Alice|  1|   2|null| red|null|  null|
#|    Bob|  1|null|null|null|null| apple|  
#|Charlie|  2|   3|null|null|null|orange|
#+-------+---+----+----+----+----+------+

My objective is to find the names having a given subset of null attributes, where the subsets are provided by an RDD of lists. For the example above, that RDD is:

lrdd = sc.parallelize([['a', 'b'], ['c', 'd', 'e'], ['f']])

Now, I found a rather naive solution: collect the list on the driver, then loop over it, filtering the DataFrame once per subset. This produces the desired result:

{'c,d,e': ['Bob', 'Charlie'], 'f': ['Alice']}

It works, but it is very inefficient. Also consider that the real attribute schema is something like 10,000 attributes, leading to more than 600 disjoint lists in lrdd.
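
For reference, here is a minimal sketch of that naive approach (the helper name get_null_names and the use of functools.reduce are my own choices, not from the original post):

from functools import reduce
import pyspark.sql.functions as F

def get_null_names(df, lrdd):
    # collect the subsets on the driver and run one filter per subset
    result = {}
    for subset in lrdd.collect():
        cond = reduce(lambda a, b: a & b, [F.col(c).isNull() for c in subset])
        names = [r["name"] for r in df.filter(cond).select("name").collect()]
        if names:
            result[",".join(subset)] = names
    return result

# get_null_names(df, lrdd) should give
# {'c,d,e': ['Bob', 'Charlie'], 'f': ['Alice']}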

So, my question is: how can I efficiently use the content of a distributed collection as a parameter for querying a SQL DataFrame? Any hint is appreciated.

Thanks a lot.

2 Answers:

Answer 0 (score: 1)

You should reconsider the format of your data. Rather than having this many columns, you should explode them into rows, so the computation can be distributed:

import pyspark.sql.functions as psf
df = df.select(
    "name", 
    psf.explode(
        psf.array(
            *[psf.struct(
                psf.lit(c).alias("feature_name"), 
                df[c].alias("feature_value")
            ) for c in df.columns if c != "name"]
        )
    ).alias("feature")
).select("name", "feature.*")

    +-------+------------+-------------+
    |   name|feature_name|feature_value|
    +-------+------------+-------------+
    |  Alice|           a|            1|
    |  Alice|           b|            2|
    |  Alice|           c|         null|
    |  Alice|           d|          red|
    |  Alice|           e|         null|
    |  Alice|           f|         null|
    |    Bob|           a|            1|
    |    Bob|           b|         null|
    |    Bob|           c|         null|
    |    Bob|           d|         null|
    |    Bob|           e|         null|
    |    Bob|           f|        apple|
    |Charlie|           a|            2|
    |Charlie|           b|            3|
    |Charlie|           c|         null|
    |Charlie|           d|         null|
    |Charlie|           e|         null|
    |Charlie|           f|       orange|
    +-------+------------+-------------+

We do the same with lrdd, but we change it slightly first:

subsets = spark\
    .createDataFrame(lrdd.map(lambda l: [l]), ["feature_set"])\
    .withColumn("feature_name", psf.explode("feature_set"))

    +-----------+------------+
    |feature_set|feature_name|
    +-----------+------------+
    |     [a, b]|           a|
    |     [a, b]|           b|
    |  [c, d, e]|           c|
    |  [c, d, e]|           d|
    |  [c, d, e]|           e|
    |        [f]|           f|
    +-----------+------------+

Now we can join the two on feature_name and keep, for each feature_set and name, only the groups whose feature_value is exclusively null. If the subsets table is not too big, you should broadcast it:

df_join = df.join(psf.broadcast(subsets), "feature_name")
res = df_join.groupBy("feature_set", "name").agg(
    psf.count("*").alias("count"),
    psf.sum(psf.isnull("feature_value").cast("int")).alias("nb_null")
).filter("nb_null = count")

    +-----------+-------+-----+-------+
    |feature_set|   name|count|nb_null|
    +-----------+-------+-----+-------+
    |  [c, d, e]|Charlie|    3|      3|
    |        [f]|  Alice|    1|      1|
    |  [c, d, e]|    Bob|    3|      3|
    +-----------+-------+-----+-------+

You can always groupBy feature_set afterwards.
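
For completeness, a possible sketch of that final step (my own addition, not part of the original answer): group by feature_set and collect the matching names.

final = res.groupBy("feature_set").agg(
    psf.collect_list("name").alias("names")
)
final.show()

# expected output (row and list order may vary):
# +-----------+--------------+
# |feature_set|         names|
# +-----------+--------------+
# |  [c, d, e]|[Charlie, Bob]|
# |        [f]|       [Alice]|
# +-----------+--------------+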

Answer 1 (score: 0)

You can try this approach.

First, cross join the two dataframes:

    from pyspark.sql.types import *
    lrdd = sc.parallelize([['a', 'b'], ['c', 'd', 'e'], ['f']]) \
             .map(lambda x: ("key", x))

    schema = StructType([StructField("K", StringType()),
                         StructField("X", ArrayType(StringType()))])

    df2 = spark.createDataFrame(lrdd, schema).select("X")
    df3 = df.crossJoin(df2)

Result of the cross join:

    +-------+---+----+----+----+----+------+---------+
    |   name|  a|   b|   c|   d|   e|     f|        X|
    +-------+---+----+----+----+----+------+---------+
    |  Alice|  1|   2|null| red|null|  null|   [a, b]|
    |  Alice|  1|   2|null| red|null|  null|[c, d, e]|
    |  Alice|  1|   2|null| red|null|  null|      [f]|
    |    Bob|  1|null|null|null|null| apple|   [a, b]|
    |Charlie|  2|   3|null|null|null|orange|   [a, b]|
    |    Bob|  1|null|null|null|null| apple|[c, d, e]|
    |    Bob|  1|null|null|null|null| apple|      [f]|
    |Charlie|  2|   3|null|null|null|orange|[c, d, e]|
    |Charlie|  2|   3|null|null|null|orange|      [f]|
    +-------+---+----+----+----+----+------+---------+

Now filter out the rows using a udf:

from pyspark.sql.functions import udf, struct, collect_list 

def foo(data):
    # attributes from the subset data['X'] that are non-null in this row
    d = list(filter(lambda x: data[x], data['X']))
    # keep the row only when every attribute of the subset is null
    return len(d) == 0

udf_foo = udf(foo, BooleanType())

df4 = df3.filter(udf_foo(struct([df3[x] for x in df3.columns]))).select("name", 'X')



df4.show()
+-------+---------+
|   name|        X|
+-------+---------+
|  Alice|      [f]|
|    Bob|[c, d, e]|
|Charlie|[c, d, e]|
+-------+---------+

Then use groupby and collect_list to get the desired output:

df4.groupby("X").agg(collect_list("name").alias("name")).show()
+--------------+---------+
|          name|        X|
+--------------+---------+
|       [Alice]|      [f]|
|[Bob, Charlie]|[c, d, e]|
+--------------+---------+
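
If you need the exact Python dict from the question, one possible final step (my own addition, not part of this answer) is to collect this small grouped result to the driver:

grouped = df4.groupby("X").agg(collect_list("name").alias("name"))
result = {",".join(row["X"]): row["name"] for row in grouped.collect()}
print(result)
# expected: {'f': ['Alice'], 'c,d,e': ['Bob', 'Charlie']}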