I have the following table schema:
1 id Primary int(11)
2 name text
3 website text
4 logo text
5 year text
6 origin text
7 offices text
8 employees text
9 languages text
10 phone text
11 address text
12 types text
13 platforms text
14 mindeposit text
15 minaccount text
16 mintradesize text
17 maxtradesize text
18 bonus text
19 payouts text
20 accounttypes text
21 demo text
22 demourl text
23 liveurl text
24 depositmethods text
25 withdrawal text
26 tradingmethods text
27 assetinfo text
28 numberassets text
29 expiry text
30 accountcurrency text
31 tradingcurrency text
32 usonly text
33 regulated text
34 regulatedtext text
35 licensed text
36 licensedtext text
37 commissions text
38 commisionsstext text
39 fees text
40 feestext text
41 mobiletrading text
42 tablettrading text
43 scamhistory text
44 leverage text
45 spread text
46 overallscore text
47 pros text
48 cons text
49 features text
50 siterating text
51 status text
52 review text
53 about text
54 publiclisting text
55 techsupport text
56 spreadbetting text
57 binaryoptions text
58 blockedcountries text
59 acceptedcountries text
60 avgspreadeurusd text
61 avgspreadgbpusd text
62 avgspreadgold text
63 fractionalpip text
64 demoexpiration text
65 eacompitable text
66 eaallowed text
67 liquidityproviders text
68 cashback text
69 tradingsignals text
70 freevps text
71 nativesupport text
72 24hours text
73 centaccount text
74 miniaccount text
75 standardaccount text
76 vipaccount text
77 ecnaccount text
78 maxleverageforex text
79 maxleveragecommodities text
80 maxleverageindices text
81 maxleveragecfd text
82 stopout text
83 executiontype text
84 availableindices text
85 availablecommodities text
86 availablecfd text
87 bitcoin text
88 usdindex text
89 limitorder text
90 chartingpackage text
91 newsstreaming text
92 marketorder text
93 stoporder text
94 tradeoffcharts text
95 rolloverfee text
96 newstrading text
97 personalaccmanager text
98 livechat text
99 smsnotifications text
100 swapaccounts text
101 segregatedaccounts text
102 interestonmargin text
103 managedaccounts text
104 pamm text
105 fax text
106 email text
107 platformtimezone text
108 webtrading text
109 autotrading text
110 trustedmanagment text
111 affiliate text
112 advisors text
113 education text
114 api text
115 oco text
116 hedging text
117 dailyanal text
118 trailingstops text
119 oneclicktrading text
120 contests text
121 otherinstruments text
122 decimals text
123 scalping text
124 screenshot1 text
125 screenshot2 text
126 screenshot3 text
127 resource text
128 published tinyint(1)
Now, I have a DataFrame like this:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import *
sc = SparkContext()
spark = SparkSession.builder.appName("test").getOrCreate()
schema = StructType([
    StructField("name", StringType(), True),
    StructField("a", StringType(), True),
    StructField("b", StringType(), True),
    StructField("c", StringType(), True),
    StructField("d", StringType(), True),
    StructField("e", StringType(), True),
    StructField("f", StringType(), True)])

arr = [("Alice", "1", "2", None, "red", None, None),
       ("Bob", "1", None, None, None, None, "apple"),
       ("Charlie", "2", "3", None, None, None, "orange")]
df = spark.createDataFrame(arr, schema)
df.show()
#+-------+---+----+----+----+----+------+
#| name| a| b| c| d| e| f|
#+-------+---+----+----+----+----+------+
#| Alice| 1| 2|null| red|null| null|
#| Bob| 1|null|null|null|null| apple|
#|Charlie| 2| 3|null|null|null|orange|
#+-------+---+----+----+----+----+------+
My goal is to find the names whose values are null for an entire subset of attributes. The subsets are stored in an RDD of lists:

lrdd = sc.parallelize([['a', 'b'], ['c', 'd', 'e'], ['f']])

For the example above, the expected result is:

{'c,d,e': ['Bob', 'Charlie'], 'f': ['Alice']}

So far I came up with a rather naive solution (sketched below): collect the list of subsets and then loop over it, querying the DataFrame once per subset. This works, but it is very inefficient. Also consider that the real attribute schema has on the order of 10,000 attributes, which leads to 600+ disjoint lists in lrdd.
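For reference, a minimal sketch of what such a naive collect-and-loop approach could look like (this is my reconstruction, not the original code; the isNull-based filter is an assumption):

from pyspark.sql.functions import col

result = {}
for subset in lrdd.collect():            # pull the subsets to the driver
    cond = None
    for c in subset:                     # build: c IS NULL AND d IS NULL AND ...
        cond = col(c).isNull() if cond is None else cond & col(c).isNull()
    names = [r["name"] for r in df.filter(cond).select("name").collect()]
    if names:                            # skip subsets that no row satisfies
        result[",".join(subset)] = names
# result == {'c,d,e': ['Bob', 'Charlie'], 'f': ['Alice']}  -- one DataFrame query per subset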
So, my question is: how can I efficiently use the contents of a distributed collection as parameters for querying a SQL DataFrame? Any hints are appreciated.
Many thanks.
Answer 0 (score: 1)
You should reconsider the format of your data. Instead of having so many columns, you should explode it to get more rows and allow distributed computation:
import pyspark.sql.functions as psf

df = df.select(
    "name",
    psf.explode(
        psf.array(
            *[psf.struct(
                psf.lit(c).alias("feature_name"),
                df[c].alias("feature_value")
            ) for c in df.columns if c != "name"]
        )
    ).alias("feature")
).select("name", "feature.*")
+-------+------------+-------------+
| name|feature_name|feature_value|
+-------+------------+-------------+
| Alice| a| 1|
| Alice| b| 2|
| Alice| c| null|
| Alice| d| red|
| Alice| e| null|
| Alice| f| null|
| Bob| a| 1|
| Bob| b| null|
| Bob| c| null|
| Bob| d| null|
| Bob| e| null|
| Bob| f| apple|
|Charlie| a| 2|
|Charlie| b| 3|
|Charlie| c| null|
|Charlie| d| null|
|Charlie| e| null|
|Charlie| f| orange|
+-------+------------+-------------+
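As a side note (my addition, not part of the original answer): on Spark versions where the SQL stack generator is available, the same wide-to-long melt can be written more compactly. The sketch below assumes df_wide is a stand-in name for the original wide DataFrame (before the reassignment of df above); all value columns here are strings, which stack requires to share a type:

value_cols = [c for c in df_wide.columns if c != "name"]
stack_expr = "stack({n}, {args}) as (feature_name, feature_value)".format(
    n=len(value_cols),
    args=", ".join("'{c}', {c}".format(c=c) for c in value_cols)
)
df_long = df_wide.selectExpr("name", stack_expr)  # same rows as the exploded DataFrame above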
We'll do the same to lrdd, but we'll change it a bit first:
subsets = spark\
    .createDataFrame(lrdd.map(lambda l: [l]), ["feature_set"])\
    .withColumn("feature_name", psf.explode("feature_set"))
+-----------+------------+
|feature_set|feature_name|
+-----------+------------+
| [a, b]| a|
| [a, b]| b|
| [c, d, e]| c|
| [c, d, e]| d|
| [c, d, e]| e|
| [f]| f|
+-----------+------------+
Now we can join these on feature_name and keep the feature_set / name combinations whose feature_value is null for every feature in the set. If the lrdd table is not too big, you should broadcast it so the large melted DataFrame is not shuffled during the join:

df_join = df.join(psf.broadcast(subsets), "feature_name")
res = df_join.groupBy("feature_set", "name").agg(
    psf.count("*").alias("count"),
    psf.sum(psf.isnull("feature_value").cast("int")).alias("nb_null")
).filter("nb_null = count")
+-----------+-------+-----+-------+
|feature_set| name|count|nb_null|
+-----------+-------+-----+-------+
| [c, d, e]|Charlie| 3| 3|
| [f]| Alice| 1| 1|
| [c, d, e]| Bob| 3| 3|
+-----------+-------+-----+-------+
Afterwards, you can always groupBy on feature_set.
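A minimal sketch (my addition, not from the original answer) of that final grouping, collecting the result back into the dict shape requested in the question:

final = res.groupBy("feature_set").agg(psf.collect_list("name").alias("names"))
out = {",".join(r["feature_set"]): r["names"] for r in final.collect()}
# e.g. {'c,d,e': ['Bob', 'Charlie'], 'f': ['Alice']}  (list order may vary)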
Answer 1 (score: 0)
You can try this approach. First, cross join the two DataFrames:
from pyspark.sql.types import *

lrdd = sc.parallelize([['a', 'b'], ['c', 'd', 'e'], ['f']]) \
    .map(lambda x: ("key", x))

schema = StructType([StructField("K", StringType()),
                     StructField("X", ArrayType(StringType()))])

df2 = spark.createDataFrame(lrdd, schema).select("X")
df3 = df.crossJoin(df2)
The result of the cross join:
+-------+---+----+----+----+----+------+---------+
| name| a| b| c| d| e| f| X|
+-------+---+----+----+----+----+------+---------+
| Alice| 1| 2|null| red|null| null| [a, b]|
| Alice| 1| 2|null| red|null| null|[c, d, e]|
| Alice| 1| 2|null| red|null| null| [f]|
| Bob| 1|null|null|null|null| apple| [a, b]|
|Charlie| 2| 3|null|null|null|orange| [a, b]|
| Bob| 1|null|null|null|null| apple|[c, d, e]|
| Bob| 1|null|null|null|null| apple| [f]|
|Charlie| 2| 3|null|null|null|orange|[c, d, e]|
|Charlie| 2| 3|null|null|null|orange| [f]|
+-------+---+----+----+----+----+------+---------+
Now filter out rows using a udf:

from pyspark.sql.functions import udf, struct, collect_list
def foo(data):
    # keep the features from data['X'] whose value in this row is not null/empty
    d = list(filter(lambda x: data[x], data['X']))
    # keep the row only when every feature in the subset is null
    return len(d) == 0

udf_foo = udf(foo, BooleanType())

df4 = df3.filter(udf_foo(struct([df3[x] for x in df3.columns]))).select("name", "X")
df4.show()
+-------+---------+
| name| X|
+-------+---------+
| Alice| [f]|
| Bob|[c, d, e]|
|Charlie|[c, d, e]|
+-------+---------+
Then use groupby and collect_list to get the desired output:
df4.groupby("X").agg(collect_list("name").alias("name")).show()
+---------+--------------+
|        X|          name|
+---------+--------------+
|      [f]|       [Alice]|
|[c, d, e]|[Bob, Charlie]|
+---------+--------------+