I have a DataFrame of the form:
+--------------+------------+----+
|             s|variant_hash|call|
+--------------+------------+----+
|C1046::HG02024|    83779208|   0|
|C1046::HG02025|    83779208|   1|
|C1046::HG02026|    83779208|   0|
|C1047::HG00731|    83779208|   0|
|C1047::HG00732|    83779208|   1|
...
I would like to use collect_list() to transform it into:
+--------------------+-------------------------------------+
|                   s|                       feature_vector|
+--------------------+-------------------------------------+
|      C1046::HG02024|    [(83779208, 0), (68471259, 2)...]|
+--------------------+-------------------------------------+
where the feature_vector column is a list of (variant_hash, call) tuples. I planned to use groupBy and agg(collect_list()) to get this result, but I receive the following error:
Traceback (most recent call last):
File "/tmp/ba6a891c-529b-4c75-a76f-8ab20f4377ba/ml_on_vds.py", line 43, in <module>
vector_df = svc_df.groupBy('s').agg(func.collect_list(('variant_hash', 'call')))
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/functions.py", line 39, in _
File "/usr/lib/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py", line 323, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.collect_list. Trace:
py4j.Py4JException: Method collect_list([class java.util.ArrayList]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
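The code below shows my imports. I don't think importing HiveContext and calling enableHiveSupport is necessary in 2.0.2, but I hoped that doing so would fix the problem. Sadly, no luck. Does anyone have suggestions for resolving this import issue?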
from pyspark.sql import SparkSession
from pyspark import SparkConf, SparkContext, HiveContext
from pyspark.sql.functions import udf, hash, collect_list
from pyspark.sql.types import *
from hail import *
# Initialize the SparkSession
spark = (SparkSession.builder.appName("PopulationGenomics")
         .config("spark.sql.files.openCostInBytes", "1099511627776")
         .config("spark.sql.files.maxPartitionBytes", "1099511627776")
         .config("spark.hadoop.io.compression.codecs", "org.apache.hadoop.io.compress.DefaultCodec,is.hail.io.compress.BGzipCodec,org.apache.hadoop.io.compress.GzipCodec")
         .enableHiveSupport()
         .getOrCreate())
I am trying to run this code on a gcloud Dataproc cluster.
Answer 0 (score: 1)
So it raises the error on this line:
vector_df = svc_df.groupBy('s').agg(func.collect_list(('variant_hash', 'call')))
You call collect_list as func.collect_list, but you imported the function as:
from pyspark.sql.functions import udf, hash, collect_list
You probably meant to import the functions module as 'func', like this:
from pyspark.sql import functions as func