Column is not iterable in pySpark

Time: 2017-03-13 00:31:28

Tags: apache-spark pyspark apache-spark-sql spark-dataframe

So, we're a bit confused. In a Jupyter Notebook, we have the following DataFrame:

+--------------------+--------------+-------------+--------------------+--------+-------------------+
|          created_at|created_at_int|  screen_name|            hashtags|ht_count|     single_hashtag|
+--------------------+--------------+-------------+--------------------+--------+-------------------+
|2017-03-05 00:00:...|    1488672001|     texanraj|  [containers, cool]|       1|         containers|
|2017-03-05 00:00:...|    1488672001|     texanraj|  [containers, cool]|       1|               cool|
|2017-03-05 00:00:...|    1488672002|   hubskihose|[automation, future]|       1|         automation|
|2017-03-05 00:00:...|    1488672002|   hubskihose|[automation, future]|       1|             future|
|2017-03-05 00:00:...|    1488672002|    IBMDevOps|            [DevOps]|       1|             devops|
|2017-03-05 00:00:...|    1488672003|SoumitraKJana|[VoiceOfWipro, Cl...|       1|       voiceofwipro|
|2017-03-05 00:00:...|    1488672003|SoumitraKJana|[VoiceOfWipro, Cl...|       1|              cloud|
|2017-03-05 00:00:...|    1488672003|SoumitraKJana|[VoiceOfWipro, Cl...|       1|             leader|
|2017-03-05 00:00:...|    1488672003|SoumitraKJana|      [Cloud, Cloud]|       1|              cloud|
|2017-03-05 00:00:...|    1488672003|SoumitraKJana|      [Cloud, Cloud]|       1|              cloud|
|2017-03-05 00:00:...|    1488672004|SoumitraKJana|[VoiceOfWipro, Cl...|       1|       voiceofwipro|
|2017-03-05 00:00:...|    1488672004|SoumitraKJana|[VoiceOfWipro, Cl...|       1|              cloud|
|2017-03-05 00:00:...|    1488672004|SoumitraKJana|[VoiceOfWipro, Cl...|       1|managedfiletransfer|
|2017-03-05 00:00:...|    1488672004|SoumitraKJana|[VoiceOfWipro, Cl...|       1|         asaservice|
|2017-03-05 00:00:...|    1488672004|SoumitraKJana|[VoiceOfWipro, Cl...|       1|   interconnect2017|
|2017-03-05 00:00:...|    1488672004|SoumitraKJana|[VoiceOfWipro, Cl...|       1|                hmi|
|2017-03-05 00:00:...|    1488672005|SoumitraKJana|[Cloud, ManagedFi...|       1|              cloud|
|2017-03-05 00:00:...|    1488672005|SoumitraKJana|[Cloud, ManagedFi...|       1|managedfiletransfer|
|2017-03-05 00:00:...|    1488672005|SoumitraKJana|[Cloud, ManagedFi...|       1|         asaservice|
|2017-03-05 00:00:...|    1488672005|SoumitraKJana|[Cloud, ManagedFi...|       1|   interconnect2017|
+--------------------+--------------+-------------+--------------------+--------+-------------------+
only showing top 20 rows

root
 |-- created_at: timestamp (nullable = true)
 |-- created_at_int: integer (nullable = true)
 |-- screen_name: string (nullable = true)
 |-- hashtags: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- ht_count: integer (nullable = true)
 |-- single_hashtag: string (nullable = true)

We are trying to get the number of hashtags per hour. The approach we are taking is to use a Window partitioned by single_hashtag. Like this:

# create WindowSpec
hashtags_24_winspec = Window.partitionBy(hashtags_24.single_hashtag). \
    orderBy(hashtags_24.created_at_int).rangeBetween(-3600, 3600)

However, when we try to compute the sum of the ht_count column with the following:

sum_count_over_time = sum(hashtags_24.ht_count).over(hashtags_24_winspec)

We get the following error:

TypeError: Column is not iterable

The error message is not very informative, and we are confused about which column we are actually supposed to investigate. Any ideas?

1 answer:

Answer 0 (score: 5)

You are using the wrong sum. Python's builtin sum tries to iterate over its argument, and a Spark Column is not iterable; what you want is the aggregate function from pyspark.sql.functions:

from pyspark.sql.functions import sum

sum_count_over_time = sum(hashtags_24.ht_count).over(hashtags_24_winspec)
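Note a side effect of this import (not part of the original answer, but standard Python behavior): it shadows the builtin sum in the current namespace, so plain Python code in the same module changes behavior. A small illustration:

from pyspark.sql.functions import sum

# sum([1, 2, 3]) now fails here: pyspark.sql.functions.sum expects a
# column, not a list. The builtin remains reachable via the builtins module:
import builtins
total = builtins.sum([1, 2, 3])  # 6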

In practice, you may therefore want to use an alias or a module import instead:

from pyspark.sql.functions import sum as sql_sum

# or

import pyspark.sql.functions as F
F.sum(...)
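For context, here is a minimal, self-contained sketch of the fix end to end. The column and variable names follow the question; the SparkSession setup and sample rows are made up for illustration:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Hypothetical stand-in for the hashtags_24 DataFrame from the question
hashtags_24 = spark.createDataFrame(
    [(1488672001, "cloud", 1),
     (1488672002, "cloud", 1),
     (1488675700, "cloud", 1)],
    ["created_at_int", "single_hashtag", "ht_count"],
)

# Same window as in the question: one partition per hashtag, ordered by
# the integer timestamp, covering +/- one hour around each row
hashtags_24_winspec = (
    Window.partitionBy(hashtags_24.single_hashtag)
          .orderBy(hashtags_24.created_at_int)
          .rangeBetween(-3600, 3600)
)

# F.sum is the Spark SQL aggregate; Python's builtin sum would try to
# iterate the Column and raise "TypeError: Column is not iterable"
result = hashtags_24.withColumn(
    "ht_count_hourly", F.sum(hashtags_24.ht_count).over(hashtags_24_winspec)
)
result.show()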