Column is not iterable in pySpark

Time: 2017-03-13 00:31:28

Tags: apache-spark pyspark apache-spark-sql spark-dataframe

So, we're a bit confused. In a Jupyter Notebook, we have the following DataFrame:

+--------------------+--------------+-------------+--------------------+--------+-------------------+
|          created_at|created_at_int|  screen_name|            hashtags|ht_count|     single_hashtag|
+--------------------+--------------+-------------+--------------------+--------+-------------------+
|2017-03-05 00:00:...|    1488672001|     texanraj|  [containers, cool]|       1|         containers|
|2017-03-05 00:00:...|    1488672001|     texanraj|  [containers, cool]|       1|               cool|
|2017-03-05 00:00:...|    1488672002|   hubskihose|[automation, future]|       1|         automation|
|2017-03-05 00:00:...|    1488672002|   hubskihose|[automation, future]|       1|             future|
|2017-03-05 00:00:...|    1488672002|    IBMDevOps|            [DevOps]|       1|             devops|
|2017-03-05 00:00:...|    1488672003|SoumitraKJana|[VoiceOfWipro, Cl...|       1|       voiceofwipro|
|2017-03-05 00:00:...|    1488672003|SoumitraKJana|[VoiceOfWipro, Cl...|       1|              cloud|
|2017-03-05 00:00:...|    1488672003|SoumitraKJana|[VoiceOfWipro, Cl...|       1|             leader|
|2017-03-05 00:00:...|    1488672003|SoumitraKJana|      [Cloud, Cloud]|       1|              cloud|
|2017-03-05 00:00:...|    1488672003|SoumitraKJana|      [Cloud, Cloud]|       1|              cloud|
|2017-03-05 00:00:...|    1488672004|SoumitraKJana|[VoiceOfWipro, Cl...|       1|       voiceofwipro|
|2017-03-05 00:00:...|    1488672004|SoumitraKJana|[VoiceOfWipro, Cl...|       1|              cloud|
|2017-03-05 00:00:...|    1488672004|SoumitraKJana|[VoiceOfWipro, Cl...|       1|managedfiletransfer|
|2017-03-05 00:00:...|    1488672004|SoumitraKJana|[VoiceOfWipro, Cl...|       1|         asaservice|
|2017-03-05 00:00:...|    1488672004|SoumitraKJana|[VoiceOfWipro, Cl...|       1|   interconnect2017|
|2017-03-05 00:00:...|    1488672004|SoumitraKJana|[VoiceOfWipro, Cl...|       1|                hmi|
|2017-03-05 00:00:...|    1488672005|SoumitraKJana|[Cloud, ManagedFi...|       1|              cloud|
|2017-03-05 00:00:...|    1488672005|SoumitraKJana|[Cloud, ManagedFi...|       1|managedfiletransfer|
|2017-03-05 00:00:...|    1488672005|SoumitraKJana|[Cloud, ManagedFi...|       1|         asaservice|
|2017-03-05 00:00:...|    1488672005|SoumitraKJana|[Cloud, ManagedFi...|       1|   interconnect2017|
+--------------------+--------------+-------------+--------------------+--------+-------------------+
only showing top 20 rows

root
 |-- created_at: timestamp (nullable = true)
 |-- created_at_int: integer (nullable = true)
 |-- screen_name: string (nullable = true)
 |-- hashtags: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- ht_count: integer (nullable = true)
 |-- single_hashtag: string (nullable = true)

We are trying to get the number of hashtags per hour. The approach we are taking is to use a Window partitioned by single_hashtag. Like this:

# create WindowSpec
hashtags_24_winspec = Window.partitionBy(hashtags_24.single_hashtag). \
    orderBy(hashtags_24.created_at_int).rangeBetween(-3600, 3600)

However, when we try to compute the sum of the ht_count column with the following:

sum_count_over_time = sum(hashtags_24.ht_count).over(hashtags_24_winspec)

We get the following error:

TypeError: Column is not iterable

The error message is not very informative, and we are confused about which column we are actually supposed to investigate. Any ideas?

1 answer:

Answer 0 (score: 5)

You are using the wrong sum. Python's builtin sum tries to iterate over its argument, and a Spark Column is not iterable; what you want is the aggregate function from pyspark.sql.functions:

from pyspark.sql.functions import sum

sum_count_over_time = sum(hashtags_24.ht_count).over(hashtags_24_winspec)
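Note a side effect of this import (not part of the original answer, but standard Python behavior): it shadows the builtin sum in the current namespace, so plain Python code in the same module changes behavior. A small illustration:

from pyspark.sql.functions import sum

# sum([1, 2, 3]) now fails here: pyspark.sql.functions.sum expects a
# column, not a list. The builtin remains reachable via the builtins module:
import builtins
total = builtins.sum([1, 2, 3])  # 6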

In practice, you may therefore want to use an alias or a module import instead:

from pyspark.sql.functions import sum as sql_sum

# or

import pyspark.sql.functions as F
F.sum(...)
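For context, here is a minimal, self-contained sketch of the fix end to end. The column and variable names follow the question; the SparkSession setup and sample rows are made up for illustration:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Hypothetical stand-in for the hashtags_24 DataFrame from the question
hashtags_24 = spark.createDataFrame(
    [(1488672001, "cloud", 1),
     (1488672002, "cloud", 1),
     (1488675700, "cloud", 1)],
    ["created_at_int", "single_hashtag", "ht_count"],
)

# Same window as in the question: one partition per hashtag, ordered by
# the integer timestamp, covering +/- one hour around each row
hashtags_24_winspec = (
    Window.partitionBy(hashtags_24.single_hashtag)
          .orderBy(hashtags_24.created_at_int)
          .rangeBetween(-3600, 3600)
)

# F.sum is the Spark SQL aggregate; Python's builtin sum would try to
# iterate the Column and raise "TypeError: Column is not iterable"
result = hashtags_24.withColumn(
    "ht_count_hourly", F.sum(hashtags_24.ht_count).over(hashtags_24_winspec)
)
result.show()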