I am currently trying to get the number of services running on a specific IP. The services are stored in a column of a Spark DataFrame as a single comma-separated string. How can I split the string in each field on commas and then total the lengths of the resulting lists?
Answer 0 (score: 1)
Using the PySpark API:
>>> df = spark.createDataFrame([("10.0.0.1", "session1,session2"), ("10.0.0.2", "session1,session3,session4")], ["ip", "session"])
>>> df.show(100, False)
+--------+--------------------------+
|ip      |session                   |
+--------+--------------------------+
|10.0.0.1|session1,session2         |
|10.0.0.2|session1,session3,session4|
+--------+--------------------------+
>>> from pyspark.sql.functions import col, size, split
>>> df = df.withColumn("count", size(split(col("session"), ",")))
>>> df.show(100, False)
+--------+--------------------------+-----+
|ip      |session                   |count|
+--------+--------------------------+-----+
|10.0.0.1|session1,session2         |2    |
|10.0.0.2|session1,session3,session4|3    |
+--------+--------------------------+-----+
You can learn more about the PySpark API here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html
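The question also asks for a total, not just a per-row count. As a rough illustration of the split-then-count-then-sum logic, here is a plain-Python sketch that needs no Spark session. It reuses the sample rows from the answer; the helper name `count_items` and its handling of empty values are assumptions made for this example, not part of the PySpark API.

```python
# Plain-Python sketch of the split-and-count logic (no Spark required).
# The rows mirror the sample data used in the answer.
rows = [
    ("10.0.0.1", "session1,session2"),
    ("10.0.0.2", "session1,session3,session4"),
]


def count_items(value, sep=","):
    """Return the number of separator-delimited items in a string."""
    # Hypothetical helper: None and empty strings count as zero items.
    # Without this guard, "".split(",") yields [""] and would count as 1.
    if not value:
        return 0
    return len(value.split(sep))


# Per-row counts, analogous to the "count" column in the answer.
counts = {ip: count_items(session) for ip, session in rows}

# Grand total across all rows.
total = sum(counts.values())

print(counts)  # {'10.0.0.1': 2, '10.0.0.2': 3}
print(total)   # 5
```

In PySpark itself, the same total could presumably be obtained by aggregating the new column, for example with `df.agg({"count": "sum"})`. How empty or null strings should be counted is a choice the sketch makes explicitly and may need checking against your data.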