pyspark - find max and min in json streaming data using createDataFrame

Time: 2018-11-20 12:59:48

Tags: python apache-spark pyspark apache-kafka

I have a set of JSON messages streamed from Kafka, each describing a website user. Using pyspark, I need to count the number of users per country in each streaming window and then return the countries with the maximum and minimum number of users.

Here is an example of the streamed JSON messages:

{"id":1,"first_name":"Barthel","last_name":"Kittel","email":"bkittel0@printfriendly.com","gender":"Male","ip_address":"130.187.82.195","date":"06/05/2018","country":"France"}

Here is my code:

import json

from pyspark.sql.types import StructField, StructType, StringType
from pyspark.sql import Row
from pyspark import SparkContext
from pyspark.sql import SQLContext

fields = ['id', 'first_name', 'last_name', 'email', 'gender', 'ip_address', 'date', 'country']
schema = StructType([
  StructField(field, StringType(), True) for field in fields
])

def parse(s, fields):
    try:
        d = json.loads(s[0])
        return [tuple(d.get(field) for field in fields)]
    except:
        return []

array_of_users = parsed.SQLContext.createDataFrame(parsed.flatMap(lambda s: parse(s, fields)), schema)

rdd = sc.parallelize(array_of_users)

# group by country and then substitute the list of messages for each country by its length, resulting into a rdd of (country, length) tuples
country_count = rdd.groupBy(lambda user: user['country']).mapValues(len)

# identify the min and max using as comparison key the second element of the (country, length) tuple
country_min = country_count.min(key = lambda grp: grp[1])
country_max = country_count.max(key = lambda grp: grp[1])

When I run it, I get this message:

 sudo service moni.sh status
● moni.service
   Loaded: loaded (/etc/init.d/moni.sh; generated)
   Active: failed (Result: exit-code) since Tue 2018-11-20 13:53:37 CET; 1min 41s ago
     Docs: man:systemd-sysv-generator(8)
  Process: 2917 ExecStart=/etc/init.d/moni.sh start (code=exited, status=203/EXEC)

nov 20 13:53:37 ubuntu systemd[1]: Starting moni.service...
nov 20 13:53:37 ubuntu systemd[2917]: moni.service: Failed to execute command: Exec format error
nov 20 13:53:37 ubuntu systemd[2917]: moni.service: Failed at step EXEC spawning /etc/init.d/moni.sh: Exec format error
nov 20 13:53:37 ubuntu systemd[1]: moni.service: Control process exited, code=exited status=203
nov 20 13:53:37 ubuntu systemd[1]: moni.service: Failed with result 'exit-code'.
nov 20 13:53:37 ubuntu systemd[1]: Failed to start moni.se

How can I fix this?

1 Answer:

Answer 0 (score: 1)

If I understand you correctly, you need to group the list of messages by country, count the number of messages in each group, and then select the groups with the minimum and maximum counts.

Off the top of my head, the code would look something like this:

# assuming the array_of_users is your array of messages
rdd = sc.parallelize(array_of_users)

# group by country and then substitute the list of messages for each country by its length, resulting into a rdd of (country, length) tuples
country_count = rdd.groupBy(lambda user: user['country']).mapValues(len)

# identify the min and max using as comparison key the second element of the (country, length) tuple
country_min = country_count.min(key = lambda grp: grp[1])
country_max = country_count.max(key = lambda grp: grp[1])
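
This snippet assumes array_of_users already holds the parsed messages of one window. The question's line array_of_users = parsed.SQLContext.createDataFrame(...) cannot run as written, because createDataFrame is a method of a SQLContext/SparkSession, not an attribute of an RDD or DStream. Below is a minimal sketch (not part of the original answer) of how each streaming window could be turned into a DataFrame and reduced to the min/max countries. It assumes a SparkSession named spark, a DStream of raw Kafka records named parsed, and the question's parse, fields, and schema; the helper name process_window is purely illustrative.

# Sketch only: `spark` (SparkSession), `parsed` (DStream of raw Kafka records),
# and the question's `parse`, `fields`, `schema` are assumed to exist.
def process_window(time, rdd):
    # apply the question's parse() to every record in this window,
    # producing tuples that line up with `schema`
    tuples = rdd.flatMap(lambda s: parse(s, fields))
    if tuples.isEmpty():
        return
    # createDataFrame is called on the SparkSession, not on the RDD
    df = spark.createDataFrame(tuples, schema)
    # count users per country within this window
    counts = df.groupBy("country").count().collect()
    # pick the countries with the most and the fewest users
    country_max = max(counts, key=lambda row: row["count"])
    country_min = min(counts, key=lambda row: row["count"])
    print(time, "max:", country_max, "min:", country_min)

parsed.foreachRDD(process_window)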