In my code I use pyspark for the data manipulation, Python graphene to build a GraphQL layer around the data, and flask to serve the GraphQL API. But I have run into a problem I can't explain: the code does not compute the correct results, and my overall guess is that there may be some concurrency issue. Here is a simplified version of my code:
import graphene
from flask import Flask
from flask_graphql import GraphQLView
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf

app = Flask(__name__)

spark = SparkSession.builder \
    .master("local") \
    .appName("Sencivil") \
    .config("spark.driver.allowMultipleContexts", "true") \
    .getOrCreate()

df = spark.read.format("csv") \
    .option("sep", ";") \
    .option("header", "true") \
    .load("./people.csv") \
    .cache()


class Birthdate(graphene.ObjectType):
    boys = graphene.Int()
    girls = graphene.Int()

    def __init__(self, bd):
        self.bd = bd

    def resolve_boys(self, info):
        boys = df.where(df.gender == 1).groupby("birthdate").count()
        return boys.count()

    def resolve_girls(self, info):
        girls = df.where(df.gender == 2).groupby("birthdate").count()
        return girls.count()


class Person(graphene.ObjectType):
    born = graphene.List(Birthdate)

    def resolve_born(self, info):
        bds = [Birthdate(row.asDict()["birthdate"])
               for row in df.select("birthdate").collect()]
        return bds


class Query(graphene.ObjectType):
    people = graphene.Field(Person)

    def resolve_people(self, info):
        return Person()


schema = graphene.Schema(query=Query)

app.add_url_rule(
    "/graphql",
    view_func=GraphQLView.as_view('graphql', schema=schema, graphiql=True))

if __name__ == "__main__":
    app.run(debug=True)
people.csv
birthdate;gender
13-01-1987;1
02-09-1986;2
13-01-1987;1
02-09-1986;2
12-04-1998;1
The problem is in the resolve_boys and resolve_girls methods of the Birthdate class: I expect boys to come out as 2 and girls as 1, just like in the pyspark shell:
>>> boys = df.where(df.gender == 1).groupby("birthdate").count()
>>> boys.show()
+----------+-----+
| birthdate|count|
+----------+-----+
|13-01-1987| 2|
|12-04-1998| 1|
+----------+-----+
>>> girls = df.where(df.gender == 2).groupby("birthdate").count()
>>> girls.show()
+----------+-----+
| birthdate|count|
+----------+-----+
|02-09-1986| 2|
+----------+-----+
But instead, all I get from the API is 0 for both fields.
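For reference, the query can also be executed directly against the schema from Python, which takes Flask out of the picture; running it that way, I see the same zeros:

result = schema.execute("{ people { born { boys girls } } }")
print(result.errors)  # None
print(result.data)    # every entry comes back as {'boys': 0, 'girls': 0}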
So how can this be fixed? If I change the code that reads the csv file so that it builds the dataframe locally instead, like this:
# df = spark.read.format("csv") \
#     .option("sep", ";") \
#     .option("header", "true") \
#     .load("./people.csv") \
#     .cache()
df = spark.createDataFrame(
    [{'birthdate': "13-01-1987", "gender": 1},
     {'birthdate': "02-09-1986", "gender": 2},
     {'birthdate': "13-01-1987", "gender": 1},
     {'birthdate': "02-09-1986", "gender": 2},
     {'birthdate': "12-04-1998", "gender": 1}]
)
then I get the correct answers (boys comes out as 2 and girls as 1). So something goes wrong when the csv file is read; but why does that happen?
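For what it's worth, the two ways of building the dataframe do not even give the same schema; printSchema makes the difference visible (the csv reader leaves every column as a string unless inferSchema is enabled, while createDataFrame infers gender as a number):

df.printSchema()
# from the csv reader:   birthdate: string, gender: string
# from createDataFrame:  birthdate: string, gender: long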
Answer 0 (score: 0)
Did you solve your problem? As you point out, it looks like the dataframe is not being loaded correctly, but I believe it is caused by the line .load("./people.csv")\, which is not pointing at your csv file.
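A relative path like ./people.csv is resolved against the process's current working directory, which is not necessarily the directory your script lives in when Flask starts it. A minimal sketch to rule that out, assuming people.csv sits next to the script:

import os

# Build an absolute path anchored at this file, not at the working directory
csv_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), "people.csv")
df = spark.read.format("csv") \
    .option("sep", ";") \
    .option("header", "true") \
    .load(csv_path) \
    .cache()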
Just in case, could you wrap the loading code in a try-except block and print out either the dataframe or the error?
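Something along these lines (only a sketch of the debugging step, not a fix):

try:
    df = spark.read.format("csv") \
        .option("sep", ";") \
        .option("header", "true") \
        .load("./people.csv") \
        .cache()
    df.show()  # does this actually print the five rows from people.csv?
except Exception as e:
    print(e)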
Cheers!