My student database has multiple records for each student in the student table.
I am reading the data into a Spark dataframe, then iterating over the dataframe, isolating the records for each student, and doing some processing on each student's records.
My code so far:
from pyspark.sql import SparkSession

spark_session = SparkSession \
    .builder \
    .appName("app") \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.2") \
    .getOrCreate()
class_3A = spark_session.sql("SQL")
for row in class_3A:
#for each student
#Print Name, Age and Subject Marks
How do I do this?
Answer 0 (score: 2)
Another approach is to use SparkSQL:
>>> df = spark.createDataFrame([('Ankit',25),('Jalfaizy',22),('Suresh',20),('Bala',26)],['name','age'])
>>> df.show()
+--------+---+
| name|age|
+--------+---+
| Ankit| 25|
|Jalfaizy| 22|
| Suresh| 20|
| Bala| 26|
+--------+---+
>>> df.where('age > 20').show()
+--------+---+
| name|age|
+--------+---+
| Ankit| 25|
|Jalfaizy| 22|
| Bala| 26|
+--------+---+
>>> from pyspark.sql.functions import *
>>> df.select('name', col('age') + 100).show()
+--------+-----------+
| name|(age + 100)|
+--------+-----------+
| Ankit| 125|
|Jalfaizy| 122|
| Suresh| 120|
| Bala| 126|
+--------+-----------+
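Since this answer mentions SparkSQL but the snippets above use the DataFrame API, it may help to see the same filter expressed as actual SQL. A minimal sketch (Spark 2.x; the view name "students" is arbitrary), which produces the same output as df.where('age > 20') above:

>>> df.createOrReplaceTempView("students")
>>> spark.sql("SELECT name, age FROM students WHERE age > 20").show()
+--------+---+
|    name|age|
+--------+---+
|   Ankit| 25|
|Jalfaizy| 22|
|    Bala| 26|
+--------+---+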
Answer 1 (score: 1)
An imperative approach (in addition to Bala's SQL approach):
class_3A = spark_session.sql("SQL")
def process_student(student_row):
# Do Something with student_row
return processed_student_row
#"isolate records for each student"
# Each student record will be passed to process_student function for processing.
# Results will be accumulated to a new DF - result_df
result_df = class_3A.map(process_student)
# If you don't care about results and just want to do some processing:
class_3A.foreach(process_student)
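If "isolate records for each student" means you need all of one student's rows together in a single call, rather than one row at a time, you can group the underlying RDD by a key column first. A sketch, under the assumption that class_3A has a student_id column:

def process_student_records(student_id, rows):
    # rows is an iterable of all Row objects belonging to this student.
    records = list(rows)
    # Do something with the full set of records; here, just count them.
    return (student_id, len(records))

grouped = class_3A.rdd.groupBy(lambda row: row.student_id)
result = grouped.map(lambda kv: process_student_records(kv[0], kv[1]))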
Answer 2 (score: 0)
You can iterate over each record in the dataframe and access the fields by column name:
from pyspark.sql import Row

# Assumes a SparkContext (sc) and SparkSession (spark), as in the pyspark shell.
l = [('Ankit', 25), ('Jalfaizy', 22), ('Suresh', 20), ('Bala', 26)]
rdd = sc.parallelize(l)
people = rdd.map(lambda x: Row(name=x[0], age=int(x[1])))
schemaPeople = spark.createDataFrame(people)
schemaPeople.show(10, False)

for row in schemaPeople.rdd.collect():
    print("Hi " + str(row.name) + " your age is : " + str(row.age))
This produces output like the following:
+---+--------+
|age|name |
+---+--------+
|25 |Ankit |
|22 |Jalfaizy|
|20 |Suresh |
|26 |Bala |
+---+--------+
Hi Ankit your age is : 25
Hi Jalfaizy your age is : 22
Hi Suresh your age is : 20
Hi Bala your age is : 26
So you can run your processing, or any other logic, against each record of the dataframe.
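One caveat: collect() pulls the entire DataFrame onto the driver, which can run out of memory on a large table. A minimal alternative sketch using toLocalIterator(), which streams rows to the driver one partition at a time instead:

for row in schemaPeople.toLocalIterator():
    print("Hi " + str(row.name) + " your age is : " + str(row.age))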
Answer 3 (score: 0)
Not sure I understood the question correctly, but if you want to operate on the students based on any column, you can use dataframe functions. Example:
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
from pyspark.sql import Window

sc = SparkSession.builder.appName("example") \
    .config("spark.driver.memory", "1g") \
    .config("spark.executor.cores", 2) \
    .config("spark.cores.max", 4).getOrCreate()

# inferSchema makes marks numeric, so sum/max below aggregate numbers, not strings
df1 = sc.read.format("csv").option("header", "true") \
    .option("inferSchema", "true").load("test.csv")

# Window covering all rows of one student
w = Window.partitionBy("student_id")

# Total marks per student
df2 = df1.groupBy("student_id").agg(f.sum(df1["marks"]).alias("total"))

# Attach each student's best mark to every row, then keep only the row(s) that achieve it
df3 = df1.withColumn("max_marks_inanysub", f.max(df1["marks"]).over(w))
df3 = df3.filter(df3["marks"] == df3["max_marks_inanysub"])
df1.show()
df3.show()
Sample data (test.csv):

student_id,subject,marks
1,maths,3
1,science,6
2,maths,4
2,science,7
Output:
+----------+-------+-----+
|student_id|subject|marks|
+----------+-------+-----+
|         1|  maths|    3|
|         1|science|    6|
|         2|  maths|    4|
|         2|science|    7|
+----------+-------+-----+

+----------+-------+-----+------------------+
|student_id|subject|marks|max_marks_inanysub|
+----------+-------+-----+------------------+
|         1|science|    6|                 6|
|         2|science|    7|                 7|
+----------+-------+-----+------------------+
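For completeness, df2 (computed above but never shown) holds the total marks per student: 3 + 6 = 9 for student 1 and 4 + 7 = 11 for student 2 on the sample data. Something like the following, keeping in mind that the row order of a groupBy result is not guaranteed:

df2.show()

+----------+-----+
|student_id|total|
+----------+-----+
|         1|    9|
|         2|   11|
+----------+-----+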