Pyspark groupBy Pivot Transformation

Time: 2018-02-22 16:30:32

Tags: pyspark pivot spark-dataframe pyspark-sql

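I am having trouble building the following PySpark DataFrame operation.

Essentially, I am trying to group by category, then pivot/unmelt the sub-categories and add the results as new columns.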

[Image from the original post; not available]

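I have tried a number of approaches, but they are very slow and do not take advantage of Spark's parallelism.

Here is my existing (slow, verbose) code: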

from pyspark.sql.functions import lit

df = sqlContext.table('Table')

#loop over category
listids = [row[0] for row in df.select("category").distinct().collect()]
dfArray = [df.where(df.category == x) for x in listids]
for d in dfArray:
  #loop over subcategory
  listids_sub = [row[0] for row in d.select("sub_category").distinct().collect()]
  dfArraySub = [d.where(d.sub_category == x) for x in listids_sub]
  num = 1

  for b in dfArraySub:
    #renames all columns to append a number
    for c in b.columns:
      if c not in ['category','sub_category','date']:
        b = b.withColumnRenamed(c, c+'_'+str(num))
    #sub_category is no longer needed once the columns are numbered
    b = b.drop('sub_category')
    num += 1
    #if no df exists, create one and continually join new columns
    try:
      all_subs = all_subs.drop('sub_category').join(b.drop('sub_category'), on=['category','date'], how='left')
    except NameError:
      #all_subs is not defined yet on the first pass
      all_subs = b

  #Fixes missing columns on union
  try:
    try:
      diff_columns = list(set(all_cats.columns) - set(all_subs.columns))
      for d in diff_columns:
        all_subs = all_subs.withColumn(d, lit(None))
      all_cats = all_cats.union(all_subs)
    except:
      diff_columns = list(set(all_subs.columns) - set(all_cats.columns))
      for d in diff_columns:
        all_cats = all_cats.withColumn(d, lit(None))
      all_cats = all_cats.union(all_subs)

  except Exception as e:
    print(e)
    all_cats = all_subs

But this is very slow. Any guidance would be greatly appreciated!

1 Answer:

Answer 0: (score: 1)

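Your expected output is not entirely logical, but this result can be achieved with the pivot function. You need to pin down your rules first; otherwise I can see many cases where this would fail.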
