How to make the data in a PySpark spark.sql DataFrame homogeneous

Asked: 2019-04-11 00:14:58

Tags: python-3.x pandas pyspark pyspark-sql

I downloaded a 1.9 GB CSV file of AirBnB data. Although every column's data type is "string", some columns are not "homogeneous". For example, in the "Amenities" column, some entries contain the number of amenities at that particular property, while others contain a list of the amenities. Everything is in string format.

So, this is what I have so far:

from pyspark import SparkContext, SparkConf
import pandas as pd
import numpy as np
conf = SparkConf().setAppName("app")
sc = SparkContext(conf=conf)

from pyspark.sql import SQLContext
SQLCtx = SQLContext(sc)

air = SQLCtx.read.load('/home/john/Downloads/airbnb-listings.csv',
                       format = "com.databricks.spark.csv",
                       header = "true",
                       sep = ";",
                       inferSchema = "true")

#check for missing values
from pyspark.sql.functions import col,sum
air.select(*(sum(col(c).isNull().cast("int")).alias(c) for c in air.columns)).show()

So, after dropping a few columns and then dropping the missing values, I get:

Keep = ['Price', 'Bathrooms', 'Bedrooms', 'Beds', 'Bed Type', 'Amenities',
       'Security Deposit', 'Cleaning Fee', 'Guests Included', 'Extra People',
       'Review Scores Rating', 'Cancellation Policy','Host Response Rate', 
       'Country Code', 'Zipcode']

data = air.select(*Keep)
reduced2 = data.na.drop()

#final shape after dropping missing values.
print((reduced2.count(), len(reduced2.columns)))

I can convert a few rows to a pandas DataFrame:

df3 = pd.DataFrame(reduced2.take(50), columns = reduced2.columns)

A small slice of the "Amenities" column:

Wireless Internet,Air conditioning,Kitchen,Fre...
2                                                    10
3     Internet,Wireless Internet,Air conditioning,Ki...
4     TV,Cable TV,Internet,Wireless Internet,Air con...
5     TV,Wireless Internet,Air conditioning,Pool,Kit...
6     TV,Wireless Internet,Air conditioning,Pool,Kit...
7     Internet,Wireless Internet,Kitchen,Free parkin...
8     TV,Wireless Internet,Air conditioning,Pool,Kit...
9     Wireless Internet,Air conditioning,Kitchen,Fre...
10    TV,Cable TV,Internet,Wireless Internet,Air con...
14                                                   10
16                                                   10
17    TV,Internet,Wireless Internet,Air conditioning...
18    TV,Cable TV,Internet,Wireless Internet,Air con...
19    TV,Internet,Wireless Internet,Air conditioning...
20    TV,Wireless Internet,Air conditioning,Pool,Kit...
23    TV,Cable TV,Internet,Wireless Internet,Air con...
28                                                    9
33                                                   10
34    Internet,Wireless Internet,Kitchen,Elevator in...
37                                                   10

As you can see, I won't be able to work with this as-is. I can easily do something in plain pandas to fix it, like this:

for i in range(len(df3['Amenities'])):
    if len(df3["Amenities"][i])>2:
        df3['Amenities'][i] = str(len(df3['Amenities'][i].split(',')))
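The row-by-row loop above can also be written as a vectorized pandas operation. A minimal sketch on toy data (the `Amenities` column name is taken from the question; the sample values are invented for illustration):

```python
import pandas as pd

# Toy frame mixing numeric strings and comma-separated amenity lists
df = pd.DataFrame({"Amenities": ["10", "TV,Internet,Kitchen", "9", "Pool,Kitchen"]})

# Keep values that are already pure digits; replace list values with their item count
is_number = df["Amenities"].str.fullmatch(r"\d+")
df["Amenities"] = df["Amenities"].where(
    is_number, df["Amenities"].str.split(",").str.len().astype(str)
)
# df["Amenities"] is now ["10", "3", "9", "2"]
```

This avoids the chained-indexing writes in the loop, which can trigger pandas' SettingWithCopyWarning.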

Now, I realize this is probably not the best way to do it, but it turns everything in the lists into a number. What I need is a way to do this kind of operation on a column of a PySpark SQL DataFrame, if that is possible.

Thanks!

2 answers:

Answer 0 (score: 1)

If I understand you correctly, you want to count the number of comma-separated items, but keep the rows that are already numbers. If so, you can try the following:

from pyspark.sql import functions as F

df.withColumn('Amenities'
    , F.when(df.Amenities.rlike('^\d+$'), df.Amenities) \
       .otherwise(F.size(F.split('Amenities', ","))) \
       .astype("string")
).show()

So, if the Amenities column is already an integer (df.Amenities.rlike('^\d+$')), keep it as-is (df.Amenities); otherwise, use F.size(F.split('Amenities', ",")) to count the number of items. Then cast the result to "string".
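The rule this answer applies per row (keep values that are already numeric, otherwise count the comma-separated items) can be sketched in plain Python, for readers without a Spark session handy. The helper name `normalize_amenity` is made up here:

```python
import re

def normalize_amenity(value: str) -> str:
    # Keep values that are already pure digits (mirrors rlike('^\d+$'))
    if re.fullmatch(r"\d+", value):
        return value
    # Otherwise count the comma-separated items (mirrors F.size(F.split(...)))
    return str(len(value.split(",")))

print(normalize_amenity("10"))                   # already numeric, kept as-is
print(normalize_amenity("TV,Internet,Kitchen"))  # three items
```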

Answer 1 (score: 0)

I'm not familiar with PySpark SQL DataFrames, only vanilla pandas.

I'm not sure exactly what task you're trying to perform, but you could consider splitting the column into two columns. For example (assuming this is possible in PySpark):

# Rows that are already numbers parse cleanly; list rows become NaN
df['Amenities_count'] = pd.to_numeric(df['Amenities'], errors='coerce')
mask_entries_with_list = df['Amenities_count'].isna()
mask_entries_with_number = ~mask_entries_with_list
# For list rows, count the comma-separated items
df.loc[mask_entries_with_list, 'Amenities_count'] = (
    df.loc[mask_entries_with_list, 'Amenities'].str.split(',').str.len()
)
# Numeric rows carry no amenity list
df.loc[mask_entries_with_number, 'Amenities'] = ''

(untested)
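As a quick sanity check of the two-column idea, here is the same split run on a small invented frame in plain pandas:

```python
import pandas as pd

df = pd.DataFrame({"Amenities": ["10", "TV,Internet,Kitchen", "Pool,Kitchen"]})

# Rows whose Amenities is already a number parse cleanly; list rows become NaN
df["Amenities_count"] = pd.to_numeric(df["Amenities"], errors="coerce")
has_list = df["Amenities_count"].isna()

# For list rows, count the comma-separated items; numeric rows keep no list
df.loc[has_list, "Amenities_count"] = (
    df.loc[has_list, "Amenities"].str.split(",").str.len()
)
df.loc[~has_list, "Amenities"] = ""
# Amenities_count is now [10, 3, 2]
```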