I'm trying out pairwise correlation in PySpark. I read an input file and build a DataFrame from it. Now, to pass it to PySpark's correlation function, I need to convert it into an RDD of Vectors. Here is my code so far:
from pyspark.sql.types import StructType, StructField, StringType

input = sc.textFile('File1.csv')
header = input.first()  # extract header
data = input.filter(lambda x: x != header)  # drop the header row
parsedInput = data.map(lambda l: l.split(","))
# define schema
schemaString = "col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 col11 col12"
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)
df_i = sqlContext.createDataFrame(parsedInput, schema)
Now, according to the PySpark documentation on this page, this is how to compute the correlation:
from pyspark.mllib.stat import Statistics

data = ...  # an RDD of Vectors
print(Statistics.corr(data, method="pearson"))
How do I convert the DataFrame df_i into an RDD of Vectors so that I can pass it to corr()?
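For reference, this is roughly the kind of conversion I'm imagining (untested sketch; I'm assuming col3-col6 and col9-col12 are the numeric columns based on my sample data below, since my schema reads everything as strings and col1, col2, col7, and col8 are categorical):

from pyspark.mllib.linalg import Vectors

# assumption: keep only the numeric columns and cast them to float
numeric_cols = ["col3", "col4", "col5", "col6", "col9", "col10", "col11", "col12"]
vectors = df_i.select(numeric_cols).rdd.map(
    lambda row: Vectors.dense([float(x) for x in row]))
print(Statistics.corr(vectors, method="pearson"))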
Also, if there is a better way (than what I have so far) to read the input file and run a pairwise correlation on it with PySpark, please show me an example.
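For instance, I suspect something like Spark's built-in CSV reader could replace the manual split and schema definition, but I'm not sure it's the idiomatic route (this assumes Spark 2.x and a SparkSession named spark; on 1.x the com.databricks.spark.csv package offers the same):

# let Spark parse the CSV and infer column types itself
df_i = spark.read.csv('File1.csv', header=True, inferSchema=True)
# correlation between a single pair of columns, directly on the DataFrame
print(df_i.stat.corr('col11', 'col12'))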
UPDATE: Here is a sample of my input data:
col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12
Cameroon,15 - 24 years old,8,160,408,3387232,no,no,0,82.7,116,0.712931034
Cameroon,15 - 24 years old,8,90,408,3683931,no,yes,39,94.8,89,1.065168539
Cameroon,15 - 24 years old,8,104,408,3663917,no,no,0,183.6,133,1.380451128
Cameroon,15 - 24 years old,8,96,408,3292045,no,no,0,144,102,1.411764706
Cameroon,25 - 39 years old,8,126,408,3399798,yes,no,0,197.6,126,1.568253968
Cameroon,15 - 24 years old,8,146,408,3483581,no,no,0,109,69,1.579710145
Cameroon,15 - 24 years old,8,34,408,3396446,no,no,0,128.8,80,1.61
Cameroon,15 - 24 years old,8,93,408,3607246,no,yes,42,166.9,101,1.652475248
Cameroon,15 - 24 years old,8,42,408,3577060,no,no,0,146.3,84,1.741666667
Cameroon,15 - 24 years old,8,57,408,3573817,no,yes,39,213,115,1.852173913
Cameroon,15 - 24 years old,8,94,408,3444022,no,no,0,207,109,1.899082569