Question

I am working with a CSV that has the following schema:

>>> df.printSchema()
root
 |-- : string (nullable = true)
 |-- country_destination: string (nullable = true)
 |-- lat_destination: string (nullable = true)
 |-- lng_destination: string (nullable = true)
 |-- distance_km: string (nullable = true)
 |-- destination_km2: string (nullable = true)
 |-- destination_language : string (nullable = true)
 |-- language_levenshtein_distance: string (nullable = true)
 |-- id: string (nullable = true)
 |-- date_account_created: string (nullable = true)
 |-- timestamp_first_active: string (nullable = true)
 |-- date_first_booking: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- age: string (nullable = true)
 |-- signup_method: string (nullable = true)
 |-- signup_flow: string (nullable = true)
 |-- language: string (nullable = true)
 |-- affiliate_channel: string (nullable = true)
 |-- affiliate_provider: string (nullable = true)
 |-- first_affiliate_tracked: string (nullable = true)
 |-- signup_app: string (nullable = true)
 |-- first_device_type: string (nullable = true)
 |-- first_browser: string (nullable = true)

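I want to convert it to labeled points (LabeledPoint) so that I can train a random forest on it. I found the sample code below, which handles text data, but I need to apply it to non-text data with many more features than the sample dataset has. I could of course list every feature by hand, but 1) that is not elegant at all, and 2) I don't know the most correct way to write the lambda function.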

The sample code that I don't know how to modify:

data_hashed = df.map(lambda (label, text): LabeledPoint(label, text))

Please advise how to make the column country_destination the target (the y variable, or whatever it is called) and the rest the x variables. Thanks!

I mainly use Python, but I'm not opposed to learning some Scala.
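
For illustration, here is a minimal sketch of one way the conversion could look, not a definitive answer. It assumes every column arrives as a string (as in the schema above), that string-valued columns, including the label, are turned into float indexes, and that numeric-looking columns parse with float(). The helper names (build_index, encode_row, to_labeled_points), the split into categorical and numeric columns, and the 0.0 placeholder for missing numbers are all hypothetical choices of this sketch. Note that the tuple-unpacking lambda in the sample above is Python 2 only syntax, so the sketch takes a whole row instead.

```python
def _norm(value):
    """Treat missing values as the empty string so they get their own index."""
    return "" if value is None else value


def build_index(values):
    """Map each distinct string to a float index (sorted for determinism)."""
    distinct = sorted({_norm(v) for v in values})
    return {v: float(i) for i, v in enumerate(distinct)}


def encode_row(row, label_col, categorical_cols, numeric_cols, indexes, missing=0.0):
    """Turn one row (a dict of strings) into (label, features) as floats."""
    label = indexes[label_col][_norm(row[label_col])]
    features = [indexes[c][_norm(row[c])] for c in categorical_cols]
    # Assumption: missing numeric values are replaced by the `missing` placeholder.
    features += [float(row[c]) if row[c] not in (None, "") else missing
                 for c in numeric_cols]
    return label, features


def to_labeled_points(df, label_col, categorical_cols, numeric_cols):
    """Spark wiring: DataFrame -> RDD of LabeledPoint (needs pyspark installed)."""
    from pyspark.mllib.regression import LabeledPoint

    rows = df.rdd.map(lambda r: r.asDict())
    # Collecting distinct values to the driver only suits low-cardinality columns.
    indexes = {
        c: build_index(rows.map(lambda d, c=c: _norm(d[c])).distinct().collect())
        for c in [label_col] + list(categorical_cols)
    }

    def convert(d):
        label, features = encode_row(d, label_col, categorical_cols,
                                     numeric_cols, indexes)
        return LabeledPoint(label, features)

    return rows.map(convert)


# Hypothetical usage; which columns count as numeric is a guess from their names:
# numeric = ["age", "distance_km", "destination_km2"]
# skipped = {"", "id", "country_destination"} | set(numeric)
# categorical = [c for c in df.columns if c not in skipped]
# points = to_labeled_points(df, "country_destination", categorical, numeric)
```

Building the column lists from df.columns avoids naming every feature by hand. If the indexed columns are to be treated as categorical rather than ordered values, the categoricalFeaturesInfo argument of RandomForest.trainClassifier in pyspark.mllib.tree is probably the place to declare them.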

How to convert a DataFrame with many features to labeled points in Spark

0 answers: