I am working with a CSV that has the following schema:
>>> df.printSchema()
root
|-- : string (nullable = true)
|-- country_destination: string (nullable = true)
|-- lat_destination: string (nullable = true)
|-- lng_destination: string (nullable = true)
|-- distance_km: string (nullable = true)
|-- destination_km2: string (nullable = true)
|-- destination_language : string (nullable = true)
|-- language_levenshtein_distance: string (nullable = true)
|-- id: string (nullable = true)
|-- date_account_created: string (nullable = true)
|-- timestamp_first_active: string (nullable = true)
|-- date_first_booking: string (nullable = true)
|-- gender: string (nullable = true)
|-- age: string (nullable = true)
|-- signup_method: string (nullable = true)
|-- signup_flow: string (nullable = true)
|-- language: string (nullable = true)
|-- affiliate_channel: string (nullable = true)
|-- affiliate_provider: string (nullable = true)
|-- first_affiliate_tracked: string (nullable = true)
|-- signup_app: string (nullable = true)
|-- first_device_type: string (nullable = true)
|-- first_browser: string (nullable = true)
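I want to convert it into LabeledPoints so that I can run a random forest on it. I found the sample code below, which handles text data, but I need to apply it to non-text data with many more features than the sample dataset has. I could of course list every feature by hand, but 1) that is not elegant at all, and 2) I don't know the most correct way to write the lambda function.
The sample code I don't know how to modify: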
data_hashed = df.map(lambda (label, text): LabeledPoint(label, text))
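Please advise how to make the feature country_destination the target (or y variable, or whatever it should be called) and the rest the x variables. Thanks!
I mostly use Python but am not opposed to learning some Scala.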
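For reference, here is a minimal sketch of the kind of row splitting being asked about, written in plain Python so it runs without Spark. It is only one possible approach, not a confirmed solution: the helper names (build_label_index, to_label_and_features) and the bucket count are hypothetical, and it assumes every column can be treated as a categorical string and hashed as "column=value" into a fixed-size vector, which avoids listing the features by hand. Numeric columns such as age or distance_km would probably be better cast to float instead.

```python
import zlib

TARGET = "country_destination"  # column used as the label (y variable)


def build_label_index(rows, target=TARGET):
    """Map each distinct target string to a float class id (sorted for determinism)."""
    labels = sorted({row[target] for row in rows})
    return {name: float(i) for i, name in enumerate(labels)}


def to_label_and_features(row, label_index, target=TARGET, num_buckets=32):
    """Split one row (dict of column -> string) into (label, dense feature list).

    Every non-target column is hashed as "column=value" into one of
    num_buckets slots, so the columns never have to be named explicitly.
    """
    features = [0.0] * num_buckets
    for column, value in row.items():
        if column == target:
            continue  # the target must not leak into the features
        key = "{}={}".format(column, value).encode("utf-8")
        features[zlib.crc32(key) % num_buckets] += 1.0
    return label_index[row[target]], features


# Tiny made-up rows, for illustration only (not taken from the real CSV).
rows = [
    {"country_destination": "US", "gender": "FEMALE", "signup_method": "basic"},
    {"country_destination": "FR", "gender": "MALE", "signup_method": "facebook"},
]
label_index = build_label_index(rows)
label, features = to_label_and_features(rows[0], label_index)

# With PySpark available, the same helper could presumably be applied per row:
#   from pyspark.mllib.regression import LabeledPoint
#   points = df.rdd.map(
#       lambda r: LabeledPoint(*to_label_and_features(r.asDict(), label_index)))
```

Here label_index would have to be built from the distinct values of country_destination beforehand, since MLlib classifiers expect numeric class labels starting at 0.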