Question

I am trying to load SiteCatalyst data that has close to 1000 columns. The code I am using is shown below:

    from pyspark.sql.types import *
    from pyspark.sql import Row, SQLContext
    sqlContext = SQLContext(sc)
    omni_rdd = sc.textFile('hdfs://user/temp/sitecatalyst20170101.gz')
    omni_rdd_delim = omni_rdd.map(lambda line: line.split("\t"))
    omni_df = omni_rdd_delim.map(lambda line: Row(
      col_1 =   line[0]
    , col_2 =   line[1]
    , col_3 =   line[2]
    , ..
    , ..
    , col_999 = line[998]
    )).toDF()

I am getting the following error:

  File "<stdin>", line 2
  SyntaxError: more than 255 arguments

Is there any way to load all 1000 columns into my DataFrame?

-V

Answer 1

You can do it this way. Define a list of column names:

    cols = ['col_0', 'col_1', 'col_2', .........., 'col_999']
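If the columns do not have meaningful names yet, the list does not need to be typed out by hand. A minimal sketch, assuming the placeholder naming scheme `col_0` ... `col_999` from the line above (the real names and the exact column count are not given in the question):

```python
# Assumption: 1000 columns named col_0 ... col_999, as in the list above.
N_COLS = 1000

# A list can hold any number of names, so nothing here passes
# hundreds of keyword arguments to a single call the way Row(...) did.
cols = ['col_{}'.format(i) for i in range(N_COLS)]
```

The "more than 255 arguments" error comes from the `Row(col_1=..., ...)` call itself: Python versions before 3.7 reject a call with more than 255 explicit arguments. Passing the names as one list avoids that.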

Then use it when creating the DataFrame:

    omni_rdd = sc.textFile('hdfs://user/temp/sitecatalyst20170101.gz')
    # Split on tabs, matching the delimiter used in the question
    omni_rdd_delim = omni_rdd.map(lambda line: line.split("\t"))
    omni_df = omni_rdd_delim.toDF(cols)
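One thing to watch for: the DataFrame can only be built if every row has as many fields as there are column names. A rough sketch of a more defensive version is below. The helper names `split_fixed` and `load_wide_tsv` are made up for illustration, and it assumes a Spark 2.x or later `SparkSession` is passed in as `spark` (the question uses the older `SQLContext`), so treat it as a starting point rather than a drop-in replacement.

```python
def split_fixed(line, n_cols, sep="\t"):
    """Split a line and force it to exactly n_cols fields.

    Short rows are padded with empty strings, long rows are truncated.
    (Hypothetical helper, not part of Spark.)
    """
    fields = line.split(sep)[:n_cols]
    return fields + [""] * (n_cols - len(fields))


def load_wide_tsv(spark, path, cols):
    """Load a tab-delimited text file into a DataFrame with the given column names.

    `spark` is assumed to be an existing SparkSession.
    """
    n = len(cols)  # capture the count, not the whole list, in the closure
    rdd = spark.sparkContext.textFile(path).map(lambda line: split_fixed(line, n))
    # A plain list of names is accepted as the schema; every column
    # is inferred as a string because split() only produces strings.
    return spark.createDataFrame(rdd, schema=cols)
```

Usage would look like `omni_df = load_wide_tsv(spark, 'hdfs://user/temp/sitecatalyst20170101.gz', cols)`. Whether silently padding or truncating rows is acceptable depends on the data; filtering out malformed rows instead may be the better choice.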
