Question

I am trying to change the headers of a very large csv file.

I am using SparkSQL.

Every header has some_string in its name, for example some_string.header_name

My Spark configuration: conf = SparkConf().setMaster("local[*]").setAppName("readCSV")

To read the csv file, I use the com.databricks.spark.csv package:

   logs_df = sqlContext.load(
       source = "com.databricks.spark.csv",
       header = 'true',
       inferSchema = 'true',
       path = 'my_file.csv'
   )

My code:

 header = logs_df.first()
 schemaString = header.replace('`some_string.`','')

It produces this error:

  AttributeError                            
  Traceback (most recent call last)
  <ipython-input-63-ccfad59fc785> in <module>()

   1255             raise AttributeError(item)
   1256         except ValueError:
-> 1257             raise AttributeError(item)
   1258 
   1259     def __setattr__(self, key, value):

AttributeError: replace

I would rather not use logs_df.withColumnRenamed(), because I have more than 200 columns.

I would appreciate any ideas on how to change the headers quickly and efficiently.

Answer 1

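I don't know Python well, so I can only offer suggestions; hopefully they give you some hints for Python.

Option 1: I would suggest using an RDD and building the schema with reflection (http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection). In Scala, I would use a case class for this. One concern with RDDs may be performance.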

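Option 2: Another option is to use DataFrame.toDF(colNames: String*): DataFrame. Basically, arrange/select the columns in the order you want and supply the column names as a programmatically built sequence.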
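A minimal Python sketch of Option 2, assuming the unwanted part is a literal prefix "some_string." at the start of each column name. The helper names strip_prefix and rename_all are hypothetical, introduced only for illustration; df.columns and df.toDF(*names) are the PySpark DataFrame members this relies on.

```python
def strip_prefix(name, prefix):
    """Return name without the leading prefix; leave it unchanged if the prefix is absent."""
    return name[len(prefix):] if name.startswith(prefix) else name


def rename_all(df, prefix="some_string."):
    """Rename every column in one pass instead of 200+ withColumnRenamed calls.

    df is expected to be a Spark DataFrame: df.columns lists the current
    names in order, and df.toDF(*names) returns a new DataFrame with the
    same data under the new names. The default prefix is an assumption
    based on the example in the question.
    """
    new_names = [strip_prefix(c, prefix) for c in df.columns]
    return df.toDF(*new_names)
```

With the DataFrame from the question this would be called as logs_df = rename_all(logs_df). Only the schema's names change, so no pass over the data is needed for the rename itself. If stripping the prefix makes two columns collide, the result will have duplicate column names, so it may be worth checking new_names for duplicates first.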

Hope this helps.

SparkSQL, Spark DataFrame: batch rename csv headers
