Question

I have a CSV file structured like this:

Header
Blank Row
"Col1","Col2"
"1,200","1,456"
"2,000","3,450"

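I have two problems reading this file.

I want to skip the header and the blank row
The commas inside the values are not delimiters

Here is what I tried: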

df = (sc.textFile("myFile.csv")
        .map(lambda line: line.split(","))      # Split by comma
        .filter(lambda line: len(line) == 2)    # This helped me ignore the first two rows
        .collect())

However, this does not work: the commas inside the values are read as delimiters, so len(line) returns 4 instead of 2.

I tried another approach:

data = sc.textFile("myFile.csv")
headers = data.take(2) #First two rows to be skipped

The idea was to then use a filter so that the header rows are not read. However, when I tried to print the headers, I got encoded values.

[\x00A\x00Y\x00 \x00J\x00u\x00l\x00y\x00 \x002\x000\x001\x006\x00]

What is the proper way to read a CSV file and skip the first two rows?

Answer 1

Try using csv.reader with the 'quotechar' parameter. It will split the lines correctly. After that you can add filters as needed.

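import csv

# mapPartitions passes an iterator of lines; csv.reader honours quotechar,
# so commas inside quoted values are not treated as delimiters.
df = (sc.textFile("test2.csv")
        .mapPartitions(lambda lines: csv.reader(lines, delimiter=',', quotechar='"'))
        # Keep rows with at least two fields and drop the column-name row.
        .filter(lambda row: len(row) >= 2 and row[0] != 'Col1')
        .toDF(['Col1', 'Col2']))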
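
As a quick check of why quotechar matters, the sketch below runs outside Spark on the sample lines from the question, using only Python's standard csv module. The variable names are illustrative and not part of the original answer.

```python
import csv

# Sample lines copied from the question's file.
lines = ['"Col1","Col2"', '"1,200","1,456"', '"2,000","3,450"']

# A naive split treats the commas inside the quoted values as delimiters.
naive = [line.split(",") for line in lines]

# csv.reader honours the quote character, so each data row has two fields.
parsed = list(csv.reader(lines, delimiter=",", quotechar='"'))

# Same filter idea as in the Spark snippet: keep rows with at least
# two fields and drop the column-name row.
rows = [row for row in parsed if len(row) >= 2 and row[0] != "Col1"]

print(len(naive[1]))  # number of pieces produced by the naive split
print(rows)
```

The naive split breaks the first data line into four pieces, which is the len(line) == 4 behaviour described in the question, while the csv.reader version keeps each value intact.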

Answer 2

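For your first problem, just zip the lines in the RDD with zipWithIndex and filter out the lines you don't want. For the second problem, you could strip the first and last double-quote characters from each line and then split the line on '","'.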

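rdd = sc.textFile("myfile.csv")
df = (rdd.zipWithIndex()                              # pair each line with its 0-based index
         .filter(lambda x: x[1] > 2)                  # drop the header, blank row, and column names
         .map(lambda x: x[0])                         # keep only the line text
         .map(lambda x: x.strip('"').split('","'))    # remove outer quotes, split on ","
         .toDF(["Col1", "Col2"]))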

That said, if you are looking for a standard way to handle CSV files in Spark, it is better to use the spark-csv package from Databricks.
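
The index filter and the quote stripping can be checked without a cluster. The sketch below is a plain-Python stand-in, with enumerate playing the role of zipWithIndex; the list of lines is an assumed rendering of the question's file, with "Header" as a placeholder for the real header text.

```python
# Lines in file order, as sc.textFile would yield them (blank row included).
lines = ['Header', '', '"Col1","Col2"', '"1,200","1,456"', '"2,000","3,450"']

# Indexes are 0-based, so index > 2 drops the header, the blank row,
# and the column-name row.
kept = [line for index, line in enumerate(lines) if index > 2]

# Strip the outer quotes, then split on the quote-comma-quote separator.
rows = [line.strip('"').split('","') for line in kept]

print(rows)
```

This only works when every field is quoted and no value itself contains the '","' sequence, which is one reason a real CSV parser is usually the safer choice.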

Answer 3

If the CSV file always has two columns, this can be implemented in Scala:

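import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.functions.{col, monotonicallyIncreasingId}

val struct = StructType(
  StructField("firstCol", StringType, nullable = true) ::
  StructField("secondCol", StringType, nullable = true) :: Nil)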

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .option("inferSchema", "false")
  .option("delimiter", ",")
  .option("quote", "\"")
  .schema(struct)
  .load("myFile.csv")

df.show(false)

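// The generated IDs are unique and increasing, but only consecutive within a
// single partition, so this index filter assumes the file loads as one partition.
val indexed = df.withColumn("index", monotonicallyIncreasingId())
val filtered = indexed.filter(col("index") > 2).drop("index")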

filtered.show(false)

The result is:

+---------+---------+
|firstCol |secondCol|
+---------+---------+
|Header   |null     |
|Blank Row|null     |
|Col1     |Col2     |
|1,200    |1,456    |
|2,000    |3,450    |
+---------+---------+

+--------+---------+
|firstCol|secondCol|
+--------+---------+
|1,200   |1,456    |
|2,000   |3,450    |
+--------+---------+

Answer 4

Zlidime's answer had the right idea. The working solution is:

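import csv
from pyspark.sql.types import StructType, StructField, StringType

customSchema = StructType([
    StructField("Col1", StringType(), True),
    StructField("Col2", StringType(), True)])

df = (sc.textFile("file.csv")
        # Strip NUL characters from each line before handing it to csv.reader.
        .mapPartitions(lambda partition: csv.reader(
            [line.replace('\0', '') for line in partition],
            delimiter=',', quotechar='"'))
        # Keep rows with at least two fields and drop the column-name row.
        .filter(lambda line: len(line) >= 2 and line[0] != 'Col1')
        .toDF(customSchema))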
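
The replace('\0', '') call deals with the NUL characters visible in the question's output. One plausible cause is a UTF-16 file being decoded as UTF-8; this is an assumption, since the question does not state the file's encoding. The sketch below reproduces that effect in plain Python, using the header text that can be read from the question's output.

```python
# Assumption: the file is UTF-16 (big-endian here) but is decoded as UTF-8.
header = "AY July 2016"
raw = header.encode("utf-16-be")

# For ASCII-only text every UTF-16 code unit is a NUL byte plus the ASCII
# byte, so decoding as UTF-8 leaves a NUL before each character.
misread = raw.decode("utf-8")
print(repr(misread))

# Stripping the NULs, as in the snippet above, recovers ASCII-only text.
cleaned = misread.replace("\0", "")
print(cleaned)
```

Stripping NULs only recovers ASCII-only content. If the encoding is known, converting the file or reading it with an explicit encoding is likely the cleaner fix.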

Answer 5

Why not try the DataFrameReader API from pyspark.sql? It is easy to use. For this problem, I would guess a single call to its csv reader is enough.

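With this API you can also use other parameters, such as treating a row as the header or ignoring leading and trailing whitespace. See the DataFrameReader API documentation.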
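
As a rough sketch of how the DataFrameReader could be combined with the row skipping asked about in the question, the helper below drops the leading lines first and then hands the rest to the csv reader. read_csv_skipping_rows is a hypothetical name, spark is assumed to be an existing SparkSession, and passing an RDD of strings to spark.read.csv assumes Spark 2.2 or later.

```python
def read_csv_skipping_rows(spark, path, n_skip):
    """Read a CSV file with PySpark, dropping the first n_skip physical lines.

    `spark` is assumed to be an existing SparkSession (not created here).
    """
    lines = spark.sparkContext.textFile(path)

    # Pair each line with its 0-based index and keep only lines at or
    # beyond n_skip.
    kept = (lines.zipWithIndex()
                 .filter(lambda pair: pair[1] >= n_skip)
                 .map(lambda pair: pair[0]))

    # Parse the remaining lines as CSV; header=True takes the first kept
    # line as the column names, and quote='"' keeps commas inside quoted
    # values from being treated as delimiters.
    return spark.read.csv(kept, header=True, quote='"')
```

Under these assumptions, read_csv_skipping_rows(spark, "myFile.csv", 2) would skip the header and blank row and use the "Col1","Col2" line for the column names.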

How to skip lines while reading a CSV file as a dataFrame using PySpark?