Spark incorrectly reading CSV

Time: 2018-06-08 00:07:08

Tags: python scala csv apache-spark

I'm trying to read train.csv in Spark, but it seems Spark is somehow reading the file incorrectly. When I read the CSV into pandas with Python, it shows the correct value, 1, as the first entry in project_is_approved. When I read the same CSV with Spark (Scala), I get a string instead, presumably from somewhere else in the dataset.

Why is this happening? Most examples use the same syntax I'm using to read the CSV.

jakeu123@azure3:~$ python
Python 2.7.12 (default, Dec  4 2017, 14:50:18) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> f = requests.get("https://www.dropbox.com/s/2hdbltrl8bh6kbu/train.csv?raw=1", stream=True)
>>> with open("train.csv", "w") as csv:
...     csv.write(f)
... 
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
TypeError: expected a string or other character buffer object
>>> with open("train.csv", "w") as csv:
...     csv.write(f.content)
... 
>>> import pandas as pd
>>> df = pd.read_csv("train.csv")
>>> df[["project_is_approved"]].head(1)
   project_is_approved
0                    1
>>> 
jakeu123@azure3:~$ ./spark/bin/spark-shell
2018-06-07 23:55:02 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2018-06-07 23:55:09 WARN  Utils:66 - Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
2018-06-07 23:55:09 WARN  Utils:66 - Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
Spark context Web UI available at http://azure3:4042
Spark context available as 'sc' (master = local[*], app id = local-1528415709241).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_171)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val df = spark.read.option("header", true).csv("train.csv")
2018-06-07 23:55:27 WARN  ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException
df: org.apache.spark.sql.DataFrame = [id: string, teacher_id: string ... 14 more fields]

scala> df.select($"project_is_approved").show(1)
+--------------------+                                                          
| project_is_approved|
+--------------------+
|I currently have ...|
+--------------------+
only showing top 1 row


scala> :quit

2 answers:

Answer 0: (score: 0)

As far as I know, Spark cannot read a file directly from a URL. So instead of reading the CSV file with Python and writing it to disk just so Spark can read it later, you can read it with pandas and then convert it into a Spark DataFrame (by using a DataFrame you get the benefit of Spark's distributed computation).

I'm not familiar with Scala, so I tried to solve it with pyspark.

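A rough sketch of that approach (it assumes pandas can fetch the Dropbox URL from the question directly):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# pandas reads the CSV straight from the URL; no intermediate file needed
pdf = pd.read_csv("https://www.dropbox.com/s/2hdbltrl8bh6kbu/train.csv?raw=1")

# convert to a Spark DataFrame to get Spark's distributed computation
df = spark.createDataFrame(pdf)
df.select("project_is_approved").show(1)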

Oh, and by the way, I think providing a schema when reading the CSV file is a must: it doesn't kick off any Spark job (unlike schema inference), so you avoid wasting compute resources, and Spark will read the file in the correct format.
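For instance (a pyspark sketch; the column types are assumptions, and a real schema would need to list all 16 columns in file order):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

# illustrative only: train.csv has 16 columns, and every one of them
# would need a StructField here, in file order
schema = StructType([
    StructField("id", StringType()),
    StructField("teacher_id", StringType()),
    StructField("project_is_approved", IntegerType()),
])

# with an explicit schema, Spark skips the schema-inference pass entirely
df = spark.read.option("header", True).schema(schema).csv("train.csv")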

Answer 1: (score: 0)

You need to define the escape character so that commas (,) inside quoted text are ignored while parsing.

This can be done as:
scala> val df = spark.read.option("header",true).option("escape","\"").csv("train.csv");
df: org.apache.spark.sql.DataFrame = [id: string, teacher_id: string ... 14 more fields]

scala> df.select($"project_is_approved").show
+-------------------+
|project_is_approved|
+-------------------+
|                  1|
|                  0|
|                  1|
|                  0|
|                  1|
|                  1|
|                  1|
|                  1|
|                  1|
|                  1|
|                  1|
|                  1|
|                  1|
|                  0|
|                  1|
|                  1|
|                  1|
|                  1|
|                  1|
|                  0|
+-------------------+
only showing top 20 rows

Working example:

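An end-to-end pyspark sketch of the same fix (the download step mirrors the question; the escape option is the one shown above):

import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# fetch the file once, as in the question
r = requests.get("https://www.dropbox.com/s/2hdbltrl8bh6kbu/train.csv?raw=1")
with open("train.csv", "wb") as f:
    f.write(r.content)

# escape='"' lets the parser handle quote characters inside quoted fields,
# so embedded commas no longer break rows apart
df = spark.read.option("header", True).option("escape", '"').csv("train.csv")
df.select("project_is_approved").show(5)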