I want to read train.csv in Spark, but Spark seems to be reading the file incorrectly somehow. When I read the CSV into pandas with Python, it shows the correct value, 1, as the first entry in project_is_approved. When I read the same CSV with Spark (Scala), I get a string that presumably comes from somewhere else in the dataset.
Why is this happening? Most examples use the same syntax I'm using to read the CSV.
jakeu123@azure3:~$ python
Python 2.7.12 (default, Dec 4 2017, 14:50:18)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> f = requests.get("https://www.dropbox.com/s/2hdbltrl8bh6kbu/train.csv?raw=1", stream=True)
>>> with open("train.csv", "w") as csv:
...     csv.write(f)
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
TypeError: expected a string or other character buffer object
>>> with open("train.csv", "w") as csv:
...     csv.write(f.content)
...
>>> import pandas as pd
>>> df = pd.read_csv("train.csv")
>>> df[["project_is_approved"]].head(1)
   project_is_approved
0                    1
>>>
jakeu123@azure3:~$ ./spark/bin/spark-shell
2018-06-07 23:55:02 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2018-06-07 23:55:09 WARN Utils:66 - Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
2018-06-07 23:55:09 WARN Utils:66 - Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
Spark context Web UI available at http://azure3:4042
Spark context available as 'sc' (master = local[*], app id = local-1528415709241).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0
      /_/
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_171)
Type in expressions to have them evaluated.
Type :help for more information.
scala> val df = spark.read.option("header", true).csv("train.csv")
2018-06-07 23:55:27 WARN ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException
df: org.apache.spark.sql.DataFrame = [id: string, teacher_id: string ... 14 more fields]
scala> df.select($"project_is_approved").show(1)
+--------------------+
| project_is_approved|
+--------------------+
|I currently have ...|
+--------------------+
only showing top 1 row
scala> :quit
Answer 0 (score: 0)
As far as I know, Spark can't read a file directly from a URL. So, instead of reading the CSV with Python and writing it to disk just so Spark can read it back later, you can read it with pandas and then convert it to a Spark DataFrame (by using a Spark DataFrame you get the benefit of Spark's distributed computation).
I'm not familiar with Scala, so I tried to work it out with pyspark instead; a sketch follows.
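This is a minimal sketch of that approach, not the answerer's exact code: it assumes pandas is installed, that spark is the session provided by the pyspark shell (or built as below), and it glosses over NaN/dtype conversion details.

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("train-csv").getOrCreate()

# pandas parses quoted CSV fields (and the commas inside them)
# correctly by default, and it can read straight from the URL.
pdf = pd.read_csv("https://www.dropbox.com/s/2hdbltrl8bh6kbu/train.csv?raw=1")

# Convert to a Spark DataFrame to get Spark's distributed execution.
df = spark.createDataFrame(pdf)
df.select("project_is_approved").show(1)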
Oh, and by the way, I think providing a schema when reading the CSV file is a must: with an explicit schema Spark doesn't have to kick off a job just to infer the column types, so you avoid wasting compute resources and the file is read with the correct types.
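For illustration, a schema in pyspark could be declared as below. This is a hypothetical fragment with guessed column types; a real schema would have to describe all 16 columns reported in the Scala output above.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Hypothetical fragment: train.csv has 16 columns, and a schema
# passed to the reader must describe every one of them.
schema = StructType([
    StructField("id", StringType(), True),
    StructField("teacher_id", StringType(), True),
    # ... the remaining columns ...
    StructField("project_is_approved", IntegerType(), True),
])

df = spark.read.option("header", True).schema(schema).csv("train.csv")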
Answer 1 (score: 0)
You need to define the escape character so that commas inside quoted text are ignored while parsing. Spark's CSV reader uses backslash as the escape character by default, whereas this file escapes embedded quotes by doubling them (""), so telling the reader that " itself is the escape character keeps each quoted field, commas and all, in one piece. This can be done as follows:
scala> val df = spark.read.option("header",true).option("escape","\"").csv("train.csv");
df: org.apache.spark.sql.DataFrame = [id: string, teacher_id: string ... 14 more fields]
scala> df.select($"project_is_approved").show
+-------------------+
|project_is_approved|
+-------------------+
| 1|
| 0|
| 1|
| 0|
| 1|
| 1|
| 1|
| 1|
| 1|
| 1|
| 1|
| 1|
| 1|
| 0|
| 1|
| 1|
| 1|
| 1|
| 1|
| 0|
+-------------------+
only showing top 20 rows
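For anyone following along from the Python side, here is the same fix expressed in pyspark (this translation is mine, not part of the original answer; it assumes train.csv is already on disk):

# "escape" set to a double quote handles the doubled-quote ("") style
# of escaping used inside quoted fields of this file.
df = (spark.read
      .option("header", True)
      .option("escape", '"')
      .csv("train.csv"))
df.select("project_is_approved").show(1)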