Provide a schema while reading a csv file as a dataframe

Asked: 2016-10-07 22:02:20

Tags: scala apache-spark dataframe apache-spark-sql spark-csv

I am trying to read a csv file into a dataframe. I know what the schema of my dataframe should be, since I know my csv file. Also, I am using the spark-csv package to read the file. I tried to specify the schema like below.

val pagecount = sqlContext.read.format("csv")
            .option("delimiter"," ").option("quote","")
            .option("schema","project: string ,article: string ,requests: integer ,bytes_served: long")
            .load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")

But when I check the schema of the dataframe I created, it seems to have taken its own schema. Am I doing anything wrong? How do I make Spark pick up the schema I specified?

> pagecount.printSchema
root
|-- _c0: string (nullable = true)
|-- _c1: string (nullable = true)
|-- _c2: string (nullable = true)
|-- _c3: string (nullable = true)

13 Answers:

Answer 0 (score: 32)

Try the below; you don't need to specify the schema. When you set inferSchema to true, it should pick it up from your csv file.

val pagecount = sqlContext.read.format("csv")
     .option("delimiter"," ").option("quote","")
     .option("header", "true")
     .option("inferSchema", "true")
     .load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")

If you want to manually specify the schema, you can do it as below:

import org.apache.spark.sql.types._

val customSchema = StructType(Array(
        StructField("project", StringType, true),
        StructField("article", StringType, true),
        StructField("requests", IntegerType, true),
        StructField("bytes_served", DoubleType, true)))

     val pagecount = sqlContext.read.format("csv")
             .option("delimiter"," ").option("quote","")
             .option("header", "true")
             .schema(customSchema)
             .load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")

Answer 1 (score: 8)

I am using the solution provided by Arunakiran Nulu in my analysis (see the code below). Although it is able to assign the correct types to the columns, all the values returned are null. Previously, I had tried the option .option("inferSchema", "true") and it returned the correct values in the dataframe (although with different types). A likely explanation is sketched after the output below.

val customSchema = StructType(Array(
    StructField("numicu", StringType, true),
    StructField("fecha_solicitud", TimestampType, true),
    StructField("codtecnica", StringType, true),
    StructField("tecnica", StringType, true),
    StructField("finexploracion", TimestampType, true),
    StructField("ultimavalidacioninforme", TimestampType, true),
    StructField("validador", StringType, true)))

val df_explo = spark.read
        .format("csv")
        .option("header", "true")
        .option("delimiter", "\t")
        .option("timestampFormat", "yyyy/MM/dd HH:mm:ss") 
        .schema(customSchema)
        .load(filename)

Result:

root
 |-- numicu: string (nullable = true)
 |-- fecha_solicitud: timestamp (nullable = true)
 |-- codtecnica: string (nullable = true)
 |-- tecnica: string (nullable = true)
 |-- finexploracion: timestamp (nullable = true)
 |-- ultimavalidacioninforme: timestamp (nullable = true)
 |-- validador: string (nullable = true)

and the table is:

|numicu|fecha_solicitud|codtecnica|tecnica|finexploracion|ultimavalidacioninforme|validador|
+------+---------------+----------+-------+--------------+-----------------------+---------+
|  null|           null|      null|   null|          null|                   null|     null|
|  null|           null|      null|   null|          null|                   null|     null|
|  null|           null|      null|   null|          null|                   null|     null|
|  null|           null|      null|   null|          null|                   null|     null|
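
A likely cause (not stated in the original answer) is that the timestamp values in the file do not actually match the timestampFormat given, so every row fails to parse and comes back as nulls under the default PERMISSIVE mode. One way to check this, assuming a Spark version whose csv reader supports the columnNameOfCorruptRecord option, is to keep the raw text of the rejected rows in an extra column:

import org.apache.spark.sql.types._

// Same schema as above, plus a column that collects the raw text of rows
// Spark could not parse against the declared types.
val debugSchema = customSchema.add(StructField("_corrupt_record", StringType, true))

val df_debug = spark.read
        .format("csv")
        .option("header", "true")
        .option("delimiter", "\t")
        .option("timestampFormat", "yyyy/MM/dd HH:mm:ss")
        .option("columnNameOfCorruptRecord", "_corrupt_record")
        .schema(debugSchema)
        .load(filename)
        .cache() // cache first so the corrupt-record column can be queried directly

// Malformed rows show up here with their original line preserved.
df_debug.filter("_corrupt_record is not null").show(false)

If the rejected rows show a different timestamp layout, adjusting timestampFormat accordingly should make the nulls disappear.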

Answer 2 (score: 6)

Thanks to the answer by @Nulu, it works for pyspark with just a few tweaks, as sketched below.
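
A minimal pyspark sketch of that adaptation (assuming the same pagecounts file and the column types from the question; this is not necessarily the answerer's exact code):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

# same columns as in the question, built with the pyspark type classes
customSchema = StructType([
    StructField("project", StringType(), True),
    StructField("article", StringType(), True),
    StructField("requests", IntegerType(), True),
    StructField("bytes_served", DoubleType(), True)])

pagecount = spark.read.format("csv") \
    .option("delimiter", " ") \
    .option("quote", "") \
    .schema(customSchema) \
    .load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")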

Answer 3 (score: 6)

The previous solutions have used a custom StructType.

With spark-sql 2.4.5 (scala version 2.12.10) it is now possible to specify the schema as a string using the schema function:

import org.apache.spark.sql.SparkSession;

val sparkSession = SparkSession.builder()
            .appName("sample-app")
            .master("local[2]")
            .getOrCreate();

val pageCount = sparkSession.read
  .format("csv")
  .option("delimiter","|")
  .option("quote","")
  .schema("project string ,article string ,requests integer ,bytes_served long")
  .load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")

Answer 4 (score: 5)

For those interested in doing this in Python, here is a working version.

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

customSchema = StructType([
    StructField("IDGC", StringType(), True),        
    StructField("SEARCHNAME", StringType(), True),
    StructField("PRICE", DoubleType(), True)
])
productDF = spark.read.load('/home/ForTesting/testProduct.csv', format="csv", header="true", sep='|', schema=customSchema)

testProduct.csv
ID|SEARCHNAME|PRICE
6607|EFKTON75LIN|890.88
6612|EFKTON100HEN|55.66

Hope this helps.

Answer 5 (score: 4)

Here is how to work with a custom schema, a complete demo:

$> shell code:

echo "
Slingo, iOS 
Slingo, Android
" > game.csv

Scala code:

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.{asc, desc}

val customSchema = StructType(Array(
  StructField("game_id", StringType, true),
  StructField("os_id", StringType, true)
))

val csv_df = spark.read.format("csv").schema(customSchema).load("game.csv")
csv_df.show 

csv_df.orderBy(asc("game_id"), desc("os_id")).show
csv_df.createOrReplaceTempView("game_view")
val sort_df = spark.sql("select * from game_view order by game_id, os_id desc")
sort_df.show 

Answer 6 (score: 0)

This is one option: we can pass the column names to the dataframe while loading the CSV (note that this example uses pandas rather than Spark).

import pandas
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv("C:/Users/NS00606317/Downloads/Iris.csv", names=names, header=0)
print(dataset.head(10))

Output

    sepal-length  sepal-width  petal-length  petal-width        class
1            5.1          3.5           1.4          0.2  Iris-setosa
2            4.9          3.0           1.4          0.2  Iris-setosa
3            4.7          3.2           1.3          0.2  Iris-setosa
4            4.6          3.1           1.5          0.2  Iris-setosa
5            5.0          3.6           1.4          0.2  Iris-setosa
6            5.4          3.9           1.7          0.4  Iris-setosa
7            4.6          3.4           1.4          0.3  Iris-setosa
8            5.0          3.4           1.5          0.2  Iris-setosa
9            4.4          2.9           1.4          0.2  Iris-setosa
10           4.9          3.1           1.5          0.1  Iris-setosa

Answer 7 (score: 0)

// import Library
import java.io.StringReader ;

import au.com.bytecode.opencsv.CSVReader

//filename

var train_csv = "/Path/train.csv";

//read as text file

val train_rdd = sc.textFile(train_csv)   

//use string reader to convert in proper format

var full_train_data  = train_rdd.map{line =>  var csvReader = new CSVReader(new StringReader(line)) ; csvReader.readNext();  }   

//declares  types

type s = String

// declare case class for schema

case class trainSchema (Loan_ID :s ,Gender :s, Married :s, Dependents :s,Education :s,Self_Employed :s,ApplicantIncome :s,CoapplicantIncome :s,
    LoanAmount :s,Loan_Amount_Term :s, Credit_History :s, Property_Area :s,Loan_Status :s)

//create DF RDD with custom schema 

var full_train_data_with_schema = full_train_data.mapPartitionsWithIndex{(idx,itr)=> if (idx==0) itr.drop(1); 
                     itr.toList.map(x=> trainSchema(x(0),x(1),x(2),x(3),x(4),x(5),x(6),x(7),x(8),x(9),x(10),x(11),x(12))).iterator }.toDF

Answer 8 (score: 0)

Schema definition as a simple string

If anyone is interested in defining the schema as a simple string with date and timestamp columns:

Create the data file from a terminal or shell:

echo " 
2019-07-02 22:11:11.000999, 01/01/2019, Suresh, abc  
2019-01-02 22:11:11.000001, 01/01/2020, Aadi, xyz 
" > data.csv
  

Define the schema as a string:

    user_schema = 'timesta TIMESTAMP,date DATE,first_name STRING , last_name STRING'
  

Read the data:

    df = spark.read.csv(path='data.csv', schema = user_schema, sep=',', dateFormat='MM/dd/yyyy',timestampFormat='yyyy-MM-dd HH:mm:ss.SSSSSS')

    df.show(10, False)

    +-----------------------+----------+----------+---------+
    |timesta                |date      |first_name|last_name|
    +-----------------------+----------+----------+---------+
    |2019-07-02 22:11:11.999|2019-01-01| Suresh   | abc     |
    |2019-01-02 22:11:11.001|2020-01-01| Aadi     | xyz     |
    +-----------------------+----------+----------+---------+
  

Please note that explicitly defining the schema instead of letting spark infer it also improves spark's read performance, since inferring the schema requires an extra pass over the data.

Answer 9 (score: 0)

In pyspark 2.4 onwards, you can simply use the header parameter to set the correct header:

data = spark.read.csv('data.csv', header=True)

Similarly, if you are using scala, you can use the header option as well, as shown below.
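
For reference, a minimal Scala equivalent (assuming the same data.csv) is:

// header=true tells Spark to use the first line of the file as the column names
val data = spark.read.option("header", "true").csv("data.csv")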

Answer 10 (score: 0)

You can also do this by using the sparkSession and Scala implicits; a sketch of that idea follows.
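
A sketch of that approach (the column names and the pagecounts path are taken from the question; the original snippet may have differed):

import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder()
  .appName("csv-with-column-names")
  .master("local[*]")
  .getOrCreate()

// session implicits, as mentioned in this answer (used e.g. for .toDF on
// local collections and the $-column syntax)
import sparkSession.implicits._

// Read the space-delimited file, let Spark infer the types,
// then attach the desired column names with toDF.
val pagecount = sparkSession.read
  .option("delimiter", " ")
  .option("quote", "")
  .option("inferSchema", "true")
  .csv("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")
  .toDF("project", "article", "requests", "bytes_served")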

Answer 11 (score: 0)

If your spark version is 3.0.1, you can use the following Scala script:

val df = spark.read.format("csv").option("delimiter",",").option("header",true).load("file:///LOCAL_CSV_FILE_PATH")

But this way, all the data types will be set to String; a possible follow-up cast is sketched below.
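
If typed columns are still wanted with this approach, one option (not from the original answer) is to cast them after loading; the column names below are just the ones from the question:

import org.apache.spark.sql.functions.col

// cast selected String columns to the types you actually need
val typed = df
  .withColumn("requests", col("requests").cast("int"))
  .withColumn("bytes_served", col("bytes_served").cast("long"))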

Answer 12 (score: -1)

My solution is:

import org.apache.spark.sql.types._
import org.apache.spark.sql.DataFrame
  val spark = org.apache.spark.sql.SparkSession.builder.
  master("local[*]").
  appName("Spark CSV Reader").
  getOrCreate()

val movie_rating_schema = StructType(Array(
  StructField("UserID", IntegerType, true),
  StructField("MovieID", IntegerType, true),
  StructField("Rating", DoubleType, true),
  StructField("Timestamp", TimestampType, true)))

val df_ratings: DataFrame = spark.read.format("csv").
  option("header", "true").
  option("mode", "DROPMALFORMED").
  option("delimiter", ",").
  //option("inferSchema", "true").
  option("nullValue", "null").
  schema(movie_rating_schema).
  load(args(0)) //"file:///home/hadoop/spark-workspace/data/ml-20m/ratings.csv"

val movie_avg_scores = df_ratings.rdd.map(_.toString()).
  map(line => {
    // drop "[", "]" and then split the str 
    // drop "[", "]" and then split the str
    val fields = line.substring(1, line.length() - 1).split(",")
    //extract (movie id, average rating)
    (fields(1).toInt, fields(2).toDouble)
  }).
  groupByKey().
  map(data => {
    val avg: Double = data._2.sum / data._2.size
    (data._1, avg)
  })
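
For comparison (not part of the original answer), the same per-movie average can be computed without dropping to the RDD level by using the DataFrame API directly:

import org.apache.spark.sql.functions.avg

// group by movie and average the Rating column defined in movie_rating_schema
val movie_avg_scores_df = df_ratings
  .groupBy("MovieID")
  .agg(avg("Rating").alias("avg_rating"))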