Reading and writing files from MS Azure using Python

Date: 2016-09-21 14:25:35

Tags: python azure apache-spark pyspark

I am new to Python and Spark, and I am trying to load a file from Azure into a table. Below is my simple code.

import os
import sys
os.environ['SPARK_HOME'] = r"C:\spark-2.0.0-bin-hadoop2.7"
sys.path.append(r"C:\spark-2.0.0-bin-hadoop2.7\python")
sys.path.append(r"C:\spark-2.0.0-bin-hadoop2.7\python\lib\py4j-0.10.1-src.zip")
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.sql.types import *
from pyspark.sql import *
sc = SparkContext("local", "Simple App")


def loadFile(path, rowDelimiter, columnDelimiter, firstHeaderColName):
    # Read the file with a custom record delimiter via the new Hadoop API.
    loadedFile = sc.newAPIHadoopFile(path, "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
                                     "org.apache.hadoop.io.LongWritable", "org.apache.hadoop.io.Text",
                                     conf={"textinputformat.record.delimiter": rowDelimiter})

    # Split each record into columns and drop the header row.
    rddData = loadedFile.map(lambda l: l[1].split(columnDelimiter)).filter(lambda f: f[0] != firstHeaderColName)

    return rddData


Schema = StructType([
    StructField("Column1", StringType(), True),
    StructField("Column2", StringType(), True),
    StructField("Column3", StringType(), True),
    StructField("Column4", StringType(), True)
])

rData = loadFile("wasbs://Storagename@Accountname.blob.core.windows.net/File.txt",
                 '\r\n', "#|#", "Column1")
sqlContext = SQLContext(sc)
DF = sqlContext.createDataFrame(rData, Schema)
DF.write.saveAsTable("Table1")
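As an aside, the splitting and header-filtering logic in loadFile can be checked outside Spark with plain Python (a minimal sketch; the sample record and delimiters mirror the ones used above):

```python
# Simulate how loadFile parses records: rows are separated by "\r\n",
# columns by "#|#", and the header row (starting with "Column1") is dropped.
raw = "Column1#|#Column2#|#Column3#|#Column4\r\nv1#|#v2#|#v3#|#v4"

rows = raw.split("\r\n")                         # record delimiter
parsed = [r.split("#|#") for r in rows]          # column delimiter
data = [f for f in parsed if f[0] != "Column1"]  # drop the header row

print(data)  # [['v1', 'v2', 'v3', 'v4']]
```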

I am getting an error like FileNotFoundError: [WinError 2] The system cannot find the file specified.

1 Answer:

Answer 0 (score: 0)

@Miruthan, as far as I know, to read data from WASB into Spark, the URL syntax is as follows:

wasb[s]://<containername>@<accountname>.blob.core.windows.net/<path>

Also, Azure Storage Blob (WASB) is used as the storage account associated with an HDInsight cluster. Could you double-check your storage account and container names? Please let me know of any updates.
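To double-check the URL, a tiny helper (hypothetical, purely for illustration) that assembles the wasb[s] path from its parts can help catch typos in the container or account name:

```python
def wasb_url(container, account, path, secure=True):
    """Build a wasb[s]:// URL for an Azure Storage Blob path (illustrative helper)."""
    scheme = "wasbs" if secure else "wasb"
    return "{0}://{1}@{2}.blob.core.windows.net/{3}".format(
        scheme, container, account, path.lstrip("/"))

print(wasb_url("Storagename", "Accountname", "File.txt"))
# wasbs://Storagename@Accountname.blob.core.windows.net/File.txt
```

Note also that, for a non-HDInsight Spark setup, the storage account key typically has to be provided to Hadoop (the `fs.azure.account.key.<accountname>.blob.core.windows.net` configuration property) before wasbs paths can be read.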