I am trying to export a SQL Server table to .csv format with the following PySpark code.
from pyspark import SparkContext
sc = SparkContext("local", "Simple App")
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
df = sqlContext.read.format("jdbc").option("url","jdbc:sqlserver://server:port").option("databaseName","database").option("driver","com.microsoft.sqlserver.jdbc.SQLServerDriver").option("dbtable","table").option("user","uid").option("password","pwd").load()
df.registerTempTable("test")
df.write.format("com.databricks.spark.csv").save("full_path")
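With this approach, converting multiple tables means writing a separate DataFrame for each one. To avoid that, I would like to pass the database name as a command-line argument, take the table names from the user, and build each DataFrame inside a for loop.
Is this even possible? If so, can someone show me how to do it with spark-submit?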
Answer 0 (score: 3)
Just make the following changes to your code and to the spark-submit command:
test.py
import sys
from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext("local", "Simple App")
sqlContext = SQLContext(sc)
# Positional arguments that follow the script name on the spark-submit command line
db_name = sys.argv[1]
table_name = sys.argv[2]
file_name = sys.argv[3]
# Read the table over JDBC; server, port, uid and pwd are placeholders
df = (sqlContext.read.format("jdbc")
      .option("url", "jdbc:sqlserver://server:port")
      .option("databaseName", db_name)
      .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
      .option("dbtable", table_name)
      .option("user", "uid")
      .option("password", "pwd")
      .load())
df.registerTempTable("test")
# Write the DataFrame as CSV to the path given in file_name
df.write.format("com.databricks.spark.csv").save(file_name)
spark-submit command:
spark-submit test.py <db_name> <table_name> <file_name>
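The script above exports one table per run. Since the question asks about looping over several tables, here is a minimal sketch of one way to do that. It assumes every argument after the output directory is a table name. The file name export_tables.py, the helpers parse_args and jdbc_options, and the output_dir argument are my own names for illustration, not part of the original answer; server, port, uid and pwd are still placeholders.

```python
import sys

USAGE = "usage: spark-submit export_tables.py <db_name> <output_dir> <table> [<table> ...]"


def parse_args(argv):
    """Split argv into (db_name, output_dir, list of table names)."""
    if len(argv) < 4:
        raise ValueError(USAGE)
    return argv[1], argv[2], argv[3:]


def jdbc_options(db_name, table_name):
    """Build the JDBC reader options for one table (placeholder credentials)."""
    return {
        "url": "jdbc:sqlserver://server:port",
        "databaseName": db_name,
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
        "dbtable": table_name,
        "user": "uid",
        "password": "pwd",
    }


def main(argv):
    # Imported here so the helpers above can be used without Spark installed.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    db_name, output_dir, tables = parse_args(argv)
    sc = SparkContext("local", "Simple App")
    sqlContext = SQLContext(sc)
    for table_name in tables:
        # One DataFrame per table, written to its own subdirectory of output_dir.
        df = sqlContext.read.format("jdbc").options(**jdbc_options(db_name, table_name)).load()
        df.write.format("com.databricks.spark.csv").save(output_dir + "/" + table_name)
    sc.stop()


if __name__ == "__main__":
    if len(sys.argv) < 4:
        print(USAGE)
    else:
        main(sys.argv)
```

It would be run as: spark-submit export_tables.py <db_name> <output_dir> <table_1> <table_2> ... Each table ends up in its own directory under <output_dir>, because Spark writes a directory of part files rather than a single .csv file.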