PySpark: run all the test cases stored in a directory

Date: 2016-09-02 13:14:37

Tags: pyspark apache-spark-sql

I am trying to develop a script that will run all of the Spark SQL queries saved in a directory. I have been able to do this in plain Python, but PySpark is a different game. Below is the Python script I use to read and execute all of the query files in a directory.

import sys, csv, sqlite3, codecs, unicodedata, string, glob, os
import psycopg2

conn = psycopg2.connect(database="xxx", user="xxxx", password="xxxx", host="localhost", port="5432")
cur = conn.cursor()
print("done")

with open("*.txt", "r") as ins:
    for line in ins:
        # each line holds: query|pmicode
        words = line.split('|')
        print(words)
        query = words[0]
        pmicode = words[1]
        print(query)
        cur = conn.cursor()
        cur.execute(query)
        conn.commit()
conn.close()

Is it possible to replicate this in PySpark?

Thanks, Pankaj

1 Answer:

Answer 0 (score: 0)

I am guessing you want PySpark to pull the data from the Postgres database that you are using in this Python script.

If your current code in Python is something like:

import sys, csv, sqlite3, codecs, unicodedata, string, glob, os
import psycopg2

conn = psycopg2.connect(database="xxx", user="xxxx", password="xxxx", host="localhost", port="5432")
cur = conn.cursor()
print("done")

def runSQL(query):
    # execute one statement against Postgres and commit it
    cur = conn.cursor()
    cur.execute(query)
    conn.commit()

with open("*.txt", "r") as ins:
    for line in ins:
        # each line holds: query|pmicode
        words = line.split('|')
        print(words)
        query = words[0]
        pmicode = words[1]
        print(query)
        runSQL(query)

conn.close()
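One thing to note before moving to PySpark: open("*.txt") treats the asterisk as a literal filename, so the loop above raises FileNotFoundError as written. A minimal sketch of the fix, assuming the query files sit in the script's working directory, is to expand the pattern with glob (already imported):

# open() does not expand wildcards; glob.glob returns the matching paths
for path in glob.glob("*.txt"):
    with open(path, "r") as ins:
        for line in ins:
            runSQL(line.split('|')[0])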

The equivalent would be to use a JDBC connection and execute the commands with sqlContext:

import sys, csv, sqlite3, codecs, unicodedata, string, glob, os

# sqlContext is predefined in the pyspark shell; in a standalone script,
# build it from a SparkContext first
postgres_url = 'jdbc:postgresql://localhost:5432/database'
properties = {"user": "xxxx", "password": "xxxx"}
print("done")

def runSQL(query):
    # wrap the query in a parenthesized subquery so the JDBC source
    # can read it as if it were a table
    return sqlContext.read.jdbc(
        url=postgres_url,
        table="( {0} ) TEMPDB_SPARK_DELINQ".format(query),
        properties=properties)

with open("*.txt", "r") as ins:
    for line in ins:
        words = line.split('|')
        print(words)
        query = words[0]
        pmicode = words[1]
        print(query)
        runSQL(query)
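Note that runSQL now returns a Spark DataFrame instead of executing the statement on the server, so this approach only works for SELECT queries (the JDBC source wraps the subquery in its own SELECT), and nothing is actually read until an action runs. A minimal sketch of how the loop might consume each result, with the *.txt pattern expanded via glob as noted above:

# hypothetical driver loop: read every query file and trigger each query
for path in glob.glob("*.txt"):
    with open(path, "r") as ins:
        for line in ins:
            query, pmicode = line.split('|')[:2]
            df = runSQL(query)
            df.show()  # action: runs the query on Postgres and prints a sample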