I have a Python file called test.py in which I run some pyspark commands:

#!/usr/bin/env python
import sys
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext

conf = SparkConf()
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)

# hivedb and table are assumed to be defined earlier
# (e.g. parsed from sys.argv; not shown here)

# create a data frame from a Hive table
df = sqlContext.table("testing.test")
# register the data frame as a temp table
df.registerTempTable('mytempTable')
# find the number of records in the data frame
records = df.count()
print "records='%s'" % records

if records < 1000000:
    sqlContext.sql("create table {}.{} stored as parquet as select * from mytempTable".format(hivedb, table))
else:
    sqlContext.sql("create table {}.{} stored as parquet as select * from mytempTable where id <= 1000000".format(hivedb, table))
    sqlContext.sql("insert into table {}.{} select * from mytempTable where id > 1000000 and id <= 2000000".format(hivedb, table))
    sqlContext.sql("insert into table {}.{} select * from mytempTable where id > 2000000 and id <= 3000000".format(hivedb, table))
    sqlContext.sql("insert into table {}.{} select * from mytempTable where id > 3000000 and id <= 4000000".format(hivedb, table))
    sqlContext.sql("insert into table {}.{} select * from mytempTable where id > 4000000 and id <= 5000000".format(hivedb, table))
...and so on till the last million.

The statements inside the if/else above are ones I wrote by hand. I want the script to generate this part of the code automatically. How can I generate similar lines of code in the else branch, up to the last million?
Answer 0 (score: 0)

You can use a simple loop:
fmt = "insert into table {hivedb}.{table} select * from mytempTable where id > {low} and id <= {hi}"
for low in range(1000000, 5000000, 1000000):
    stmt = fmt.format(low=low, hi=low + 1000000, hivedb=hivedb, table=table)
    sqlContext.sql(stmt)
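To keep going "till the last million" without hard-coding the stop value, the loop's upper bound can be derived from the data instead. Here is a minimal sketch, assuming the table has a numeric id column as in your script; the max_id name is mine, not from the original code:

# find the largest id so the loop covers every million-row slice
max_id = sqlContext.sql("select max(id) from mytempTable").collect()[0][0]

fmt = "insert into table {hivedb}.{table} select * from mytempTable where id > {low} and id <= {hi}"
for low in range(1000000, max_id, 1000000):
    sqlContext.sql(fmt.format(low=low, hi=low + 1000000, hivedb=hivedb, table=table))

Since range excludes its stop value, a max_id of exactly 3,000,000 gives lows of 1,000,000 and 2,000,000, and the final slice (id > 2000000 and id <= 3000000) still picks up the last rows; any ids above an even million fall into one extra slice.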