The following code ingests 10k-20k records per second, and I want to improve its performance. I'm reading JSON and using Kafka to ingest it into a database. I'm running it on a five-node cluster with Zookeeper and Kafka installed on it.
Can you give me some tips for improving it?
```python
import os
import json
from multiprocessing import Pool
from kafka.client import KafkaClient
from kafka.producer import SimpleProducer

def process_line(line):
    # NOTE: a new SimpleProducer is built for every line; `client` is the
    # global inherited from the parent process when the pool forks
    producer = SimpleProducer(client)
    try:
        jrec = json.loads(line.strip())
        producer.send_messages('twitter2613', json.dumps(jrec))
    except ValueError:
        # skip lines that are not valid JSON
        pass

if __name__ == "__main__":
    client = KafkaClient('10.62.84.35:9092')
    myloop = True
    pool = Pool(30)
    direcToData = os.listdir("/FullData/RowData")
    for loop in direcToData:
        mydir2 = os.listdir("/FullData/RowData/" + loop)
        for i in mydir2:
            if myloop:
                with open("/FullData/RowData/" + loop + "/" + i) as source_file:
                    # hand lines to the pool in chunks of 30 at a time
                    results = pool.map(process_line, source_file, 30)
```
Answer 0 (score 0):

You can import just the functions you need from `os`. That could be a first optimization.
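As a minimal sketch of that suggestion: `os.listdir` is the only thing the script uses from the `os` module, so the blanket import could be narrowed to that one name. Note this is mostly a readability change; both forms bind the same function, so it won't meaningfully affect per-record throughput.

```python
# import only the one function the script actually uses from os
from os import listdir

direcToData = listdir("/FullData/RowData")  # instead of os.listdir(...)
```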