Question

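I have been trying to train a Naive Bayes classifier on a large database (30GB). Because of memory limitations, I have to split the database queries into multiple batches.

I am using a pipeline that looks like this: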

from pyspark.ml import Pipeline
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.feature import HashingTF, StopWordsRemover, StringIndexer, Tokenizer

categoryIndexer = StringIndexer(inputCol="diff", outputCol="label")
tokenizer = Tokenizer(inputCol="text", outputCol="raw")
remover = StopWordsRemover(inputCol="raw", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features", numFeatures=100000)
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")
pipeline = Pipeline(stages=[categoryIndexer, tokenizer, remover, hashingTF, nb])

I then call fit inside a for loop:

for i in range(0,365):
    df = sqlContext.read.jdbc(url=url,table="(SELECT text, diff FROM tweets INNER JOIN djitf ON tweets.created = djitf.day WHERE id > "+ str(i*1000000)+ " AND id < "+ str((i+1)*1000000)+") as table1", properties=properties)
    train_data, test_data = df.randomSplit([0.8, 0.2])
    model = pipeline.fit(train_data)

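But my results show that the model is overwritten each time I call the fit function on the pipeline. How can I keep what has already been fitted and add to it?

Is there a parameter I am missing, or something else? For example, Sklearn has the partial_fit method.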

Answer 1

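There is no parameter you are missing. Spark doesn't support incremental fitting, and it shouldn't be required. Spark can easily handle larger-than-memory data, possibly with disk caching. If your resources are still too small for 30GB of data, then you shouldn't be using Spark at all.

If the problem is only with reading, use either predicates: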

predicates = [
    "id > {0} AND id < {1}".format(i * 1000000, (i + 1) * 1000000)
    for i in range(0, 365)
]

df = sqlContext.read.jdbc(
    url=url,
    # id is selected so that the partition predicates can filter on it
    table="""(SELECT id, text, diff
               FROM tweets INNER
               JOIN djitf ON tweets.created = djitf.day) as table1""",
    predicates=predicates,
    properties=properties)

or ranges for the JDBC reader:

df = sqlContext.read.jdbc(
    url=url,
    table="""(SELECT CAST(id AS INTEGER) AS id, text, diff
               FROM tweets INNER
               JOIN djitf ON tweets.created = djitf.day) as table1""",
    column="id", lowerBound=0, upperBound=366 * 1000000, numPartitions=300,
    properties=properties)
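
Putting the pieces together, a minimal sketch of reading everything as one DataFrame and fitting a single time might look like the following. The helper names id_range_predicates and fit_once are made up for illustration, the half-open ranges (id >= lo AND id < hi) are my assumption so that rows sitting exactly on a batch boundary are not dropped, and MEMORY_AND_DISK persistence is just one possible way to get the disk caching mentioned above. Treat it as a starting point, not a tuned solution.

```python
def id_range_predicates(n_batches, batch_size=1000000):
    """Build one WHERE-clause predicate per JDBC partition.

    Half-open ranges [lo, hi) are used so that every id falls into
    exactly one partition (assumption: ids are non-negative integers).
    """
    return [
        "id >= {0} AND id < {1}".format(i * batch_size, (i + 1) * batch_size)
        for i in range(n_batches)
    ]


def fit_once(sqlContext, pipeline, url, properties, n_batches=365):
    """Read all batches as a single DataFrame and fit the pipeline once."""
    # Imported lazily so the predicate helper works without Spark installed.
    from pyspark import StorageLevel

    # id is part of the subquery output so the predicates can reference it.
    query = """(SELECT id, text, diff
                FROM tweets INNER JOIN djitf
                ON tweets.created = djitf.day) AS table1"""
    df = sqlContext.read.jdbc(
        url=url,
        table=query,
        predicates=id_range_predicates(n_batches),
        properties=properties)

    # Keep what fits in memory and spill the remaining partitions to disk.
    df.persist(StorageLevel.MEMORY_AND_DISK)
    train_data, test_data = df.randomSplit([0.8, 0.2], seed=42)
    model = pipeline.fit(train_data)
    return model, test_data
```

With the pipeline from the question you would call model, test_data = fit_once(sqlContext, pipeline, url, properties) once, instead of calling pipeline.fit inside the loop.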
