Question

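I am new to Spark. I am currently working on Chinese n-grams as a way to practice with Spark and to use it to speed up the computation.

I am running in local mode on a single machine with 32 GB of memory.

However, my Spark program has been running for 2 hours on 8 cores / 16 threads and still has not finished, while the pure Python program finishes in 1 or 2 minutes.

The data comes from MongoDB, and I tried to parallelize my content.

Here is my code: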

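import operator
import re
from collections import Counter

from pyspark import SparkContext
from pyspark.sql import SparkSession

SparkContext.setSystemProperty('spark.executor.memory', '16g')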
sc = SparkContext("local[*]", 'dcard')

my_spark = SparkSession \
    .builder \
    .appName("dcard") \
    .config("spark.mongodb.input.uri", "mongodb://192.168.2.12:27017/dcard.talk_posts") \
    .config("spark.mongodb.output.uri", "mongodb://192.168.2.12:27017/dcard.talk_posts") \
    .getOrCreate()

df = my_spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
content = df.select('content')
content_rdd = content.rdd
paralle_data = sc.parallelize(content_rdd.collect()).cache()

def remove_url_and_punctuation(sentence):
    # remove url
    if 'http' in sentence:
        sentence = re.sub(r'^https?:\/\/.*[\r\n]*', '', sentence, flags=re.MULTILINE)

    # remove punctuation
    text_list = re.split(r'\W+', sentence)
    return list(filter(None, text_list))

def one_to_three_grams(line):
    return (Counter(line), to_ngrams(line, 2), to_ngrams(line, 3))

def to_ngrams(unigrams, length):
    return Counter(zip(*[unigrams[i:] for i in range(length)]))

result = (
    paralle_data
    .flatMap(lambda s: remove_url_and_punctuation(s['content']))
    .map(lambda line: one_to_three_grams(line))
    .reduce(lambda a, b: tuple(map(operator.add, a, b)))
)

The data format looks like this:

print(paralle_data.top(1))
[Row(content='\n聽說今年在屏東某地的潮X高中\n全國繁星第一 (110人)\n但只有46個人上國立\n難道這就是所謂有學校就讀的概念嗎?\n\n還有據說繁星進大學的 都蠻優秀的\n是這樣嗎？')]

remove_url_and_punctuation(paralle_data.top(1)[0]['content'])
['聽說今年在屏東某地的潮X高中',
 '全國繁星第一',
 '110人',
 '但只有46個人上國立',
 '難道這就是所謂有學校就讀的概念嗎',
 '還有據說繁星進大學的',
 '都蠻優秀的',
 '是這樣嗎']

The result should be a tuple of summed counters: (one_grams_counter, two_grams_counter, three_grams_counter)

Am I missing something important in Spark?
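To make the expected shape concrete, here is a small sketch that applies the two helper functions from the code above to two short strings and merges the results the same way the `reduce` step does. The sample strings are chosen for illustration only, and nothing here touches Spark.

```python
import operator
from collections import Counter

def to_ngrams(unigrams, length):
    # Slide a window of `length` over the sequence and count each tuple.
    return Counter(zip(*[unigrams[i:] for i in range(length)]))

def one_to_three_grams(line):
    # For a string, the unigrams are its individual characters.
    return (Counter(line), to_ngrams(line, 2), to_ngrams(line, 3))

# Two illustrative cleaned lines (assumed sample input).
a = one_to_three_grams('是這樣嗎')
b = one_to_three_grams('是這樣')

# Same merge as the reduce step: element-wise Counter addition.
merged = tuple(map(operator.add, a, b))

one_grams_counter, two_grams_counter, three_grams_counter = merged
print(one_grams_counter)
print(two_grams_counter)
print(three_grams_counter)
```

Note that `operator.add` on two `Counter` objects returns a new `Counter` rather than updating either operand.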

UPDATE1:

Following shanmuga's suggestion, I updated my code from

content_rdd = content.rdd
paralle_data = sc.parallelize(content_rdd.collect()).cache()
result = (
    paralle_data
    .flatMap(lambda s: remove_url_and_punctuation(s['content']))
    .map(lambda line: one_to_three_grams(line))
    .reduce(lambda a, b: tuple(map(operator.add, a, b)))
)

to

content_rdd = content_rdd.repartition(16)

result = (
    content_rdd
    .flatMap(lambda s: remove_url_and_punctuation(s['content']))
    .map(lambda line: one_to_three_grams(line))
    .reduce(lambda a, b: tuple(map(operator.add, a, b)))
)

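It has now been running for 3 hours and still has not finished.

UPDATE2:

This is my data size. It is only one tenth of my total data; once the problem is solved, I will apply the code to the full data set.

And the pure Python code.

Screenshot of the Spark master UI while the job is running.

After running for 2.5 hours.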
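One variant that might be worth trying, sketched under assumptions and not measured on this data set: count each partition into a single set of Counters with in-place `update`, so that only one tuple per partition reaches the final merge. `count_partition` and `merge_counts` are hypothetical helper names, the plain lists below stand in for Spark partitions, and the `mapPartitions(...).reduce(...)` call in the closing comment is untested here.

```python
import re
from collections import Counter
from functools import reduce

def remove_url_and_punctuation(sentence):
    # Same cleaning as in the question: drop URL lines, split on non-word chars.
    if 'http' in sentence:
        sentence = re.sub(r'^https?:\/\/.*[\r\n]*', '', sentence, flags=re.MULTILINE)
    return list(filter(None, re.split(r'\W+', sentence)))

def count_partition(rows):
    # Hypothetical helper: accumulate 1-, 2- and 3-gram counts for one
    # partition in place and yield a single tuple of Counters.
    counts = (Counter(), Counter(), Counter())
    for row in rows:
        for line in remove_url_and_punctuation(row['content']):
            counts[0].update(line)
            counts[1].update(zip(line, line[1:]))
            counts[2].update(zip(line, line[1:], line[2:]))
    yield counts

def merge_counts(a, b):
    # Hypothetical helper: fold b into a without building new Counters.
    for left, right in zip(a, b):
        left.update(right)
    return a

# Plain lists of dicts standing in for partitions of Row objects (assumption).
partitions = [
    [{'content': '是這樣嗎？'}],
    [{'content': '是這樣\n都蠻優秀的'}],
]
per_partition = (c for part in partitions for c in count_partition(part))
result = reduce(merge_counts, per_partition)
print(result[0].most_common(3))

# With Spark, the analogous (untested) call would be:
# result = content_rdd.mapPartitions(count_partition).reduce(merge_counts)
```

The design choice here is to avoid `Counter + Counter`, which builds a new `Counter` on every merge step. Whether that copying accounts for the runtime reported above is an assumption that would need to be checked against the Spark UI.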
