Question

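I am writing a custom data source that converts the response (JSON) returned by my REST call into a Dataset. I use jsonPath to extract the required fields from the JSON and to build the schema and the data. I use buildScan from the TableScan trait to convert the data into an RDD. However, my buildScan method is somehow called multiple times, and I end up with an empty Dataset. Can someone explain why buildScan is called multiple times and how to prevent it?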

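On my local system (a Mac) it works fine, but when it runs in a cluster environment, the repeated buildScan calls overwrite the Dataset and it ends up empty.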

# Read the whole input file; the context manager closes it automatically.
with open("bsp_file.txt", encoding="utf-8") as f:
    text = f.read()

# Split on whitespace and strip surrounding punctuation from each token.
words = []
for word in text.split():
    word = word.strip(",.:;-?!–—_ ")
    if word:
        words.append(word)

# Count every run of three consecutive words.
# Stop two words before the end so that words[i + 2] stays in range.
trigrams = {}
for i in range(len(words) - 2):
    key = (words[i], words[i + 1], words[i + 2])
    trigrams[key] = trigrams.get(key, 0) + 1

# Sort the trigrams by count, most frequent first.
ranked = sorted(trigrams.items(), key=lambda item: item[1], reverse=True)

# Print every trigram that occurs at least five times.
for (word, nextword, nextnextword), count in ranked:
    if count < 5:
        break
    print(word, nextword, nextnextword, count)

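I expect the Dataset to show the table values. On my local system it does show the Dataset, but when it runs in a cluster environment, buildScan is called multiple times.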
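To illustrate the symptom, here is a minimal plain-Python sketch of one possible cause. It is an assumption, not something confirmed for my data source: if the scan method consumes one-shot state (an iterator, a stream, a buffer that gets cleared), a second call returns nothing. The class names, `build_scan`, and the sample records are hypothetical stand-ins and are not the Spark API.

```python
# Hypothetical stand-ins for a relation with a scan method; NOT the Spark API.

class OneShotRelation:
    """Fragile: keeps the parsed rows in a one-shot iterator."""

    def __init__(self, records):
        self._rows = iter(records)  # exhausted by the first scan

    def build_scan(self):
        # Draining the iterator means any later call sees no rows.
        return list(self._rows)


class IdempotentRelation:
    """Safer: every scan rebuilds its rows from immutable input."""

    def __init__(self, records):
        self._records = tuple(records)  # never mutated after construction

    def build_scan(self):
        # Return fresh copies so callers cannot change the stored records.
        return [dict(r) for r in self._records]


# Made-up sample data standing in for fields extracted from the JSON response.
records = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]

fragile = OneShotRelation(records)
first_fragile = fragile.build_scan()
second_fragile = fragile.build_scan()  # empty: the iterator is used up

safe = IdempotentRelation(records)
first_safe = safe.build_scan()
second_safe = safe.build_scan()  # same rows as the first call

print(len(first_fragile), len(second_fragile))
print(first_safe == second_safe)
```

If something like this is happening, making the scan method free of side effects would make repeated calls harmless, regardless of how often it is invoked.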

How is the buildScan method called internally in a Spark custom data source?

0 answers: