Spark job is taking too long

Time: 2019-05-05 12:33:02

Tags: pyspark

I am trying to run a search task over 40 files, each containing 1M lines similar to the following:

T95X106M016CAAL T95X106M016CAAL
T95X106M016CAAS T95X106M016CAAS
T95X106M016CABL T95X106M016CABL
T95X106M016CABS T95X106M016CABS
T95X106M016CASL T95X106M016CASL
T95X106M016CASS T95X106M016CASS
T95X106M016CSAL T95X106M016CSAL
T95X106M016CSAS T95X106M016CSAS

I also have a separate keywords file containing 130k entries similar to the following:

CONNECTOR
PHYSIOTHERAPY
CONSULTANCY
PHYSIOTHERAPY
CONSULTANCY
DIGITAL
PIC18F26K80-H/MM
T95X106M016CABL

The job is expected to output those of the 40M lines that match a keyword listed in the keywords file.
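
The expected result can be sketched in plain Python as follows. This is only an illustration under one assumption: a line counts as a match when any of its whitespace-separated tokens is exactly equal to a keyword. The function name `matching_lines` is hypothetical; the sample rows and keywords are taken from the excerpts above.

```python
def matching_lines(lines, keywords):
    """Return the lines that contain at least one keyword as a whole token."""
    keyword_set = set(keywords)  # set membership is O(1) on average
    matched = []
    for line in lines:
        # Assumption: tokens are whitespace-separated and must equal a keyword exactly.
        if any(token in keyword_set for token in line.split()):
            matched.append(line)
    return matched


sample_lines = [
    "T95X106M016CAAL T95X106M016CAAL",
    "T95X106M016CAAS T95X106M016CAAS",
    "T95X106M016CABL T95X106M016CABL",
    "T95X106M016CABS T95X106M016CABS",
]
sample_keywords = ["CONNECTOR", "DIGITAL", "PIC18F26K80-H/MM", "T95X106M016CABL"]

# Only the row containing the keyword T95X106M016CABL is kept.
print(matching_lines(sample_lines, sample_keywords))
```

If substring or regex matching is what is actually needed, the token test would have to change accordingly.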

Code snippet:

def filterMatchRegex(pair):
    # pair is a (file path, file content) tuple, as produced by sc.wholeTextFiles
    f, text = pair

    # SERule is not defined in this snippet; it receives the full text of one file
    rule = SERule(txt=text)
    matched_lines = rule.get_matching_lines()

    # Return the file path together with a one-element list holding the matches
    return f, [matched_lines]


file_name_content_rdd_manual = sc.wholeTextFiles("hdfs://path/to/files").map(filterMatchRegex).take(1)
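
For comparison, here is a line-oriented sketch of the same search. `sc.wholeTextFiles` produces one record per file, so each of the 40 files is handled as a single large string, whereas `sc.textFile` produces one record per line and lets Spark split the input into many tasks. The sketch assumes exact whole-token matching and that the keyword set fits comfortably in memory on every executor. The names `filter_partition` and `run_job` and both path parameters are hypothetical, and this is not a drop-in replacement for whatever `SERule` does.

```python
def filter_partition(lines, keyword_set):
    """Yield the lines of one partition that contain a keyword as a whole token."""
    for line in lines:
        # Assumption: tokens are whitespace-separated and must equal a keyword exactly.
        if any(token in keyword_set for token in line.split()):
            yield line


def run_job(sc, files_path, keywords_path):
    """Wire the filter into Spark; sc is an existing SparkContext, paths are placeholders."""
    # Collect the keywords on the driver and drop blank entries.
    keywords = set(sc.textFile(keywords_path).map(lambda k: k.strip()).collect())
    keywords.discard("")

    # Broadcast the set so each executor receives one read-only copy.
    keywords_bc = sc.broadcast(keywords)

    # One record per line; each partition is filtered independently.
    lines = sc.textFile(files_path)
    return lines.mapPartitions(lambda part: filter_partition(part, keywords_bc.value))
```

Calling `run_job(sc, "hdfs://path/to/files", keywords_path).take(10)`, with `keywords_path` pointing at the keywords file, would return up to ten matching lines.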

The job has been running for a very long time (6 days) without producing any output. What is going wrong?

I run the job with the --master yarn --deploy-mode cluster options.

PyCharm only outputs...

19/05/05 14:34:31 INFO yarn.Client: Application report for application_1554300516095_0079 (state: RUNNING)
19/05/05 14:34:32 INFO yarn.Client: Application report for application_1554300516095_0079 (state: RUNNING)
19/05/05 14:34:33 INFO yarn.Client: Application report for application_1554300516095_0079 (state: RUNNING)
19/05/05 14:34:34 INFO yarn.Client: Application report for application_1554300516095_0079 (state: RUNNING)

0 answers:

No answers yet.