I am trying to run a search job. I have 40 files, each containing 1M lines similar to the following:
T95X106M016CAAL T95X106M016CAAL
T95X106M016CAAS T95X106M016CAAS
T95X106M016CABL T95X106M016CABL
T95X106M016CABS T95X106M016CABS
T95X106M016CASL T95X106M016CASL
T95X106M016CASS T95X106M016CASS
T95X106M016CSAL T95X106M016CSAL
T95X106M016CSAS T95X106M016CSAS
I also have a keyword file containing 130k lines similar to the following:
CONNECTOR
PHYSIOTHERAPY
CONSULTANCY
PHYSIOTHERAPY
CONSULTANCY
DIGITAL
PIC18F26K80-H/MM
T95X106M016CABL
The job is expected to output, from the 40M lines, those that match a keyword listed in the keyword file.
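As a minimal sketch of the matching described above, written in plain Python without Spark: it assumes a line counts as a match when any whitespace-separated token equals a keyword exactly. The post does not state the actual rule (the `SERule` class is not shown), so this is only an illustration, and the names `load_keywords` and `matching_lines` are hypothetical.

```python
def load_keywords(lines):
    """Build a de-duplicated keyword set from an iterable of raw lines."""
    return {line.strip() for line in lines if line.strip()}


def matching_lines(data_lines, keywords):
    """Yield data lines in which at least one whitespace-separated token is a keyword.

    Assumption: exact token equality, not substring or regex matching.
    """
    for line in data_lines:
        if any(token in keywords for token in line.split()):
            yield line.rstrip("\n")


# Sample values taken from the excerpts in the question.
keywords = load_keywords([
    "CONNECTOR", "PHYSIOTHERAPY", "CONSULTANCY", "PHYSIOTHERAPY",
    "CONSULTANCY", "DIGITAL", "PIC18F26K80-H/MM", "T95X106M016CABL",
])
data = [
    "T95X106M016CAAL T95X106M016CAAL",
    "T95X106M016CABL T95X106M016CABL",
    "T95X106M016CABS T95X106M016CABS",
]
result = list(matching_lines(data, keywords))
print(result)
```

With a set, each token lookup takes constant time on average, so the cost per data line does not grow with the 130k keywords.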
Code snippet from my job:
def filterMatchRegex(pair):
    matchedKeywords = list()
    f, text = pair
    # print(f, text)
    rule = SERule(txt=text)
    # print('rule is ', rule)
    matched_lines = rule.get_matching_lines()
    # print('matched_lines', matched_lines)
    matchedKeywords.append(matched_lines)
    return f, matchedKeywords

file_name_content_rdd_manual = sc.wholeTextFiles("hdfs://path/to/files").map(filterMatchRegex).take(1)
The job has been running for a long time (6 days) without producing any output. What is going wrong?
I run the job with the options --master yarn --deploy-mode cluster.
PyCharm only outputs:
19/05/05 14:34:31 INFO yarn.Client: Application report for application_1554300516095_0079 (state: RUNNING)
19/05/05 14:34:32 INFO yarn.Client: Application report for application_1554300516095_0079 (state: RUNNING)
19/05/05 14:34:33 INFO yarn.Client: Application report for application_1554300516095_0079 (state: RUNNING)
19/05/05 14:34:34 INFO yarn.Client: Application report for application_1554300516095_0079 (state: RUNNING)