Using PySpark to get all rows of one dataframe that share words with a given row of another dataframe

Asked: 2018-07-12 15:56:31

Tags: pyspark apache-spark-sql pyspark-sql

I have two dataframes: one contains the columns skill_id and skill_set, and the other contains jobtitle and job_description, as shown below.

skills dataframe:
    +---------+--------------------+
    | skill_id|           skill_set|
    +---------+--------------------+
    |100000001|python, numpy, pa...|
    |100000002|java, j2ee, hiber...|
    |100000003|c#, asp.net, .net...|
    |100000004|agile, product ba...|
    |100000005|deep-learning, py...|
    |100000006|database, oracle,...|
    |100000007|java, c, .net, da...|
    |100000008|html, html5, java...|
    |100000009|mongodb, expressj...|
    |100000010|jira, confluence,...|
    |100000011|automatic testing...|
    |100000012|mvp, mvvm, sdk, a...|
    |100000013|objective c, swif...|
    |100000014|codeigniter, php,...|
    +---------+--------------------+

descriptions dataframe:
    +--------------------+--------------------+
    |            jobtitle|     job_description|
    +--------------------+--------------------+
    |Python developer ...|this is tarannum ...|
    |java developer in...|experience with j...|
    |.net developer in...|design and develo...|
    |scrum master in g...|leading one or mo...|
    |data scientist fo...|must be proficien...|
    |data base adminis...|strong 3+ year ex...|
    |full stack develo...|12+ years of fron...|
    |ui/ux developer i...|html5, css, javas...|
    |mean stack develo...|hands on experien...|
    |devops engineer i...|drive the archite...|
    |testing engineer ...|seeking highly mo...|
    |android developer...|functional knowle...|
    |ios developer in ...|working knowledge...|
    |ios developer in ...|We are looking fo...|
    |python developer ...|Vast knowledge in...|
    |Python Developer ...|We are looking fo...|
    |Senior Java Devel...|We are looking fo...|
    |php developer at ...|CodeIgniter (Must...|
    +--------------------+--------------------+

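Now I want to take one row from the skills dataframe, for example 100000001, and compare its skill_set against every row of the descriptions dataframe. All rows of the descriptions dataframe that contain words in common with row 100000001 should be shown. I am looking for how to use PySpark to apply an intersection between a single row of one dataframe and all rows of another dataframe.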

I hope this makes sense. If you know of a similar example, please share a link.

Thanks
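
To make the requested comparison concrete, here is a minimal sketch of one possible approach, not a definitive solution. It assumes the two dataframes are named skills_df and descriptions_df (hypothetical names), that skill_set is a comma-separated string, and that a "word" is whatever remains after lower-casing and splitting on commas and whitespace. The helper names tokenize, shared_terms and matching_descriptions are made up for illustration.

```python
import re


def tokenize(text):
    """Lower-case a string and split it into a set of terms.

    Assumption: terms are separated by commas and/or whitespace, and
    trailing sentence punctuation is not part of a term.
    """
    if text is None:
        return set()
    terms = (t.rstrip(".;:!?") for t in re.split(r"[,\s]+", text.lower()))
    return {t for t in terms if t}


def shared_terms(skill_set, description):
    """Return the sorted list of terms present in both strings."""
    return sorted(tokenize(skill_set) & tokenize(description))


def matching_descriptions(skills_df, descriptions_df, skill_id):
    """Keep the description rows that share at least one term with one skill row.

    skills_df and descriptions_df are assumed to have the columns shown
    in the question (skill_id, skill_set, jobtitle, job_description).
    """
    # Imported here so the pure-Python helpers above work without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType

    # Pull the single skill row to the driver; it is only one small string.
    row = skills_df.filter(F.col("skill_id") == skill_id).select("skill_set").first()
    if row is None:
        raise ValueError("no skill row with id {}".format(skill_id))
    skill_set = row["skill_set"]

    # Python UDF that computes the shared terms for each description.
    common = F.udf(lambda d: shared_terms(skill_set, d), ArrayType(StringType()))

    return (
        descriptions_df
        .withColumn("common_terms", common(F.col("job_description")))
        .filter(F.size(F.col("common_terms")) > 0)
    )
```

Under these assumptions, matching_descriptions(skills_df, descriptions_df, 100000001).show() would list the matching job rows together with the shared terms. Two caveats: multi-word skills such as "objective c" are split into separate words, so very short terms can match too often, and a Python UDF is slower than built-in functions. On Spark 2.4 or later, split combined with array_intersect might replace the UDF.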

0 answers:

No answers