How to find matching keywords across two Hadoop tables using Spark?

Date: 2017-10-17 11:56:42

Tags: scala hadoop apache-spark hdfs

I have two tables in HDFS. One table (Table-1) holds keywords, as shown below; each row of Table-1 can contain more than one keyword. The other table (Table-2) has a text column. For each row of Table-2, I need to find all the Table-1 keywords that match its text, and output the list of matched keywords.

Example:

Table 1:

ID  | Name    | Age | City | Gender
-----------------------------------
111 | Micheal | 19  | NY   | male
222 | George  | 23  | CA   | male
333 | Linda   | 22  | LA   | female

Table 2:

Text_Description
------------------------------------------------------------------------
1-Linda and my cousin left the house.
2-Michael who is 19 year old, and George are going to rock concert in CA.
3-Shopping card is ready at the NY for male persons.

Output:

For each row of Table 2, the list of Table 1 keywords found in its text (e.g. the first row should yield Linda).

0 Answers:

There are no answers yet.
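
Since the question drew no answers, here is a minimal sketch of one possible approach in Spark (Scala, matching the question's tags), not taken from the original post: flatten Table 1's fields into a broadcast keyword set, then match each Table 2 row's tokens against it with a UDF. The HDFS paths, CSV input format, and the object name KeywordMatch are assumptions; the column names come from the sample tables above.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object KeywordMatch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("KeywordMatch").getOrCreate()
    import spark.implicits._

    // Assumed locations and format; replace with the real HDFS paths.
    val table1 = spark.read.option("header", "true").csv("hdfs:///data/table1")
    val table2 = spark.read.option("header", "true").csv("hdfs:///data/table2")

    // Flatten the keyword fields of Table 1 into one distinct list and
    // broadcast it; this assumes the keyword set fits in driver memory.
    val keywords: Array[String] = table1
      .select($"Name", $"Age", $"City", $"Gender")
      .as[(String, String, String, String)]
      .flatMap { case (n, a, c, g) => Seq(n, a, c, g) }
      .filter(_ != null)
      .distinct()
      .collect()
    val kw = spark.sparkContext.broadcast(keywords.toSet)

    // Tokenize each text row on non-word characters and keep the tokens
    // that are keywords; this does exact, whole-word matching only.
    val findKeywords = udf { text: String =>
      if (text == null) Seq.empty[String]
      else {
        val tokens = text.split("\\W+").toSet
        kw.value.intersect(tokens).toSeq
      }
    }

    table2
      .withColumn("Keyword_list", findKeywords($"Text_Description"))
      .show(truncate = false)

    spark.stop()
  }
}

With the sample rows above, exact matching would return Linda for the first text; 19, George, and CA for the second (Michael in the text does not match the keyword Micheal exactly); and NY and male for the third. If the keyword set is too large to broadcast, an alternative is to explode the tokenized text column and join it against a keyword DataFrame instead.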