I want to implement a simple search engine using Hadoop.
So I created an inverted index with the Hadoop streaming API and bash, which produces an output file like the following:
ab (744 1) 1
abbrevi (122 1) 1
abil (51 1) (77 1) (738 1) 3
abl (99 1) (132 1) (536 1) (581 1) (695 1) (763 1) (908 1) (914 1) (986 1) (1114 2) 10
ablat (82 2) (274 2) (553 7) (587 1) (1065 3) (1096 2) (1097 7) (1098 3) (1099 4) (1100 4) (1101 3) (1226 3) (1241 3) (1279 1) 14
about (27 1) (32 1) (39 1) (46 1) (49 2) (56 1) (57 1) (69 2) (77 2) (81 2) (83 2) (113 1) (134 1) (139 2) (140 1) (155 1) (156 2) (162 1) (163 1) (165 2) (171 1) (174 1) (177 1) (193 5) (205 1) (206 3) (212 1) (216 3) (218 1) (225 2) (249 3) (255 1) (257 1) (262 1) (266 3) (272 6) (273 1) (285 1) (292 2) (313 1) (315 2) (346 2) (368 1) (370 1) (371 1) (372 1) (373 1) (381 2) (391 1) (410 3) (420 1) (452 1) (456 4) (469 1) (479 1) (489 1) (498 3) (511 1) (518 1) (531 1) (536 1) (548 1) (555 1) (556 1) (560 2) (565 1) (567 1) (572 1) (575 1) (577 1) (589 1) (601 1) (603 1) (610 1) (612 1) (614 1) (620 1) (621 4) (625 3) (626 1) (646 1) (649 1) (651 2) (657 2) (662 1) (679 1) (685 2) (686 1) (704 2) (706 2) (709 1) (717 2) (721 1) (740 2) (757 2) (759 1) (774 1) (786 1) (792 2) (793 1) (794 2) (796 2) (801 2) (805 1) (806 1) (807 2) (808 2) (811 1) (815 1) (816 1) (829 2) (844 1) (869 1) (876 1) (912 1) (917 1) (921 1) (927 1) (928 2) (958 1) (976 6) (991 1) (992 2) (993 1) (994 1) (996 1) (999 1) (1000 1) (1002 1) (1004 2) (1006 1) (1040 1) (1092 1) (1095 2) (1104 4) (1105 1) (1115 1) (1143 4) (1156 2) (1162 1) (1164 3) (1165 1) (1166 3) (1169 1) (1191 1) (1194 1) (1202 1) (1209 1) (1212 1) (1218 1) (1223 1) (1224 1) (1229 1) (1230 1) (1231 1) (1239 1) (1241 1) (1244 1) (1246 1) (1248 1) (1255 2) (1262 1) (1275 2) (1282 1) (1303 1) (1304 1) (1307 1) (1310 3) (1316 1) (1335 1) (1341 1) (1344 1) (1345 1) (1353 1) (1354 3) (1355 1) (1363 1) (1377 1) 178
This means that, for example, the word ab occurs exactly once in document number 744.
Now I want to implement AND query searching with the Hadoop streaming API
(meaning a matching document has to contain every word of the query).
So what exactly should the map and reduce phases of the search be? And could you give me some hints on how to implement it with the streaming API (what should the input be)? I'm not sure how to go about it.
Thanks
Answer 0 (score: 1)
Here is my take on your query-search problem. I will only outline what should be done rather than give you code (my bash skills are a bit rusty anyway).
Job setup
First, tokenize the query and put the list of tokens into a configuration value as a comma-separated list. You could do this on the mapper/reducer side if you prefer, but I would recommend centralizing it in the job setup.
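For example, a streaming job could hand the tokenized query to both scripts through -cmdenv. This is only a rough sketch; the jar location, HDFS paths, script names and the QUERY_TOKENS variable are assumptions, not something fixed by your setup:

    # Sketch: tokenize the query into a comma-separated list, expose it to
    # mapper and reducer via the environment, then run the job over the index.
    # (Jar path, HDFS paths and script names are placeholders.)
    QUERY="ablat about"
    QUERY_TOKENS=$(echo "$QUERY" | tr ' ' ',')

    hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
        -cmdenv QUERY_TOKENS="$QUERY_TOKENS" \
        -input  /index \
        -output /search-results \
        -mapper  search_mapper.sh \
        -reducer search_reducer.sh \
        -file search_mapper.sh \
        -file search_reducer.sh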
Mapper
Read the configuration value holding the query and turn it into a set (or some other structure with fast key lookup).
The mapper then processes each line (one word mapped to n documents): if the word on the current line is in the query set, "emit" it. This stage should emit the document id as the key and the word as the value (which creates n output records per matching word, where n is the number of documents that word occurs in).
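A minimal bash mapper along those lines could look like the sketch below. It assumes the QUERY_TOKENS variable from the job setup above and index lines of the form "word (doc freq) (doc freq) ... total"; treat it as an illustration, not a drop-in script:

    #!/bin/bash
    # search_mapper.sh (sketch): keep only lines whose word is in the query,
    # then emit one "doc_id<TAB>word" record per posting on that line.
    while read -r line; do
        word=$(echo "$line" | awk '{print $1}')
        # Skip words that are not part of the query.
        case ",$QUERY_TOKENS," in
            *",$word,"*) ;;
            *) continue ;;
        esac
        # Each posting looks like "(744 1)"; the first number is the doc id.
        echo "$line" | grep -o '([0-9]* [0-9]*)' | while read -r posting; do
            doc_id=$(echo "$posting" | tr -d '()' | awk '{print $1}')
            printf '%s\t%s\n' "$doc_id" "$word"
        done
    done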
Reducer
The reducer then receives a document id as the key and, as values, the query tokens that matched this document. Read the configuration value again and simply check whether you received all of the query's tokens for this document.
You should emit the document id as the key and, as is usual in search, some "match score" as the value. In your case you are only looking for "complete" matches, so the score does not really matter, since it will be a constant.
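A matching reducer sketch, under the same assumptions (QUERY_TOKENS in the environment, tab-separated doc_id/word records arriving sorted by document id), could be:

    #!/bin/bash
    # search_reducer.sh (sketch): a document matches the AND query if it
    # contributed every distinct query token.
    n_query=$(echo "$QUERY_TOKENS" | tr ',' '\n' | sort -u | grep -c .)

    current_doc=""
    words=""

    flush() {
        # Emit the document only if all distinct query words were seen.
        if [ -n "$current_doc" ]; then
            n_found=$(echo "$words" | tr ' ' '\n' | sort -u | grep -c .)
            if [ "$n_found" -eq "$n_query" ]; then
                printf '%s\t1\n' "$current_doc"   # constant "match score"
            fi
        fi
    }

    while IFS=$'\t' read -r doc_id word; do
        if [ "$doc_id" != "$current_doc" ]; then
            flush
            current_doc="$doc_id"
            words=""
        fi
        words="$words $word"
    done
    flush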
Some improvements
Once this works, think about some improvements. In this design the mapper emits every matching token as its own record; do you really need them as separate records? Maybe you can use a combiner to save some network bandwidth.
I will leave that as an exercise for the reader ;-)