我正在编写一个python搜索引擎,它读取一个笑话列表,对单词进行标记,创建索引(键:单词,值:单词出现的文档的文档ID),然后搜索索引。标记化和搜索方法工作正常,但我的索引方法给了我一些问题,我无法弄清楚原因。
码
def create_index(tokens):
lst = -1
duplicates =[]
index = {}
docnum = len(tokens)
for list in tokens:
lst += 1
xst = -1
for x in list:
xst += 1
element = tokens[lst][xst]
value = ""
elin = 0
if element not in duplicates:
for r in range(docnum):
if element in tokens[r]:
if elin > 0:
value += " "
value += "%d" % r
elin += 1
print(tokens[r])
print(r)
if len(value) > 1:
index[element] = value.split()
duplicates += element
else:
index[element] = value
duplicates += element
return index
打印语句就在那里,我可以看到问题所在,实际上我看到for循环中的某些r值被完全跳过。我很确定我没有修改我正在迭代的内容,我相信通常会导致循环跳过。
输入(令牌)
[['What', 'did', 'the', 'little', 'boy', 'tell', 'the', 'game', 'warden', 'His', 'dad', 'was', 'in', 'the', 'kitchen', 'poaching', 'eggs'], ['What', 'do', 'you', 'call', 'a', 'chicken', 'crossing', 'the', 'road', 'Poultry', 'in', 'motion'], ['What', 'do', 'you', 'call', 'it', 'when', 'a', 'cat', 'sues', 'another', 'cat', 'A', 'Clawsuit'], ['What', 'does', 'an', 'envelope', 'say', 'when', 'you', 'lick', 'it', 'Nothing', 'It', 'just', 'shuts', 'up'], ['How', 'can', 'you', 'tell', 'the', 'ocean', 'is', 'friendly', 'It', 'waves'], ['What', 's', 'black', 'white', 'green', 'and', 'bumpy', 'A', 'pickle', 'wearing', 'a', 'tuxedo'], ['When', 'was', 'meat', 'so', 'high', 'When', 'the', 'cow', 'jumped', 'over', 'the', 'moon'], ['What', 'happened', 'to', 'the', 'wind', 'It', 'blew', 'away'], ['What', 'starts', 'with', 'T', 'is', 'full', 'of', 'T', 'and', 'ends', 'with', 'T', 'A', 'teapot'], ['What', 'is', 'a', 'hermit', 'A', 'girl', 's', 'baseball', 'glove'], ['What', 'does', 'a', 'television', 'have', 'in', 'common', 'with', 'a', 'rabbit', 'His', 'ears'], ['What', 'did', 'the', 'crop', 'say', 'to', 'the', 'farmer', 'Why', 'are', 'you', 'always', 'picking', 'on', 'me'], ['What', 'did', 'the', 'guy', 'say', 'when', 'he', 'walked', 'into', 'the', 'bar', 'Ouch'], ['How', 'is', 'a', 'locksmith', 'like', 'a', 'typewritter', 'They', 'both', 'have', 'a', 'lot', 'of', 'keys'], ['What', 'has', 'four', 'legs', 'and', 'goes', 'booo', 'A', 'cow', 'with', 'a', 'cold'], ['What', 'is', 'a', 'caterpillar', 'afraid', 'of', 'A', 'dogerpillar'], ['Who', 'has', 'the', 'strongest', 'underwear', 'Arnold', 'Shortsineger'], ['Why', 'did', 'the', 'elephant', 'decide', 'not', 'to', 'move', 'Because', 'he', 'couldn', 't', 'lift', 'his', 'trunk'], ['Which', 'are', 'the', 'stronger', 'days', 'of', 'the', 'week', 'Saturday', 'and', 'Sunday', 'The', 'rest', 'are', 'weekdays'], ['Which', 'runs', 'faster', 'hot', 'or', 'cold', 'Hot', 'Everyone', 'can', 'catch', 'a', 'cold'], ['Why', 'did', 'the', 'strawberry', 'cross', 'the', 'road', 'Because', 'his', 'mother', 'was', 'in', 'a', 'jam'], ['How', 'do', 'you', 'keep', 'a', 'lion', 'from', 'charging', 'Take', 'away', 'its', 'credit', 'cards'], ['What', 'do', 'you', 'call', 'a', 'cow', 'with', 'a', 'twitch', 'Beef', 'jerky'], ['What', 'is', 'the', 'best', 'way', 'to', 'keep', 'water', 'from', 'running', 'Don', 't', 'pay', 'the', 'water', 'bill'], ['Where', 'do', 'cows', 'go', 'to', 'have', 'fun', 'The', 'moovies'], ['What', 'time', 'was', 'is', 'when', 'the', 'elephant', 'sat', 'on', 'a', 'chair', 'Time', 'to', 'get', 'a', 'new', 'chair'], ['What', 'did', 'the', 'flower', 'say', 'to', 'the', 'bike', 'Petal'], ['Why', 'do', 'we', 'not', 'tell', 'secrets', 'in', 'a', 'corn', 'patch', 'Too', 'many', 'ears'], ['What', 's', 'yellow', 'and', 'writes', 'A', 'ball', 'point', 'banana'], ['Did', 'people', 'laugh', 'when', 'the', 'lady', 'fell', 'on', 'the', 'ice', 'No', 'but', 'the', 'ice', 'cracked', 'up'], ['What', 'word', 'is', 'always', 'spelled', 'incorrectly', 'Incorrectly'], ['What', 'should', 'you', 'do', 'if', 'your', 'dog', 'is', 'missing', 'Check', 'the', 'Lost', 'and', 'Hound'], ['What', 'have', 'you', 'seen', 'that', 'you', 'will', 'never', 'see', 'again', 'Yesterday'], ['Why', 'did', 'the', 'turkey', 'cross', 'the', 'road', 'To', 'get', 'to', 'the', 'chicken'], ['Why', 'was', 'Cinderella', 'Late', 'for', 'the', 'ball', 'She', 'forgot', 'to', 'swing', 'the', 'bat'], ['How', 'can', 'you', 'tell', 'there', 's', 'a', 'hippo', 'in', 'your', 'oven', 'The', 'oven', 'door', 'won', 't', 'close'], ['What', 'do', 'you', 'get', 'when', 'you', 'cross', 'a', 'shark', 'and', 'Flipper', 'A', 'fat', 'shark'], ['What', 'did', 'the', 'lamp', 'say', 'to', 'the', 'other', 'lamp', 'You', 'turn', 'me', 'on'], ['What', 'did', 'the', 'sidewalk', 'do', 'when', 'he', 'heard', 'a', 'funny', 'joke', 'He', 'cracked', 'up'], ['Why', 'did', 'the', 'chicken', 'cross', 'the', 'road', 'He', 'wanted', 'to', 'prove', 'to', 'the', 'armadillo', 'that', 'it', 'could', 'be', 'done'], ['Why', 'did', 'the', 'dalmation', 'need', 'glasses', 'He', 'was', 'seeing', 'spots'], ['What', 'did', 'the', 'book', 'say', 'to', 'the', 'page', 'Don', 't', 'turn', 'away', 'from', 'me'], ['What', 'do', 'you', 'call', 'a', 'pig', 'in', 'a', 'butcher', 'shop', 'A', 'pork', 'chop'], ['In', 'France', 'what', 'do', 'frogs', 'eat', 'French', 'Flies'], ['What', 'is', 'yellow', 'and', 'wears', 'a', 'mask', 'The', 'Lone', 'Lemon'], ['What', 'do', 'astronauts', 'eat', 'for', 'dinner', 'Launch', 'meat'], ['Why', 'is', 'the', 'little', 'duck', 'always', 'so', 'sad', 'Because', 'he', 'always', 'sees', 'a', 'bill', 'in', 'front', 'of', 'his', 'face'], ['What', 'did', 'they', 'digital', 'clock', 'say', 'to', 'its', 'mom', 'Look', 'mom', 'no', 'hands'], ['What', 'is', 'the', 'best', 'way', 'to', 'raise', 'a', 'child', 'In', 'an', 'elevator'], ['What', 'is', 'always', 'behind', 'the', 'time', 'The', 'back', 'of', 'the', 'clock'], ['What', 's', 'green', 'and', 'sings', 'Elvis', 'Parsley'], ['What', 'has', '10', 'letters', 'and', 'starts', 'with', 'gas', 'An', 'automobile'], ['Why', 'did', 'they', 'bury', 'the', 'battery', 'Because', 'it', 'was', 'dead'], ['Why', 'did', 'the', 'chicken', 'go', 'to', 'the', 'library', 'To', 'check', 'out', 'a', 'bawk', 'bawk', 'bawkbawk'], ['If', 'a', 'woodchuck', 'had', 'a', 'name', 'what', 'would', 'it', 'be', 'Chuck', 'Wood'], ['How', 'do', 'you', 'catch', 'a', 'squirrel', 'Climb', 'a', 'tree', 'and', 'act', 'like', 'a', 'nut'], ['What', 'did', 'one', 'penny', 'say', 'to', 'the', 'other', 'Let', 's', 'get', 'together', 'and', 'make', 'some', 'sense'], ['Why', 'don', 't', 'lobsters', 'share', 'Because', 'they', 'are', 'shellfish'], ['Why', 'does', 'the', 'man', 'wish', 'he', 'could', 'be', 'a', 'guitar', 'player', 'in', 'a', 'room', 'full', 'of', 'beautiful', 'girls', 'Because', 'if', 'he', 'was', 'a', 'guitar', 'player', 'he', 'would', 'have', 'his', 'pick'], ['What', 'is', 'a', 'cat', 's', 'favorite', 'color', 'Purrple'], ['Why', 'did', 'the', 'ghost', 'float', 'across', 'the', 'road', 'Because', 'he', 'couldn', 't', 'walk'], ['What', 'kind', 'of', 'star', 'could', 'hurt', 'you', 'A', 'shooting', 'star']]
输出(缩短)
['What', 'did', 'the', 'little', 'boy', 'tell', 'the', 'game', 'warden', 'His', 'dad', 'was', 'in', 'the', 'kitchen', 'poaching', 'eggs']
0
['What', 'do', 'you', 'call', 'a', 'chicken', 'crossing', 'the', 'road', 'Poultry', 'in', 'motion']
1
['What', 'do', 'you', 'call', 'it', 'when', 'a', 'cat', 'sues', 'another', 'cat', 'A', 'Clawsuit']
2
['What', 'does', 'an', 'envelope', 'say', 'when', 'you', 'lick', 'it', 'Nothing', 'It', 'just', 'shuts', 'up']
3
['What', 's', 'black', 'white', 'green', 'and', 'bumpy', 'A', 'pickle', 'wearing', 'a', 'tuxedo']
5
['What', 'happened', 'to', 'the', 'wind', 'It', 'blew', 'away']
7
['What', 'starts', 'with', 'T', 'is', 'full', 'of', 'T', 'and', 'ends', 'with', 'T', 'A', 'teapot']
8
['What', 'is', 'a', 'hermit', 'A', 'girl', 's', 'baseball', 'glove']
9
['What', 'does', 'a', 'television', 'have', 'in', 'common', 'with', 'a', 'rabbit', 'His', 'ears']
10
['What', 'did', 'the', 'crop', 'say', 'to', 'the', 'farmer', 'Why', 'are', 'you', 'always', 'picking', 'on', 'me']
11
['What', 'did', 'the', 'guy', 'say', 'when', 'he', 'walked', 'into', 'the', 'bar', 'Ouch']
12
['What', 'has', 'four', 'legs', 'and', 'goes', 'booo', 'A', 'cow', 'with', 'a', 'cold']
14
['What', 'is', 'a', 'caterpillar', 'afraid', 'of', 'A', 'dogerpillar']
15
['What', 'do', 'you', 'call', 'a', 'cow', 'with', 'a', 'twitch', 'Beef', 'jerky']
22
['What', 'is', 'the', 'best', 'way', 'to', 'keep', 'water', 'from', 'running', 'Don', 't', 'pay', 'the', 'water', 'bill']
23
['What', 'time', 'was', 'is', 'when', 'the', 'elephant', 'sat', 'on', 'a', 'chair', 'Time', 'to', 'get', 'a', 'new', 'chair']
25
['What', 'did', 'the', 'flower', 'say', 'to', 'the', 'bike', 'Petal']
26
['What', 's', 'yellow', 'and', 'writes', 'A', 'ball', 'point', 'banana']
28
['What', 'word', 'is', 'always', 'spelled', 'incorrectly', 'Incorrectly']
30
输入是一个列表列表,我们的想法是搜索并将每个唯一的单词作为一个键放入,其中的值是它出现的文档。但是生成的索引
索引
{'walked': ['12'], 'across': ['60'], 'incorrectly': ['30'], 'Nothing': '3', 'keys': ['13'], 'frogs': ['43'], 'meat': ['6', '45'], 'be': ['39', '54', '58'], 'mother': ['20'], 'Petal': ['26'], 'typewritter': ['13'], 'see': ['32'], 'keep': ['21', '23'], 'jam': ['20'], 'black': '5', 'lick': '3', 'girls': ['58'], 'sues': '2', 'sidewalk': ['38'], 'book': ['41'], 'strongest': ['16'], 'shop': ['42'], 'favorite': ['59'], 'A': ['2', '5', '8', '9', '14', '15', '28', '36', '42', '61'], 'seen': ['32'], 'Did': ['29'], 'glove': '9', 'hands': ['47'], 'room': ['58'], 'patch': ['27'], 'T': '8', 'legs': ['14'], 'hippo': ['35'], 'his': ['17', '20', '46', '58'], 'cows': ['24'], 'new': ['25'], 'cards': ['21'], 'you': ['1', '2', '3', '4', '11', '21', '22', '31', '32', '35', '36', '42', '55', '61'], 'Look': ['47'], 'tree': ['55'], 'tell': ['0', '4', '27', '35'], 'missing': ['31'], 'front': ['46'], 'lion': ['21'], 'automobile': ['51'], 'eggs': '0', 'spots': ['40'], 'water': ['23'], 'full': ['8', '58'], 'moon': '6', 'need': ['40'], 'When': '6', 'moovies': ['24'], 'rest': ['18'], 'cat': ['2', '59'], 'boy': '0', 'fell': ['29'], 'hermit': '9', 'dogerpillar': ['15'], 'sings': ['50'], 'couldn': ['17', '60'], 'dad': '0', 'clock': ['47', '49'], 'turkey': ['33'], 'sat': ['25'], 'He': ['38', '39', '40'], 'bike': ['26'], 'seeing': ['40'], 'elevator': ['48'], 'Parsley': ['50'], 'secrets': ['27'], 'They': ['13'], 'flower': ['26'], 'both': ['13'], 'close': ['35'], 'bill': ['23', '46'], 'crossing': '1', 'glasses': ['40'], 'bawkbawk': ['53'], 'many': ['27'], 'penny': ['56'], 'guy': ['12'], 'chop': ['42'], 'had': ['54'], 'oven': ['35'], 'are': ['11', '18', '57'], 'jerky': ['22'], 'some': ['56'], 'Launch': ['45'], 'Ouch': ['12'], 'An': ['51'], 'Where': ['24'], 'caterpillar': ['15'], 'faster': ['19'], 'do': ['1', '2', '21', '22', '24', '27', '31', '36', '38', '42', '43', '45', '55'], 'lamp': ['37'], 'white': '5', 'Hot': ['19'], 'friendly': '4', 'pick': ['58'], 'me': ['11', '37', '41'], 'high': '6', 'always': ['11', '30', '46', '49'], 'crop': ['11'], 'face': ['46'], 'In': ['43', '48'], 'sad': ['46'], 'spelled': ['30'], 'poaching': '0', 'color': ['59'], 'twitch': ['22'], 'To': ['33', '53'], 'door': ['35'], 'Lemon': ['44'], 'jumped': '6', 'lobsters': ['57'], 'raise': ['48'], 'armadillo': ['39'], 'locksmith': ['13'], 'bawk': ['53'], 'It': ['3', '4', '7'], 'astronauts': ['45'], 'Flies': ['43'], 'Too': ['27'], 'digital': ['47'], 'Because': ['17', '20', '46', '52', '57', '58', '60'], 'joke': ['38'], 'ocean': '4', 'make': ['56'], 'hurt': ['61'], 'go': ['24', '53'], 'four': ['14'], 'with': ['8', '10', '14', '22', '51'], 'You': ['37'], 'laugh': ['29'], 'Don': ['23', '41'], 'Hound': ['31'], 'tuxedo': '5', 'farmer': ['11'], 'yellow': ['28', '44'], 'that': ['32', '39'], 'Wood': ['54'], 'if': ['31', '58'], 'eat': ['43', '45'], 'star': ['61'], 'happened': '7', 'catch': ['19', '55'], 'again': ['32'], 'check': ['53'], 'trunk': ['17'], 'together': ['56'], 'what': ['43', '54'], 'ice': ['29'], 'cross': ['20', '33', '36', '39'], 'banana': ['28'], 'mask': ['44'], 'Time': ['25'], 'Incorrectly': ['30'], 'underwear': ['16'], 'goes': ['14'], 'Cinderella': ['34'], 'they': ['47', '52', '57'], 'does': ['3', '10', '58'], 'child': ['48'], 'is': ['4', '8', '9', '13', '15', '23', '25', '30', '31', '44', '46', '48', '49', '59'], 'lady': ['29'], 'game': '0', 'the': ['0', '1', '4', '6', '7', '11', '12', '16', '17', '18', '20', '23', '25', '26', '29', '31', '33', '34', '37', '38', '39', '40', '41', '46', '48', '49', '52', '53', '56', '58', '60'], 'chicken': ['1', '33', '39', '53'], 'float': ['60'], 'His': ['0', '10'], 'mom': ['47'], '10': ['51'], 'people': ['29'], 'elephant': ['17', '25'], 'one': ['56'], 'will': ['32'], 'wish': ['58'], 'television': ['10'], 'can': ['4', '19', '35'], 'ends': '8', 'duck': ['46'], 'fat': ['36'], 'chair': ['25'], 'Clawsuit': '2', 'kind': ['61'], 'common': ['10'], 'its': ['21', '47'], 'wind': '7', 'shooting': ['61'], 'best': ['23', '48'], 'time': ['25', '49'], 'Poultry': '1', 'no': ['47'], 'not': ['17', '27'], 'picking': ['11'], 'won': ['35'], 'sees': ['46'], 'on': ['11', '25', '29', '37'], 'funny': ['38'], 'pig': ['42'], 'of': ['8', '13', '15', '18', '46', '49', '58', '61'], 'behind': ['49'], 'an': ['3', '48'], 'little': ['0', '46'], 'word': ['30'], 'point': ['28'], 'get': ['25', '33', '36', '56'], 'don': ['57'], 'could': ['39', '58', '61'], 'in': ['0', '1', '10', '20', '27', '35', '42', '46', '58'], 'away': ['7', '21', '41'], 'Everyone': ['19'], 'Let': ['56'], 'swing': ['34'], 'it': ['2', '3', '39', '52', '54'], 'Late': ['34'], 'way': ['23', '48'], 'rabbit': ['10'], 'to': ['7', '11', '17', '23', '24', '25', '26', '33', '34', '37', '39', '41', '47', '48', '53', '56'], 'have': ['10', '13', '24', '32', '58'], 'just': '3', 'heard': ['38'], 'over': '6', 'shark': ['36'], 'so': ['6', '46'], 'teapot': '8', 'strawberry': ['20'], 'from': ['21', '23', '41'], 'call': ['1', '2', '22', '42'], 'Lone': ['44'], 'weekdays': ['18'], 'ghost': ['60'], 'bumpy': '5', 'or': ['19'], 'girl': '9', 'Chuck': ['54'], 'guitar': ['58'], 'we': ['27'], 'Why': ['11', '17', '20', '27', '33', '34', '39', '40', '46', '52', '53', '57', '58', '60'], 'bat': ['34'], 'there': ['35'], 'The': ['18', '24', '35', '44', '49'], 'your': ['31', '35'], 'would': ['54', '58'], 'act': ['55'], 'writes': ['28'], 'No': ['29'], 'pickle': '5', 'beautiful': ['58'], 'running': ['23'], 'corn': ['27'], 'week': ['18'], 'wears': ['44'], 'French': ['43'], 'walk': ['60'], 'name': ['54'], 'sense': ['56'], 'never': ['32'], 'Take': ['21'], 'out': ['53'], 'woodchuck': ['54'], 'pay': ['23'], 'shellfish': ['57'], 'cold': ['14', '19'], 'letters': ['51'], 'butcher': ['42'], 'another': '2', 'runs': ['19'], 'ball': ['28', '34'], 'Sunday': ['18'], 'up': ['3', '29', '38'], 'like': ['13', '55'], 'page': ['41'], 'man': ['58'], 'If': ['54'], 'Which': ['18', '19'], 'nut': ['55'], 'share': ['57'], 'say': ['3', '11', '12', '26', '37', '41', '47', '56'], 'How': ['4', '13', '21', '35', '55'], 'Beef': ['22'], 'wanted': ['39'], 'dalmation': ['40'], 'baseball': '9', 'dinner': ['45'], 'ears': ['10', '27'], 'What': ['0', '1', '2', '3', '5', '7', '8', '9', '10', '11', '12', '14', '15', '22', '23', '25', '26', '28', '30', '31', '32', '36', '37', '38', '41', '42', '44', '45', '47', '48', '49', '50', '51', '56', '59', '61'], 'has': ['14', '16', '51'], 'prove': ['39'], 'Lost': ['31'], 'road': ['1', '20', '33', '39', '60'], 'green': ['5', '50'], 'wearing': '5', 'fun': ['24'], 'move': ['17'], 'cracked': ['29', '38'], 'shuts': '3', 'was': ['0', '6', '20', '25', '34', '40', '52', '58'], 'blew': '7', 'stronger': ['18'], 'Climb': ['55'], 'motion': '1', 'decide': ['17'], 'kitchen': '0', 'turn': ['37', '41'], 'cow': ['6', '14', '22'], 'he': ['12', '17', '38', '46', '58', '60'], 'days': ['18'], 'Yesterday': ['32'], 'envelope': '3', 'Arnold': ['16'], 'hot': ['19'], 'waves': '4', 'She': ['34'], 'bury': ['52'], 'pork': ['42'], 'afraid': ['15'], 'when': ['2', '3', '12', '25', '29', '36', '38'], 'other': ['37', '56'], 'warden': '0', 'did': ['0', '11', '12', '17', '20', '26', '33', '37', '38', '39', '40', '41', '47', '52', '53', '56', '60'], 'Purrple': ['59'], 'forgot': ['34'], 'Check': ['31'], 'charging': ['21'], 'and': ['5', '8', '14', '18', '28', '31', '36', '44', '50', '51', '55', '56'], 'player': ['58'], 'gas': ['51'], 'for': ['34', '45'], 'dog': ['31'], 'Flipper': ['36'], 'Elvis': ['50'], 'library': ['53'], 'credit': ['21'], 'Shortsineger': ['16'], 'but': ['29'], 'battery': ['52'], 'lot': ['13'], 'lift': ['17'], 'booo': ['14'], 'Who': ['16'], 'Saturday': ['18'], 'back': ['49'], 'should': ['31'], 'starts': ['8', '51'], 'done': ['39'], 'France': ['43'], 'squirrel': ['55'], 'into': ['12'], 'dead': ['52'], 'bar': ['12']}
不正确,因为我的for循环正在跳过步骤。循环的数量是否可能与循环混乱?添加两个打印语句进行调试后,输出长度为11373行。如果是这样,有人可以推荐更好的选择吗?如果我错过了一个明显的答案,请道歉。
答案 0 :(得分:1)
我不确定我是否理解正确,但是:
from collections import defaultdict
def create_index(tokens):
index = defaultdict(list)
for docid, ls in enumerate(tokens):
for word in set(ls):
index[word].append(docid)
return index
这使index
字典从单词到文档ID列表:
>>> tokens = ["this is a test".split(), "that is another test".split()]
>>> create_index(tokens)
defaultdict(<class 'list'>, {'test': [0, 1], 'is': [0, 1], 'this': [0], 'that': [1], 'a': [0], 'another': [1]})