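I am trying to scrape suggestions from a search bar. For example, when I search for apple, 5 suggestions appear that list other fruits. I want to iterate over that list of suggestions and collect the further suggestions listed for each of those 5 suggested fruits. From the new list of suggestions, I want to check that there are no duplicates, then visit the URLs of the new suggestions, and keep doing this until every term in the suggestion list has been visited.
For example, I search for apple. Suppose the suggestions are [banana, pear, orange, grape, peach].
I then visit each suggestion to get new suggestions. Say they are: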
| **Suggested Term** | **New Suggestions** |
| --- | --- |
| banana | [apple, orange, blueberry, strawberry] |
| pear | [peach, plum, grape] |
| orange | [lemon, grapefruit, lime, tangerine] |
| grape | [blueberry, strawberry, blackberry, cherry] |
| peach | [nectarine, plum, apricot] |
As you can see, the new suggestions contain duplicates. I would check for duplicates, remove them, and then append the new search terms to the original list that I want to keep iterating over.
For example, the new suggestion list is new_suggestions = [apple, orange, blueberry, strawberry, peach, plum, grape, lemon, grapefruit, lime, tangerine, blueberry, strawberry, blackberry, cherry, nectarine, plum, apricot]
After cross-checking against the terms I already have (the search term plus the original suggestions), I remove [apple, orange, grape, peach]
to get new_suggestions = [blueberry, strawberry, plum, lemon, grapefruit, lime, tangerine, blueberry, strawberry, blackberry, cherry, nectarine, plum, apricot]
Then I remove the duplicates within new_suggestions to get: new_suggestions = [blueberry, strawberry, plum, lemon, grapefruit, lime, tangerine, blackberry, cherry, nectarine, apricot]
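The two filtering steps just described (drop terms that are already known, then drop repeats) can be done in a single pass. The following is only a minimal sketch using the fruit names from the example; the helper name `unseen_in_order` is made up for illustration and is not part of my scraper.

```python
# Fruit names are taken from the example above.
original = ["apple", "banana", "pear", "orange", "grape", "peach"]
new_suggestions = [
    "apple", "orange", "blueberry", "strawberry", "peach", "plum", "grape",
    "lemon", "grapefruit", "lime", "tangerine", "blueberry", "strawberry",
    "blackberry", "cherry", "nectarine", "plum", "apricot",
]


def unseen_in_order(candidates, already_seen):
    """Return candidates that are not in already_seen, without repeats,
    keeping the order in which each term first appears."""
    seen = set(already_seen)  # set membership checks are O(1) on average
    fresh = []
    for term in candidates:
        if term not in seen:
            seen.add(term)  # also blocks later repeats of this term
            fresh.append(term)
    return fresh


fresh = unseen_in_order(new_suggestions, original)
print(fresh)
```

Using a set for the membership check avoids rescanning the whole list for every term, which starts to matter once the list of visited terms grows.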
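I append the new suggestions to the original list of suggestion terms to get
[apple, banana, pear, orange, grape, peach, blueberry, strawberry, plum, lemon, grapefruit, lime, tangerine, blackberry, cherry, nectarine, apricot]
I want to keep iterating over the list until every term has been visited and no more suggestions are being added to it. How can I do this?
Here is my code: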
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

# assumes `driver` is an existing WebDriver already showing the first results page
searches, urls, results = [], [], []

# get suggestions listed for the first search term
suggestions = driver.find_element(By.XPATH, '//*[@id="search-associates"]').find_elements(By.TAG_NAME, 'a')
for i in suggestions:
    searches.append(i.text)
    urls.append(i.get_attribute('href'))
# remove the last entry because it is blank
urls.pop()

# iterate through the url of each suggestion; urls grows while the loop runs
for i in urls:
    # driver makes a new request to each url
    driver.get(i)
    results.append(driver.find_element(By.XPATH, '/html/body/div/div[4]/h2/span').text)
    # not all urls will have suggestions
    try:
        new_suggestions = driver.find_element(By.XPATH, '//*[@id="search-associates"]').find_elements(By.TAG_NAME, 'a')
    except NoSuchElementException:
        continue
    for x in new_suggestions:
        href = x.get_attribute('href')
        # skip blanks and urls that are already in the list, so each url is queued only once
        if href and href not in urls:
            searches.append(x.text)
            urls.append(href)
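Framed without Selenium, what I am describing looks like a breadth-first traversal with a work queue and a set of already-seen terms. The following is only a sketch under assumptions: `crawl_suggestions`, `fake_get_suggestions`, and `FAKE_SITE` are hypothetical names, and the dictionary stands in for the real site using the fruit example from above.

```python
from collections import deque


def crawl_suggestions(start, get_suggestions):
    """Visit every term reachable from start and return them in visit order.

    get_suggestions is any callable mapping a term to a list of suggested terms.
    """
    seen = {start}          # every term that was ever queued
    queue = deque([start])  # terms still waiting to be visited
    visited = []
    while queue:
        term = queue.popleft()
        visited.append(term)
        for suggestion in get_suggestions(term):
            if suggestion not in seen:  # skips duplicates and visited terms
                seen.add(suggestion)
                queue.append(suggestion)
    return visited


# Hypothetical stand-in for the website, built from the example table.
FAKE_SITE = {
    "apple": ["banana", "pear", "orange", "grape", "peach"],
    "banana": ["apple", "orange", "blueberry", "strawberry"],
    "pear": ["peach", "plum", "grape"],
    "orange": ["lemon", "grapefruit", "lime", "tangerine"],
    "grape": ["blueberry", "strawberry", "blackberry", "cherry"],
    "peach": ["nectarine", "plum", "apricot"],
}


def fake_get_suggestions(term):
    # Terms with no entry have no suggestions, like pages without a suggestion box.
    return FAKE_SITE.get(term, [])


order = crawl_suggestions("apple", fake_get_suggestions)
print(order)
```

The loop ends on its own once the queue is empty, that is, when every term has been visited and no page contributes an unseen suggestion. In the real scraper, `get_suggestions` would presumably load the term's URL with the driver and return the links found in the suggestion box.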
Thank you for taking the time to look at my question and for any help you can offer. I appreciate it.