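I have this code that loops through the Top Alexa 1000 websites and collects the ones that offer a Sign Up or Login in any form. If a site gets stuck or throws an Exception of any kind during an iteration, I remove it from the list and restart the loop with the next element. I am using the selenium package in Python to do this. It works fine, except that for some reason it loops over every other element of my alexa_1000 variable, which holds the list (i.e. it skips one element each time), instead of going through every element. Can anyone help? I can't see anything wrong with the code, and I have been debugging it to follow the program flow, but I can't actually figure out what is happening. The general flow of the program seems fine. When I print the index on each pass it also looks fine, in the sense that it goes from 0 to 1 to 2 to 3, despite the skipping. Any help is appreciated. Here is the code: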
from selenium import webdriver
from selenium.common.exceptions import TimeoutException


def get_alexa_top_pages():
    sites = []
    with open('topsites_1000.txt', 'r') as f:
        for line in f:
            line = line.strip('\n')
            sites.append(line)
    sites = filter(None, sites)
    sites = ['http://www.' + site for site in sites]
    return sites


def main():
    alexa_1000 = get_alexa_top_pages()
    out = open('sites_with_login.txt', 'w')
    sign_in_strings = ['sign in', 'signin', 'login', 'log in', 'sign up', 'signup']
    driver = webdriver.Firefox()
    driver.set_page_load_timeout(30)
    for index, page in enumerate(alexa_1000):
        try:
            print("Loading page %s (num %d)" % (page, index + 1))
            driver.get(page)
            html_source = driver.page_source
            html_source = html_source.lower()
            present = any([i in html_source for i in sign_in_strings])
            if present:
                out.write(page + '\n')
            alexa_1000.remove(page)
        except TimeoutException as ex:
            alexa_1000.remove(page)
            continue
        except Exception as ex:
            alexa_1000.remove(page)
            continue
    out.close()


if __name__ == "__main__":
    main()
Answer 0 (score: 2)
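There are different ways to get around this. The problem is that you are modifying the list while you enumerate it: each remove() shifts the remaining elements one position to the left while the loop's internal index still moves forward, so the element that slides into the current position is never visited. This should always be avoided. You can do so by rewriting the code like this: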
from selenium import webdriver
from selenium.common.exceptions import TimeoutException


def get_alexa_top_pages():
    sites = []
    with open('topsites_1000.txt', 'r') as f:
        for line in f:
            line = line.strip('\n')
            sites.append(line)
    sites = filter(None, sites)
    sites = ['http://www.' + site for site in sites]
    return sites


def main():
    # Note: a set does not keep the ranking order of the input file.
    alexa_1000 = set(get_alexa_top_pages())
    alexa_invalid = set()
    out = open('sites_with_login.txt', 'w')
    sign_in_strings = ['sign in', 'signin', 'login', 'log in', 'sign up', 'signup']
    driver = webdriver.Firefox()
    driver.set_page_load_timeout(30)
    for index, page in enumerate(alexa_1000):
        try:
            print("Loading page %s (num %d)" % (page, index + 1))
            driver.get(page)
            html_source = driver.page_source
            html_source = html_source.lower()
            present = any([i in html_source for i in sign_in_strings])
            if present:
                out.write(page + '\n')
        except TimeoutException as ex:
            # Record the failure instead of touching the set being looped over.
            alexa_invalid.add(page)
            continue
        except Exception as ex:
            alexa_invalid.add(page)
            continue
    # Everything that did not raise is considered valid.
    alexa_valid = alexa_1000 - alexa_invalid
    out.close()


if __name__ == "__main__":
    main()
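Here two sets are used: one to loop over and one to keep track of the invalid sites. When an exception occurs, the page is added to the invalid set. At the end you can subtract one from the other to get the valid sites.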
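If you want to try the "record failures separately" idea without starting a browser, here is a minimal sketch. The names split_valid_invalid and fake_check are made up for illustration, and fake_check only stands in for the driver.get(page) call and the string search. It also assumes you care about the ranking order, so it filters the original list rather than using set subtraction.

```python
def split_valid_invalid(pages, check):
    """Visit every page once; return (valid, invalid) without mutating pages."""
    invalid = set()
    for page in pages:
        try:
            check(page)        # stand-in for driver.get(page) and the search
        except Exception:
            invalid.add(page)  # record the failure; leave `pages` untouched
    # Filtering the original list keeps the ranking order,
    # which plain set subtraction would not guarantee.
    valid = [page for page in pages if page not in invalid]
    return valid, invalid


def fake_check(page):
    # Hypothetical checker: pretend any URL containing "bad" times out.
    if "bad" in page:
        raise RuntimeError("timed out: " + page)


pages = [
    "http://www.a.com",
    "http://www.bad1.com",
    "http://www.b.com",
    "http://www.bad2.com",
]
valid, invalid = split_valid_invalid(pages, fake_check)
```

Because nothing is removed from pages during the loop, every page is visited exactly once.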
Alternatively, loop over a reversed copy of the list:
from selenium import webdriver
from selenium.common.exceptions import TimeoutException


def get_alexa_top_pages():
    sites = []
    with open('topsites_1000.txt', 'r') as f:
        for line in f:
            line = line.strip('\n')
            sites.append(line)
    sites = filter(None, sites)
    sites = ['http://www.' + site for site in sites]
    return sites


def main():
    alexa_1000 = get_alexa_top_pages()
    out = open('sites_with_login.txt', 'w')
    sign_in_strings = ['sign in', 'signin', 'login', 'log in', 'sign up', 'signup']
    driver = webdriver.Firefox()
    driver.set_page_load_timeout(30)
    # alexa_1000[::-1] is a reversed copy, so alexa_1000 itself can be changed safely.
    for index, page in enumerate(alexa_1000[::-1]):
        try:
            print("Loading page %s (num %d)" % (page, index + 1))
            driver.get(page)
            html_source = driver.page_source
            html_source = html_source.lower()
            present = any([i in html_source for i in sign_in_strings])
            if present:
                out.write(page + '\n')
        except TimeoutException as ex:
            # remove(page) rather than pop(): pop() drops the last element,
            # which is not necessarily the page that just failed.
            alexa_1000.remove(page)
            continue
        except Exception as ex:
            alexa_1000.remove(page)
            continue
    out.close()


if __name__ == "__main__":
    main()
In this one you loop over a reversed copy of the list and remove the pages that fail from the original. Because the loop runs over the copy, changing alexa_1000 inside it does not disturb the iteration. At the end alexa_1000 will contain all the valid sites you processed.
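To see why the original loop skipped sites, and why looping over a copy avoids it, here is a small sketch with a plain list of numbers instead of URLs. The function names are only illustrative.

```python
def visit_while_removing(items):
    """Buggy pattern: remove each element from the list being iterated."""
    visited = []
    for item in items:
        visited.append(item)
        items.remove(item)  # shifts the remaining elements left by one
    return visited


def visit_copy_while_removing(items):
    """Safe pattern: iterate over a copy, remove from the original."""
    visited = []
    for item in items[:]:   # items[:] is a shallow copy (items[::-1] works too)
        visited.append(item)
        items.remove(item)
    return visited


skipped = visit_while_removing(list(range(6)))        # [0, 2, 4]
complete = visit_copy_while_removing(list(range(6)))  # [0, 1, 2, 3, 4, 5]
```

The first function only ever sees every other element, which is exactly the symptom in the question. The second sees all of them.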
There are many ways to solve this; the above shows just two of them that you could reasonably use.
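One more option, if you would rather not copy the list at all: walk the indices from the end and delete by index. This is only a sketch. remove_failing_in_place and fake_check are hypothetical names, and fake_check again stands in for the selenium calls.

```python
def remove_failing_in_place(pages, check):
    """Delete every page for which check(page) raises, editing pages in place."""
    # Walk the indices from last to first. Deleting index i only shifts
    # elements at higher indices, and those have already been visited.
    for i in range(len(pages) - 1, -1, -1):
        try:
            check(pages[i])
        except Exception:
            del pages[i]


def fake_check(page):
    # Hypothetical checker: pretend any URL containing "bad" fails to load.
    if "bad" in page:
        raise RuntimeError("could not load " + page)


pages = [
    "http://www.a.com",
    "http://www.bad1.com",
    "http://www.bad2.com",
    "http://www.b.com",
]
remove_failing_in_place(pages, fake_check)
```

Deleting by index also behaves predictably if the same URL appears twice in the list, whereas remove(page) always drops the first match.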