I have a free-proxy scraping script here, but it is currently raising an error:

Traceback (most recent call last):
  File "proxi.py", line 14, in <module>
    if (td[6].text=="no"): # If you change "no" to "yes" you get https
IndexError: list index out of range
import requests
from bs4 import BeautifulSoup

out = ""
urls = ["http://www.us-proxy.org/","http://free-proxy-list.net/uk-proxy.html","http://free-proxy-list.net/anonymous-proxy.html"]

for url in urls:
    r = requests.get(url)
    data = r.text
    soup = BeautifulSoup(data, "html.parser")
    tr = soup.find_all("tr")
    for t in tr:
        td = t.find_all("td")
        if (td):
            if (td[6].text=="no"): # If you change "no" to "yes" you get https
                out+=(td[0].text+":"+td[1].text+"\n")

f = open("proxy.txt", "w")
f.write(out)
f.close()
Answer 0 (score: 1)

td does not always have an index 6, so when you do td[6] it gives you an IndexError.

In this code I print out the length of td: https://onlinegdb.com/BkcnRSZgr
import requests
from bs4 import BeautifulSoup

out = ""
urls = ["http://www.us-proxy.org/","http://free-proxy-list.net/uk-proxy.html","http://free-proxy-list.net/anonymous-proxy.html"]

for url in urls:
    r = requests.get(url)
    data = r.text
    soup = BeautifulSoup(data, "html.parser")
    tr = soup.find_all("tr")
    for t in tr:
        td = t.find_all("td")
        print(len(td))
        if (td):
            if (td[6].text=="yes"): # If you change "no" to "yes" you get https
                out+=(td[0].text+":"+td[1].text+"\n")

f = open("proxy.txt", "w")
f.write(out)
f.close()
Here is an example of what happens for me: https://onlinegdb.com/HJIXlL-xH

Hope it helps you understand it better.
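Building on that diagnosis (this is a sketch of my own, not part of the original answer): the rows that break are the heading/footer rows, which have fewer than seven td cells, so one minimal fix is to check the length before indexing:

import requests
from bs4 import BeautifulSoup

out = ""
urls = ["http://www.us-proxy.org/",
        "http://free-proxy-list.net/uk-proxy.html",
        "http://free-proxy-list.net/anonymous-proxy.html"]

for url in urls:
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    for t in soup.find_all("tr"):
        td = t.find_all("td")
        # Skip any row that does not have at least 7 cells (heading/footer rows).
        if len(td) >= 7 and td[6].text == "no":  # use "yes" for https
            out += td[0].text + ":" + td[1].text + "\n"

with open("proxy.txt", "w") as f:
    f.write(out)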
Answer 1 (score: 1)

These URLs have a similar markup structure:

urls = ["http://www.us-proxy.org/","http://free-proxy-list.net/uk-proxy.html","http://free-proxy-list.net/anonymous-proxy.html"]

There is a table with the id proxylisttable that contains the proxy list along with a heading row and a footer row. I suggest limiting the selection of tr to rows inside this table, e.g.
trs = bs.select("table#proxylisttable tr")
proxies = trs[1:-1] # exclude heading and footer
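As a rough sketch of how that selection could slot into the original loop (my own adaptation, assuming all three pages use the same proxylisttable layout):

import requests
from bs4 import BeautifulSoup

out = ""
urls = ["http://www.us-proxy.org/",
        "http://free-proxy-list.net/uk-proxy.html",
        "http://free-proxy-list.net/anonymous-proxy.html"]

for url in urls:
    bs = BeautifulSoup(requests.get(url).text, "html.parser")
    # Only take rows inside the proxy table, excluding its heading and footer rows.
    trs = bs.select("table#proxylisttable tr")
    for t in trs[1:-1]:
        td = t.find_all("td")
        if len(td) >= 7 and td[6].text == "no":  # use "yes" for https
            out += td[0].text + ":" + td[1].text + "\n"

with open("proxy.txt", "w") as f:
    f.write(out)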