With the help of the forum I put together a script that grabs all the thread links from this page: https://www.inforge.net/xi/forums/liste-proxy.1118/. Those threads contain proxy lists. The script is:
import urllib.request, re
from bs4 import BeautifulSoup

url = "https://www.inforge.net/xi/forums/liste-proxy.1118/"
soup = BeautifulSoup(urllib.request.urlopen(url), "lxml")
base = "https://www.inforge.net/xi/"

for tag in soup.find_all("a", {"class": "PreviewTooltip"}):
    links = tag.get("href")
    final = [base + links]
    final2 = urllib.request.urlopen(final)
    for line in final2:
        ip = re.findall("(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3}):(?:[\d]{1,5})", line)
        ip = ip[3:-1]
        for addr in ip:
            print(addr)
The output is:
Traceback (most recent call last):
  File "proxygen5.0.py", line 13, in <module>
    sourcecode = urllib.request.urlopen(final)
  File "/usr/lib/python3.5/urllib/request.py", line 162, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.5/urllib/request.py", line 456, in open
    req.timeout = timeout
AttributeError: 'list' object has no attribute 'timeout'
I know the problem is in this line: final2 = urllib.request.urlopen(final), but I don't know how to fix it. What should I do to print the IPs?
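(For context: the call fails because `urllib.request.urlopen` expects a URL string or a `Request` object, while `final` is a one-element list. A minimal sketch of the difference, using a made-up thread path purely for illustration:)

```python
import urllib.request

base = "https://www.inforge.net/xi/"
link = "threads/example-list.1/"  # hypothetical href, for illustration only

# Wrapping the URL in a list is what triggers
# AttributeError: 'list' object has no attribute 'timeout'
final = [base + link]

# A plain string is what urlopen actually expects:
final = base + link
# result = urllib.request.urlopen(final)  # network call, commented out here
print(final)
```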
Answer 0 (score: 2)
This code should do what you want; it is commented so you can follow every step:
import urllib.request, re
from bs4 import BeautifulSoup

url = "https://www.inforge.net/xi/forums/liste-proxy.1118/"
soup = BeautifulSoup(urllib.request.urlopen(url), "lxml")
base = "https://www.inforge.net/xi/"

# Iterate over all the <a> tags
for tag in soup.find_all("a", {"class": "PreviewTooltip"}):
    # Get the link from the tag
    link = tag.get("href")
    # Compose the new link
    final = base + link
    print('Request to {}'.format(final))  # To know what we are doing
    # Download the 'final' link content
    result = urllib.request.urlopen(final)
    # For every line in the downloaded content
    for line in result:
        # Find one or more IP(s); the lines are converted to str because
        # iterating over the response yields `bytes` objects
        ip = re.findall("(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3}):(?:[\d]{1,5})", str(line))
        # If one or more IP(s) were found
        if ip:
            # Print them on separate lines
            print('\n'.join(ip))
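One caveat about the `str(line)` conversion above: calling `str()` on a `bytes` object produces its repr (e.g. `"b'1.2.3.4:80'"`), which happens to still contain the address, but decoding the bytes is usually the cleaner route. A small sketch, using an invented sample line:

```python
import re

# Same pattern idea as above, written a bit more compactly
pattern = r"(?:\d{1,3}\.){3}\d{1,3}:\d{1,5}"
line = b"proxy: 192.168.1.10:8080 alive"  # invented sample, for illustration

# str() yields the bytes repr "b'proxy: ...'" -- the regex still matches inside it
print(re.findall(pattern, str(line)))                        # ['192.168.1.10:8080']
# decode() yields real text, which is the more idiomatic choice
print(re.findall(pattern, line.decode("utf-8", "replace")))  # ['192.168.1.10:8080']
```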