Capturing links and IPs with Python 3

Date: 2016-07-18 22:00:08

Tags: python python-3.x hyperlink timeout try-catch

With help from the forum, I wrote a script that captures all the thread links on this page: https://www.inforge.net/xi/forums/liste-proxy.1118/. Those threads contain proxy lists. The script is:

import urllib.request, re
from bs4 import BeautifulSoup

url = "https://www.inforge.net/xi/forums/liste-proxy.1118/"
soup = BeautifulSoup(urllib.request.urlopen(url), "lxml")

base = "https://www.inforge.net/xi/"

for tag in soup.find_all("a", {"class":"PreviewTooltip"}):
    links = tag.get("href")
    final = [base + links]

final2 = urllib.request.urlopen(final)

for line in final2:
    ip = re.findall("(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3}):(?:[\d]{1,5})", line)
    ip = ip[3:-1]

for addr in ip:
    print(addr)

The output is:

Traceback (most recent call last):
  File "proxygen5.0.py", line 13, in <module>
    sourcecode = urllib.request.urlopen(final)
  File "/usr/lib/python3.5/urllib/request.py", line 162, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.5/urllib/request.py", line 456, in open
    req.timeout = timeout
AttributeError: 'list' object has no attribute 'timeout'

I know the problem is in this part: final2 = urllib.request.urlopen(final), but I don't know how to fix it.

What do I need to do to print the IPs?

1 answer:

Answer 0 (score: 2)

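The error occurs because final is a list, while urllib.request.urlopen expects a single URL string (or a Request object), so each link has to be opened inside the loop. This code should do what you want; it is commented so that you can follow every step: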

import urllib.request, re
from bs4 import BeautifulSoup

url = "https://www.inforge.net/xi/forums/liste-proxy.1118/"
soup = BeautifulSoup(urllib.request.urlopen(url), "lxml")

base = "https://www.inforge.net/xi/"

# Iterate over all the <a> tags
for tag in soup.find_all("a", {"class":"PreviewTooltip"}):
    # Get the link from the tag
    link = tag.get("href")
    # Compose the new link
    final = base + link

    print('Request to {}'.format(final))    # To know what we are doing
    # Download the 'final' link content
    result = urllib.request.urlopen(final)

    # For every line in the downloaded content
    for line in result:
        # Find one or more IP(s); each line is a `bytes` object, so decode it to a string first
        ip = re.findall(r"(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3}):(?:[\d]{1,5})", line.decode("utf-8", errors="ignore"))
        # If one or more IP(s) are found
        if ip:
            # Print them on separate lines
            print('\n'.join(ip))
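If you want to check the link-building and IP-matching steps without hitting the site, something like the following sketch may help. It is not part of the script above: the helper names build_links and extract_proxies, the sample hrefs, and the sample lines are all made up for illustration, and urljoin is used here in place of plain string concatenation. The pattern is the same one as above, just compiled once.

```python
import re
from urllib.parse import urljoin

# Same ip:port pattern as in the answer, written as a raw string and compiled once.
PROXY_RE = re.compile(r"(?:\d{1,3})\.(?:\d{1,3})\.(?:\d{1,3})\.(?:\d{1,3}):(?:\d{1,5})")


def build_links(base, hrefs):
    """Return one absolute URL string per href (hypothetical helper).

    urlopen needs a single URL string per call, so the caller should loop
    over this list instead of passing the whole list to urlopen.
    """
    return [urljoin(base, href) for href in hrefs]


def extract_proxies(lines):
    """Collect every ip:port match from an iterable of bytes lines (hypothetical helper)."""
    found = []
    for line in lines:
        # The HTTP response yields bytes, so decode before matching.
        text = line.decode("utf-8", errors="ignore")
        found.extend(PROXY_RE.findall(text))
    return found


# Made-up sample data standing in for the real page content.
sample_hrefs = ["threads/example-list.1/", "threads/example-list.2/"]
sample_lines = [
    b"<li>10.0.0.1:8080</li>\n",
    b"no proxy here\n",
    b"192.168.1.5:3128 10.0.0.1:8080\n",
]

links = build_links("https://www.inforge.net/xi/", sample_hrefs)
proxies = extract_proxies(sample_lines)

print("\n".join(links))
print("\n".join(proxies))
```

In the real script you would pass the href values collected from the PreviewTooltip tags to build_links, and pass each urlopen result to extract_proxies, since the response object can be iterated line by line as bytes.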