Scraping a free proxy list website

Time: 2018-01-24 15:58:20

Tags: python web-scraping

I am trying to scrape a free proxy list website, but I cannot extract the proxies.

Here is my code:

import requests
import re

url = 'https://free-proxy-list.net/'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36'}

source = requests.get(url, headers=headers, timeout=10).text

proxies = re.findall(r'([0-9]{1,3}\.){3}[0-9]{1,3}(:[0-9]{2,4})?', source)

print(proxies)

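I would be very grateful if someone could help me do this without extra libraries/modules such as BeautifulSoup.

4 answers:

Answer 0: (score: 5)

It is generally better to use a parser such as BeautifulSoup rather than regular expressions to extract data from HTML, because it is hard to reproduce BeautifulSoup's accuracy with regex; however, you can try this with pure regex: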

import re
import requests

url = 'https://free-proxy-list.net/'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36'}
source = requests.get(url, headers=headers, timeout=10).text
# Each match is a 2-tuple with one empty group; keep the non-empty one (the cell text).
data = [list(filter(None, i))[0] for i in re.findall(r'<td class="hm">(.*?)</td>|<td>(.*?)</td>', source)]
# Chunk the flat list of cell values into dicts of four fields each.
groupings = [dict(zip(['ip', 'port', 'code', 'using_anonymous'], data[i:i+4])) for i in range(0, len(data), 4)]

Sample output (actual length is 300):

[{'ip': '47.88.242.10', 'port': '80', 'code': 'SG', 'using_anonymous': 'anonymous'}, {'ip': '118.189.172.136', 'port': '80', 'code': 'SG', 'using_anonymous': 'elite proxy'}, {'ip': '147.135.210.114', 'port': '54566', 'code': 'PL', 'using_anonymous': 'anonymous'}, {'ip': '5.148.150.155', 'port': '8080', 'code': 'GB', 'using_anonymous': 'elite proxy'}, {'ip': '186.227.8.21', 'port': '3128', 'code': 'BR', 'using_anonymous': 'anonymous'}, {'ip': '49.151.155.60', 'port': '8080', 'code': 'PH', 'using_anonymous': 'anonymous'}, {'ip': '52.170.255.17', 'port': '80', 'code': 'US', 'using_anonymous': 'anonymous'}, {'ip': '51.15.35.239', 'port': '3128', 'code': 'NL', 'using_anonymous': 'elite proxy'}, {'ip': '163.172.27.213', 'port': '3128', 'code': 'GB', 'using_anonymous': 'elite proxy'}, {'ip': '94.137.31.214', 'port': '8080', 'code': 'RU', 'using_anonymous': 'anonymous'}]

Edit: to join the IP and port, iterate over each grouping and use string formatting:

final_groupings = [{'full_ip':"{ip}:{port}".format(**i)} for i in groupings]

Output:

[{'full_ip': '47.88.242.10:80'}, {'full_ip': '118.189.172.136:80'}, {'full_ip': '147.135.210.114:54566'}, {'full_ip': '5.148.150.155:8080'}, {'full_ip': '186.227.8.21:3128'}, {'full_ip': '49.151.155.60:8080'}, {'full_ip': '52.170.255.17:80'}, {'full_ip': '51.15.35.239:3128'}, {'full_ip': '163.172.27.213:3128'}, {'full_ip': '94.137.31.214:8080'}]
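
As a side note on why the pattern in the question does not return the addresses: when a pattern contains capturing groups, `re.findall` returns the groups rather than the whole match, and a repeated group only keeps its last repetition. The sketch below shows this on a small hypothetical HTML fragment (`sample_html` is made up for illustration and only assumes that the IP and port sit in adjacent `<td>` cells), then uses non-capturing groups `(?:...)` for the octets so that only the IP and the port are captured.

```python
import re

# Hypothetical fragment imitating adjacent IP/port cells; not the real page.
sample_html = (
    "<tr><td>47.88.242.10</td><td>80</td><td>SG</td></tr>"
    "<tr><td>147.135.210.114</td><td>54566</td><td>PL</td></tr>"
)

# Pattern from the question: findall returns the capturing groups,
# i.e. the last repeated octet and the (absent) ":port" part.
question_pattern = r'([0-9]{1,3}\.){3}[0-9]{1,3}(:[0-9]{2,4})?'
group_only_result = re.findall(question_pattern, sample_html)
print(group_only_result)

# Non-capturing group for the octets; one capture for the IP, one for the port.
cell_pattern = r'<td>((?:[0-9]{1,3}\.){3}[0-9]{1,3})</td><td>([0-9]{1,5})</td>'
proxies = ['{}:{}'.format(ip, port) for ip, port in re.findall(cell_pattern, sample_html)]
print(proxies)
```

The port is matched with `{1,5}` here because ports such as 54566 have five digits, which the `{2,4}` in the question would truncate.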

Answer 1: (score: 4)

If you are willing to use BeautifulSoup instead of regex, you can also do something like the following:

import requests
from bs4 import BeautifulSoup

res = requests.get('https://free-proxy-list.net/', headers={'User-Agent':'Mozilla/5.0'})
soup = BeautifulSoup(res.text,"lxml")
for items in soup.select("tbody tr"):
    proxy_list = ':'.join([item.text for item in items.select("td")[:2]])
    print(proxy_list)

Partial output:

122.183.139.109:8080
154.66.122.130:53281
110.77.183.158:42619
159.192.226.247:54214
47.89.41.164:80
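
Since the question asks to avoid extra libraries, the same row-by-row idea can probably be approximated with the standard library's `html.parser`. This is only a minimal sketch: the class name `ProxyTableParser` and the `sample_html` fragment are hypothetical, and it assumes the IP and port are the first two `<td>` cells of each row, as in the BeautifulSoup version above.

```python
from html.parser import HTMLParser


class ProxyTableParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []         # finished rows, each a list of cell strings
        self._row = None       # cells of the row currently being read
        self._in_cell = False  # True while inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        # Only text inside a <td> of an open row is kept.
        if self._in_cell and self._row is not None:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows contain only <th>, so they stay empty
                self.rows.append(self._row)
            self._row = None
            self._in_cell = False


# Hypothetical fragment standing in for the downloaded page source.
sample_html = """
<table><thead><tr><th>IP Address</th><th>Port</th><th>Code</th></tr></thead>
<tbody>
<tr><td>122.183.139.109</td><td>8080</td><td>IN</td></tr>
<tr><td>47.89.41.164</td><td>80</td><td>US</td></tr>
</tbody></table>
"""

parser = ProxyTableParser()
parser.feed(sample_html)
proxies = [":".join(row[:2]) for row in parser.rows if len(row) >= 2]
print(proxies)
```

With the real page, `sample_html` would be replaced by the text returned from `requests.get(...)`; any other tables on the page would be collected as well and may need filtering.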

Answer 2: (score: 1)

pandas can be an alternative to BeautifulSoup. I successfully scraped free-proxy-list.net using the pandas.read_html function.

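from io import StringIO

import requests
import pandas as pd

resp = requests.get('https://free-proxy-list.net/')
# Recent pandas versions deprecate passing literal HTML strings, so wrap the text in StringIO.
df = pd.read_html(StringIO(resp.text))[0]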

The resulting DataFrame is stored in df:

         IP Address     Port Code               Country    Anonymity Google Https    Last Checked
0      2.50.154.155  53281.0   AE  United Arab Emirates  elite proxy     no   yes   6 seconds ago
1    134.249.165.49  53281.0   UA               Ukraine  elite proxy     no   yes   6 seconds ago
2    158.58.133.106  41258.0   RU    Russian Federation  elite proxy     no   yes   6 seconds ago
3     92.52.186.123  32329.0   UA               Ukraine  elite proxy     no   yes   6 seconds ago
4     178.213.0.207  35140.0   UA               Ukraine  elite proxy     no   yes   6 seconds ago
..              ...      ...  ...                   ...          ...    ...   ...             ...
296    93.185.96.60  41003.0   CZ        Czech Republic  elite proxy     no   yes  22 minutes ago
297    1.20.103.248  52574.0   TH              Thailand  elite proxy     no   yes  22 minutes ago
298    190.210.8.92   8080.0   AR             Argentina  elite proxy     no   yes  22 minutes ago
299  166.150.32.182  56074.0   US         United States  elite proxy     no   yes  22 minutes ago
300             NaN      NaN  NaN                   NaN          NaN    NaN   NaN             NaN

[301 rows x 8 columns]

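This DataFrame can now be manipulated in any way. For example, if I only want elite proxies that are listed in the United States, I can filter with df[(df['Anonymity'] == 'elite proxy') & (df['Country'] == 'United States')], which returns: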

         IP Address     Port Code        Country    Anonymity Google Https    Last Checked
32    138.68.53.220   5836.0   US  United States  elite proxy     no   yes   6 seconds ago
76   173.217.255.36  33351.0   US  United States  elite proxy     no    no  10 seconds ago
86    24.172.34.114  40675.0   US  United States  elite proxy     no    no  10 seconds ago
111   209.190.32.28   3128.0   US  United States  elite proxy     no   yes  10 seconds ago
150  104.148.76.176   3128.0   US  United States  elite proxy     no    no  11 minutes ago
151  104.148.76.185   3128.0   US  United States  elite proxy     no    no  11 minutes ago
168  104.148.76.136   3128.0   US  United States  elite proxy     no    no  11 minutes ago
169  104.148.76.182   3128.0   US  United States  elite proxy     no    no  11 minutes ago
182  104.148.76.183   3128.0   US  United States  elite proxy     no   yes  11 minutes ago
184      3.95.11.66   3128.0   US  United States  elite proxy     no   yes  12 minutes ago
190    63.249.67.70  53281.0   US  United States  elite proxy     no    no  12 minutes ago
288  205.201.49.141  53281.0   US  United States  elite proxy     no   yes  22 minutes ago
299  166.150.32.182  56074.0   US  United States  elite proxy     no   yes  22 minutes ago

From here, getting the IP addresses and their ports is as simple as df['IP Address'] and df['Port'].
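
One detail worth noting from the output above: the Port column comes back as floats (e.g. 53281.0) and the last row is all NaN, so building "ip:port" strings needs a little cleanup first. The following is a minimal sketch on a small hand-made DataFrame that stands in for the scraped table (the two data rows are copied from the output above; the construction itself is an assumption for illustration).

```python
import pandas as pd

# Hypothetical stand-in for the scraped table, including an all-NaN trailing row.
df = pd.DataFrame({
    "IP Address": ["2.50.154.155", "134.249.165.49", None],
    "Port": [53281.0, 53281.0, float("nan")],
})

# Drop rows with a missing IP or port, then cast the port from float to int
# so that it is rendered as "53281" instead of "53281.0".
clean = df.dropna(subset=["IP Address", "Port"])
proxies = (clean["IP Address"] + ":" + clean["Port"].astype(int).astype(str)).tolist()
print(proxies)
```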

Answer 3: (score: 0)

You can use the Agenty Chrome extension to easily write and test CSS selectors, and then run that configuration with BeautifulSoup. Here is an example: https://forum.agenty.com/t/how-to-scrape-free-proxy-list-from-internet/19

Full disclosure: I am the developer of this product.