I am trying to scrape a free proxy list website, but I cannot extract the proxies.
Here is my code:
import requests
import re
url = 'https://free-proxy-list.net/'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36'}
source = requests.get(url, headers=headers, timeout=10).text
proxies = re.findall(r'([0-9]{1,3}\.){3}[0-9]{1,3}(:[0-9]{2,4})?', source)
print(proxies)
I would be very grateful if someone could help me without using additional libraries/modules such as BeautifulSoup.
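Answer 0 (score: 5)
It is generally better to use a parser such as BeautifulSoup to extract data from HTML rather than regular expressions, because it is very hard to reproduce BeautifulSoup's accuracy with a regex. However, you can try this with pure regex: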
import re
import requests

url = 'https://free-proxy-list.net/'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36'}
source = requests.get(url, headers=headers, timeout=10).text
# Each match is a 2-tuple with one empty slot; keep the non-empty cell text
data = [list(filter(None, i))[0] for i in re.findall(r'<td class="hm">(.*?)</td>|<td>(.*?)</td>', source)]
# Group the flat list of cells into records of four fields
groupings = [dict(zip(['ip', 'port', 'code', 'using_anonymous'], data[i:i+4])) for i in range(0, len(data), 4)]
Sample output (the actual length is 300):
[{'ip': '47.88.242.10', 'port': '80', 'code': 'SG', 'using_anonymous': 'anonymous'}, {'ip': '118.189.172.136', 'port': '80', 'code': 'SG', 'using_anonymous': 'elite proxy'}, {'ip': '147.135.210.114', 'port': '54566', 'code': 'PL', 'using_anonymous': 'anonymous'}, {'ip': '5.148.150.155', 'port': '8080', 'code': 'GB', 'using_anonymous': 'elite proxy'}, {'ip': '186.227.8.21', 'port': '3128', 'code': 'BR', 'using_anonymous': 'anonymous'}, {'ip': '49.151.155.60', 'port': '8080', 'code': 'PH', 'using_anonymous': 'anonymous'}, {'ip': '52.170.255.17', 'port': '80', 'code': 'US', 'using_anonymous': 'anonymous'}, {'ip': '51.15.35.239', 'port': '3128', 'code': 'NL', 'using_anonymous': 'elite proxy'}, {'ip': '163.172.27.213', 'port': '3128', 'code': 'GB', 'using_anonymous': 'elite proxy'}, {'ip': '94.137.31.214', 'port': '8080', 'code': 'RU', 'using_anonymous': 'anonymous'}]
Edit: to join the IP and port, iterate over each grouping and use string formatting:
final_groupings = [{'full_ip':"{ip}:{port}".format(**i)} for i in groupings]
Output:
[{'full_ip': '47.88.242.10:80'}, {'full_ip': '118.189.172.136:80'}, {'full_ip': '147.135.210.114:54566'}, {'full_ip': '5.148.150.155:8080'}, {'full_ip': '186.227.8.21:3128'}, {'full_ip': '49.151.155.60:8080'}, {'full_ip': '52.170.255.17:80'}, {'full_ip': '51.15.35.239:3128'}, {'full_ip': '163.172.27.213:3128'}, {'full_ip': '94.137.31.214:8080'}]
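As for why the pattern in the question prints fragments rather than full addresses: when a pattern contains capturing groups, re.findall returns the groups instead of the whole match. Below is a minimal sketch of the difference, run on a made-up sample string rather than the live page. Switching to non-capturing groups (?:...) and widening the port to 2-5 digits are my own adjustments, not part of the answer above.

```python
import re

# Made-up sample text for illustration; not fetched from the site.
sample = "<td>47.88.242.10</td><td>80</td> and inline 5.148.150.155:8080"

# Pattern from the question: it has two capturing groups, so findall
# returns a tuple (last repeated octet group, optional port group) per match.
with_groups = re.findall(r'([0-9]{1,3}\.){3}[0-9]{1,3}(:[0-9]{2,4})?', sample)
print(with_groups)  # [('242.', ''), ('150.', ':8080')]

# Same pattern with non-capturing groups: findall returns the full match.
whole = re.findall(r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}(?::[0-9]{2,5})?', sample)
print(whole)  # ['47.88.242.10', '5.148.150.155:8080']
```

Note that on a page where the IP and port sit in separate table cells, this pattern would find only the bare IP addresses, which is why the answer above matches the cells themselves.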
Answer 1 (score: 4)
If you are open to using BeautifulSoup instead of regex, you can also do something like the following:
import requests
from bs4 import BeautifulSoup
res = requests.get('https://free-proxy-list.net/', headers={'User-Agent':'Mozilla/5.0'})
soup = BeautifulSoup(res.text,"lxml")
for items in soup.select("tbody tr"):
    # The first two cells of each row are the IP address and the port
    proxy_list = ':'.join([item.text for item in items.select("td")[:2]])
    print(proxy_list)
Partial output:
122.183.139.109:8080
154.66.122.130:53281
110.77.183.158:42619
159.192.226.247:54214
47.89.41.164:80
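Since the question asks to avoid extra libraries, the same row-by-row idea (take the first two cells of every table row) can also be sketched with the standard library's html.parser. This is only a rough sketch: ProxyTableParser, extract_proxies, and the sample HTML are hypothetical names and data made up for illustration, and the real page's markup may differ.

```python
from html.parser import HTMLParser


class ProxyTableParser(HTMLParser):
    """Collects the text of every <td>, grouped by <tr> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, None outside <tr>
        self._cell = None   # text pieces of the cell being read, None outside <td>

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td' and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == 'td' and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            if self._row:  # header rows contain only <th>, so they stay empty
                self.rows.append(self._row)
            self._row = None


def extract_proxies(html):
    """Return 'ip:port' strings built from the first two cells of each row."""
    parser = ProxyTableParser()
    parser.feed(html)
    return [':'.join(row[:2]) for row in parser.rows if len(row) >= 2]


# Made-up sample markup; in practice this would be requests.get(...).text
sample_html = """
<table><thead><tr><th>IP Address</th><th>Port</th></tr></thead>
<tbody>
<tr><td>122.183.139.109</td><td>8080</td><td class="hm">ignored</td></tr>
<tr><td>47.89.41.164</td><td>80</td><td class="hm">ignored</td></tr>
</tbody></table>
"""
print(extract_proxies(sample_html))  # ['122.183.139.109:8080', '47.89.41.164:80']
```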
Answer 2 (score: 1)
pandas can be an alternative to BeautifulSoup. I successfully scraped free-proxy-list.net using the pandas.read_html function:
import requests
import pandas as pd
resp = requests.get('https://free-proxy-list.net/')
df = pd.read_html(resp.text)[0]
The resulting DataFrame is stored in df:
         IP Address     Port  Code               Country    Anonymity  Google  Https    Last Checked
0      2.50.154.155  53281.0    AE  United Arab Emirates  elite proxy      no    yes   6 seconds ago
1    134.249.165.49  53281.0    UA               Ukraine  elite proxy      no    yes   6 seconds ago
2    158.58.133.106  41258.0    RU    Russian Federation  elite proxy      no    yes   6 seconds ago
3     92.52.186.123  32329.0    UA               Ukraine  elite proxy      no    yes   6 seconds ago
4     178.213.0.207  35140.0    UA               Ukraine  elite proxy      no    yes   6 seconds ago
..              ...      ...   ...                   ...          ...     ...    ...             ...
296    93.185.96.60  41003.0    CZ        Czech Republic  elite proxy      no    yes  22 minutes ago
297    1.20.103.248  52574.0    TH              Thailand  elite proxy      no    yes  22 minutes ago
298    190.210.8.92   8080.0    AR             Argentina  elite proxy      no    yes  22 minutes ago
299  166.150.32.182  56074.0    US         United States  elite proxy      no    yes  22 minutes ago
300             NaN      NaN   NaN                   NaN          NaN     NaN    NaN             NaN
[301 rows x 8 columns]
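This DataFrame can now be manipulated in any way you like. For example, suppose I only want elite proxies that are listed in the United States. I can filter with df[(df['Anonymity'] == 'elite proxy') & (df['Country'] == 'United States')], which returns: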
         IP Address     Port  Code        Country    Anonymity  Google  Https    Last Checked
32    138.68.53.220   5836.0    US  United States  elite proxy      no    yes   6 seconds ago
76   173.217.255.36  33351.0    US  United States  elite proxy      no     no  10 seconds ago
86    24.172.34.114  40675.0    US  United States  elite proxy      no     no  10 seconds ago
111   209.190.32.28   3128.0    US  United States  elite proxy      no    yes  10 seconds ago
150  104.148.76.176   3128.0    US  United States  elite proxy      no     no  11 minutes ago
151  104.148.76.185   3128.0    US  United States  elite proxy      no     no  11 minutes ago
168  104.148.76.136   3128.0    US  United States  elite proxy      no     no  11 minutes ago
169  104.148.76.182   3128.0    US  United States  elite proxy      no     no  11 minutes ago
182  104.148.76.183   3128.0    US  United States  elite proxy      no    yes  11 minutes ago
184      3.95.11.66   3128.0    US  United States  elite proxy      no    yes  12 minutes ago
190    63.249.67.70  53281.0    US  United States  elite proxy      no     no  12 minutes ago
288  205.201.49.141  53281.0    US  United States  elite proxy      no    yes  22 minutes ago
299  166.150.32.182  56074.0    US  United States  elite proxy      no    yes  22 minutes ago
From here, getting the IP addresses and their associated ports is as simple as df['IP Address'] and df['Port'].
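One thing to watch for in the output above: the Port column is shown as a float (53281.0) and the last row is all NaN, so joining the two columns directly would give strings like '2.50.154.155:53281.0'. Here is a small sketch of one way to build clean ip:port strings. It assumes a stand-in DataFrame with a few values copied from the output above instead of the live page, and the names clean and proxies are mine.

```python
import pandas as pd

# Stand-in for the scraped table: two rows copied from the output above,
# plus an empty trailing row like row 300.
df = pd.DataFrame({
    'IP Address': ['2.50.154.155', '134.249.165.49', None],
    'Port': [53281.0, 53281.0, float('nan')],
})

# Drop rows with a missing address or port, then cast the port to int
# so it prints as 53281 rather than 53281.0.
clean = df.dropna(subset=['IP Address', 'Port'])
proxies = (clean['IP Address'] + ':' + clean['Port'].astype(int).astype(str)).tolist()
print(proxies)  # ['2.50.154.155:53281', '134.249.165.49:53281']
```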
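Answer 3 (score: 0)
You can use the Agenty Chrome extension to easily write and test CSS selectors, then run that configuration with BeautifulSoup. Here is an example: https://forum.agenty.com/t/how-to-scrape-free-proxy-list-from-internet/19
Full disclosure: I am the developer of this product.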