Question

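I'm working on a scraper that tries to pull domains out of HTML. However, I'm having trouble coming up with a pattern that matches Tor domains.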

Tor domains usually take the following format:

or

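I only want to match URLs contained in the page that have the form http://sitetexthere.onion or https://sitehereitis.onion. The page is a blob of text, much of which may not be URLs, and the pattern should pull out only the URLs.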

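I'm sure there's a simple or well-known regex for this, but I haven't been able to find one. If anyone could link to one or quickly put one together, I'd really appreciate it. Thanks a lot.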

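    import requests

    # Route all traffic through the local Tor SOCKS proxy (default port 9050).
    # The socks5h scheme lets the proxy resolve hostnames, which .onion addresses need.
    session = requests.session()
    session.proxies = {}
    session.proxies['http'] = 'socks5h://localhost:9050'
    session.proxies['https'] = 'socks5h://localhost:9050'
    r = session.get('http://facebookcorewwwi.onion')
    print(r.text)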

Answer 1

regex.match returns None if the URL does not match.

    import re

    # http:// or https://, then a host of word characters, hyphens, and dots, ending in .onion
    regex = re.compile(r"^https?://[\w.-]+\.onion")

    url = 'https://sitegoes-here.onion'

    if regex.match(url):
        print('Valid Tor Domain!')
    else:
        print('Invalid Tor Domain!')

To make the http(s):// prefix optional:

    regex = re.compile(r"^(?:https?://)?[\w.-]+\.onion")
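
The patterns above validate a single string. Since the question asks for URLs to be pulled out of a larger page, here is a minimal sketch of the same idea without the `^` anchor, using `findall`. The name `find_onion_urls`, the sample text, and the trailing lookahead are assumptions added for illustration, not part of the original answer.

```python
import re

# Unanchored pattern: scheme, then a dotted host name ending in ".onion".
# The negative lookahead rejects matches that continue with a word character,
# a hyphen, or another dotted label (e.g. ".onionx" or ".onion.example.com").
ONION_URL = re.compile(r"https?://[\w.-]+\.onion(?![\w-]|\.\w)")

def find_onion_urls(text):
    """Return every http(s) .onion URL found in text, in order of appearance."""
    return ONION_URL.findall(text)

sample = (
    'See <a href="http://sitetexthere.onion">one</a> and '
    "https://sitehereitis.onion, but not https://example.com "
    "or http://fake.onionx."
)
print(find_onion_urls(sample))
```

This sketch only captures the scheme and host; any path after `.onion` is left out of the match.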

Answer 2

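Regex patterns are mostly standard, so I'd suggest this one:

'\.onion$'

The backslash escapes the dot, and the "$" character marks the end of the string. Since all URLs start with "http(s)://", there is no need to include that in the pattern.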
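
A minimal sketch of how this pattern behaves, assuming each candidate URL has already been isolated as its own string. The sample URLs are made up for illustration. Note that the pattern has to be applied with `re.search`, because `re.match` only tries to match at the start of the string.

```python
import re

# The answer's pattern: a literal ".onion" at the very end of the string.
ENDS_WITH_ONION = re.compile(r"\.onion$")

candidates = [
    "http://sitetexthere.onion",
    "https://sitehereitis.onion/",  # trailing slash, so it does not end in ".onion"
    "https://example.com",
    "ftp://files.onion",            # accepted, because the scheme is never checked
]

# search() scans the whole string for the pattern.
onion_like = [url for url in candidates if ENDS_WITH_ONION.search(url)]
print(onion_like)
```

Because only the ending is checked, this works as a filter on strings already known to be URLs, not as an extractor for free text.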

Answer 3

Assuming these come from href attributes, you could try an attribute=value selector with the $ (ends-with) operator:

    from bs4 import BeautifulSoup as bs
    import requests

    # Example URL; replace with your own.
    resp = requests.get("https://en.wikipedia.org/wiki/Tor_(anonymity_network)")
    soup = bs(resp.text, 'lxml')
    # [href$=".onion"] selects elements whose href attribute ends with ".onion"
    links = [item['href'] for item in soup.select('[href$=".onion"]')]
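
The ends-with selector compares the whole href string, so a link with a path after the host is missed and a clearnet link whose path happens to end in ".onion" is picked up. As a hedged alternative, the sketch below swaps BeautifulSoup for the standard library's `html.parser` and checks the parsed hostname instead. The class name `OnionLinkCollector` and the sample HTML are assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

class OnionLinkCollector(HTMLParser):
    """Collect href values whose hostname ends in .onion."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name != "href" or not value:
                continue
            parts = urlsplit(value)
            host = parts.hostname or ""
            # Check the host, not the full string, so paths do not interfere.
            if parts.scheme in ("http", "https") and host.endswith(".onion"):
                self.links.append(value)

html = (
    '<a href="http://sitetexthere.onion">a</a>'
    '<a href="https://sitehereitis.onion/about">b</a>'
    '<a href="https://example.com/page.onion">c</a>'
)
collector = OnionLinkCollector()
collector.feed(html)
print(collector.links)
```

Here the first two links are kept and the third is skipped, since its hostname is example.com.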

Regex to identify Tor domains
