I don't know how to handle this problem:
import requests
from lxml import html
url = "https://www.youtube.com/channel/UCfR-fBJh2962cim_1VnYTWA/about"
page = requests.get(url)
tree = html.fromstring(page.content)
social_links = tree.xpath("//div[@class='about-metadata branded-page-box-padding clearfix ']/ul/li/a/@href")
print(social_links)
Output:
['/redirect?event=channel_description&q=http%3A%2F%2Fytpoop.forumsgratuits.fr%2F&redir_token=r1CUB7VAljinDADqTCfphHrd5NZ8MTUzMjAwMDQ2NkAxNTMxOTE0MDY2',
'/redirect?event=channel_description&q=https%3A%2F%2Fhangouts.google.com%2F&redir_token=r1CUB7VAljinDADqTCfphHrd5NZ8MTUzMjAwMDQ2NkAxNTMxOTE0MDY2',
'https://plus.google.com/106805445520544523156']
Is it possible to use a regular expression to filter/extract the output so that I only get the full website URL? For the first example that would be something like "http://ytpoop.forumsgratuits.fr" instead of the full redirect URL. I already tried with the following code:
import re
social_links = [w.replace("%3A%2F%2F", "://") for w in social_links]
social_links = [w.replace("%2F", "") for w in social_links]
#social_links = [w.replace("%2F", "/") for w in social_links]
urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', str(social_links))
The only problem is that if a website URL is more specific or longer, not all of it gets extracted. My approach also loses the main information from the channel; for the Hangouts link, for example, my output is just https://hangouts.google.com.
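As a side note, the chained `replace` calls above can be collapsed into a single call to the standard library's `urllib.parse.unquote`, which reverses every percent-escape at once (a sketch using the first href from the output above):

```python
from urllib.parse import unquote

link = ('/redirect?event=channel_description&q=http%3A%2F%2Fytpoop.forumsgratuits.fr%2F'
        '&redir_token=r1CUB7VAljinDADqTCfphHrd5NZ8MTUzMjAwMDQ2NkAxNTMxOTE0MDY2')

# unquote decodes all percent-escapes in one pass (%3A -> ':', %2F -> '/')
decoded = unquote(link)
print(decoded)
```

This avoids having to know in advance which escape sequences appear in the links.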
Answer 0 (score: 0)
Edited to add details:
The following code sample loops over the social_links list in that order:
import re
import urllib.parse

for w in social_links:
    # decode the percent-escapes, then grab everything from "http" to the last "/"
    print(re.findall('http.*/', urllib.parse.unquote(w))[0])

Output:
http://ytpoop.forumsgratuits.fr/
https://hangouts.google.com/
https://plus.google.com/
You may also want to keep the Google+ ID from the URL:
for w in social_links:
    # 'http[^&]*' stops at the next '&', so the redirect token is dropped
    # while a trailing path such as the Google+ ID is kept
    print(re.findall('http[^&]*', urllib.parse.unquote(w))[0])

Output:
http://ytpoop.forumsgratuits.fr/
https://hangouts.google.com/
https://plus.google.com/106805445520544523156
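Since the redirect links carry the real target in the `q` query parameter, an alternative that avoids regexes entirely is to parse that parameter directly and fall back to the href when it is already a direct link (a sketch assuming the href formats shown in the question):

```python
from urllib.parse import urlparse, parse_qs

def extract_target(href):
    # Redirect hrefs put the real URL in the "q" query parameter;
    # parse_qs percent-decodes the value for us.
    qs = parse_qs(urlparse(href).query)
    if "q" in qs:
        return qs["q"][0]
    return href  # already a direct link

social_links = [
    '/redirect?event=channel_description&q=http%3A%2F%2Fytpoop.forumsgratuits.fr%2F'
    '&redir_token=r1CUB7VAljinDADqTCfphHrd5NZ8MTUzMjAwMDQ2NkAxNTMxOTE0MDY2',
    'https://plus.google.com/106805445520544523156',
]

for link in social_links:
    print(extract_target(link))
```

This keeps full paths and query strings of the target URL intact, which the regex approach can clip.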