I don't know how to handle this problem:
import requests
from lxml import html
url = "https://www.youtube.com/channel/UCfR-fBJh2962cim_1VnYTWA/about"
page = requests.get(url)
tree = html.fromstring(page.content)
social_links = tree.xpath("//div[@class='about-metadata branded-page-box-padding clearfix ']/ul/li/a/@href")
print(social_links)
Output:
['/redirect?event=channel_description&q=http%3A%2F%2Fytpoop.forumsgratuits.fr%2F&redir_token=r1CUB7VAljinDADqTCfphHrd5NZ8MTUzMjAwMDQ2NkAxNTMxOTE0MDY2',
'/redirect?event=channel_description&q=https%3A%2F%2Fhangouts.google.com%2F&redir_token=r1CUB7VAljinDADqTCfphHrd5NZ8MTUzMjAwMDQ2NkAxNTMxOTE0MDY2',
'https://plus.google.com/106805445520544523156']
Is it possible to use a regular expression to filter/extract the output so that I only get the full website URL? For the first example that would be something like "http://ytpoop.forumsgratuits.fr" instead of the full redirect URL. I already tried with the following code:
import re
social_links = [w.replace("%3A%2F%2F", "://") for w in social_links]
social_links = [w.replace("%2F", "") for w in social_links]
#social_links = [w.replace("%2F", "/") for w in social_links]
urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', str(social_links))
The only problem is that if a website URL is more specific or longer, not all of it gets extracted. My approach also loses the main information from the channel; for the Hangouts link, for example, my output is just https://hangouts.google.com.
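As a side note, the chained `replace` calls above can be collapsed into a single call to the standard library's `urllib.parse.unquote`, which reverses every percent-escape at once (a sketch using the first href from the output above):

```python
from urllib.parse import unquote

link = ('/redirect?event=channel_description&q=http%3A%2F%2Fytpoop.forumsgratuits.fr%2F'
        '&redir_token=r1CUB7VAljinDADqTCfphHrd5NZ8MTUzMjAwMDQ2NkAxNTMxOTE0MDY2')

# unquote decodes all percent-escapes in one pass (%3A -> ':', %2F -> '/')
decoded = unquote(link)
print(decoded)
```

This avoids having to know in advance which escape sequences appear in the links.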
Answer 0 (score: 0)
Edited to add details:
The following code sample loops over the social_links list in that order:
import re
import urllib.parse

for w in social_links:
    # decode the percent-escapes, then grab everything from "http" to the last "/"
    print(re.findall('http.*/', urllib.parse.unquote(w))[0])

Output:
http://ytpoop.forumsgratuits.fr/
https://hangouts.google.com/
https://plus.google.com/
You may also want to keep the Google+ ID from the URL:
for w in social_links:
    # 'http[^&]*' stops at the next '&', so the redirect token is dropped
    # while a trailing path such as the Google+ ID is kept
    print(re.findall('http[^&]*', urllib.parse.unquote(w))[0])

Output:
http://ytpoop.forumsgratuits.fr/
https://hangouts.google.com/
https://plus.google.com/106805445520544523156
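Since the redirect links carry the real target in the `q` query parameter, an alternative that avoids regexes entirely is to parse that parameter directly and fall back to the href when it is already a direct link (a sketch assuming the href formats shown in the question):

```python
from urllib.parse import urlparse, parse_qs

def extract_target(href):
    # Redirect hrefs put the real URL in the "q" query parameter;
    # parse_qs percent-decodes the value for us.
    qs = parse_qs(urlparse(href).query)
    if "q" in qs:
        return qs["q"][0]
    return href  # already a direct link

social_links = [
    '/redirect?event=channel_description&q=http%3A%2F%2Fytpoop.forumsgratuits.fr%2F'
    '&redir_token=r1CUB7VAljinDADqTCfphHrd5NZ8MTUzMjAwMDQ2NkAxNTMxOTE0MDY2',
    'https://plus.google.com/106805445520544523156',
]

for link in social_links:
    print(extract_target(link))
```

This keeps full paths and query strings of the target URL intact, which the regex approach can clip.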