I would like to know how to extract an external website's Title and Meta description using its URL. I found some solutions, but none for Django/Python.
Currently my code adds a link to the database; after the link is added I want to visit it and then update that entry with the corresponding Title and Meta description. Being able to retrieve og tags such as meta property="og:url" would also be nice.
Thanks.
Answer 0 (score: 3)
To get the title or description of an external site, you have to do two things.
1) You need to fetch the external site's HTML. 2) You need to parse the HTML and pull out the title element and the meta elements.
The first part is easy:
import urllib.request  # urllib2 in Python 2 was merged into urllib.request in Python 3

opener = urllib.request.build_opener()
external_sites_html = opener.open(external_sites_url).read()  # external_sites_url is the URL to inspect
The second part is harder, because we need an external library to parse the HTML. I like a library called BeautifulSoup because it has a very nice API (it's easy for programmers to use).
from bs4 import BeautifulSoup

soup = BeautifulSoup(external_sites_html, "html.parser")  # name a parser explicitly to avoid a warning
# Now we can get the tags of the external site from the soup variable.
title = soup.title.string
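Since the question also asks about the meta description and og: properties, here is a minimal sketch of pulling those out of the same soup (assuming the page actually carries these tags):

# <meta name="description" content="..."> uses the name attribute
description_tag = soup.find("meta", attrs={"name": "description"})
description = description_tag.get("content", "") if description_tag else ""
# Open Graph tags use the property attribute, e.g. <meta property="og:url" content="...">
og_url_tag = soup.find("meta", attrs={"property": "og:url"})
og_url = og_url_tag.get("content", "") if og_url_tag else ""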
However, it is important to remember that the external site may respond slowly when we fetch it, so it may be wise to create the record for the external site in the database first and return the response to the user, and then go fetch the URL and add the extra information to the database in some other process. If returning the extra information in the response matters, you cannot do this in the background and will have to make your users wait.
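As a rough sketch of that deferred approach (the Link model and its field names are hypothetical, and a real deployment would more likely use a task queue such as Celery than a bare thread):

import threading
from urllib.request import urlopen
from bs4 import BeautifulSoup

def fetch_and_update(link_id):
    # runs outside the request/response cycle, so the user never waits on the fetch
    link = Link.objects.get(pk=link_id)  # hypothetical Django model with url/title/description fields
    soup = BeautifulSoup(urlopen(link.url).read(), "html.parser")
    if soup.title and soup.title.string:
        link.title = soup.title.string
    tag = soup.find("meta", attrs={"name": "description"})
    if tag and tag.get("content"):
        link.description = tag["content"]
    link.save()

# in the view, right after saving the new link:
threading.Thread(target=fetch_and_update, args=(link.pk,)).start()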
Answer 1 (score: 1)
I took @ryan-pergent's answer and improved on it; here is metadata.py:
import re
import subprocess
from subprocess import TimeoutExpired
from bs4 import BeautifulSoup, Comment
from urllib.parse import urljoin

class Metadata:
    url = ""
    type = ""  # https://ogp.me/#types
    title = ""
    description = ""
    image = ""

    def __str__(self):
        return "{url: " + self.url + ", type: " + self.type + ", title: " + self.title + ", description: " + self.description + ", image: " + self.image + "}"

class Metadatareader:
    @staticmethod
    def get_metadata_from_url_in_text(text):
        # look for the first url in the text
        # and extract that url's metadata
        urls_in_text = Metadatareader.get_urls_from_text(text)
        if len(urls_in_text) > 0:
            return Metadatareader.get_url_metadata(urls_in_text[0])
        return Metadata()

    @staticmethod
    def get_urls_from_text(text):
        # find every url in the text
        # and return them as a list
        regex = r"(?:(?:https?|ftp):\/\/|\b(?:[a-z\d]+\.))(?:(?:[^\s()<>]+|\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))?\))+(?:\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))?\)|[^\s`!()\[\]{};:\'\".,<>?«»“”‘’]))?"
        return re.findall(regex, text)

    @staticmethod
    def get_url_metadata(url):
        # get the final url after all redirections,
        # then get the html of the final url
        # and fill the metadata with whatever is available
        url = Metadatareader.get_final_url(url)
        url_content = Metadatareader.get_url_content(url)
        soup = BeautifulSoup(url_content, "html.parser")
        metadata = Metadata()
        metadata.url = url
        metadata.type = "website"
        for meta in soup.findAll("meta"):
            # prioritize the Open Graph protocol
            # https://ogp.me/
            metadata.type = Metadatareader.get_meta_property(meta, "og:type", metadata.type)
            metadata.title = Metadatareader.get_meta_property(meta, "og:title", metadata.title)
            metadata.description = Metadatareader.get_meta_property(meta, "og:description", metadata.description)
            metadata.image = Metadatareader.get_meta_property(meta, "og:image", metadata.image)
        if metadata.image:
            metadata.image = urljoin(url, metadata.image)
        if not metadata.title and soup.title:
            # fall back to the page title
            metadata.title = soup.title.text
        if not metadata.image:
            # fall back to the first img element
            images = soup.find_all('img')
            if len(images) > 0:
                metadata.image = urljoin(url, images[0].get('src'))
        if not metadata.description and soup.body:
            # fall back to the text of the body
            for text in soup.body.find_all(string=True):
                if text.parent.name != 'script' and text.parent.name != 'style' and not isinstance(text, Comment):
                    metadata.description += text
        if metadata.description:
            # collapse whitespace and line breaks
            metadata.description = re.sub('\n|\r|\t', ' ', metadata.description)
            metadata.description = re.sub(' +', ' ', metadata.description)
            metadata.description = metadata.description.strip()
        return metadata

    @staticmethod
    def get_final_url(url, timeout=5):
        # get the final url after all redirections
        # by fetching only the response headers
        # and keeping the last "Location: " header seen
        proc = subprocess.Popen([
            "curl",
            "-Ls",  # follow redirects silently
            "-I",   # don't download the html body
            url
        ], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        try:
            out, err = proc.communicate(timeout=timeout)
        except TimeoutExpired:
            proc.kill()
            out, err = proc.communicate()
        final_url = url
        for line in out.decode(errors="ignore").split("\r\n"):
            if line.startswith("Location: "):
                final_url = line.replace("Location: ", "")
        return final_url

    @staticmethod
    def get_url_content(url, timeout=5):
        # get the url's html with curl
        proc = subprocess.Popen([
            "curl",
            "-i",
            "-k",  # ignore ssl certificate problems
            "-L",  # follow redirects
            url
        ], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        try:
            out, err = proc.communicate(timeout=timeout)
        except TimeoutExpired:
            proc.kill()
            out, err = proc.communicate()
        return out

    @staticmethod
    def get_meta_property(meta, property_name, default_value=""):
        if 'property' in meta.attrs and meta.attrs['property'] == property_name:
            return meta.attrs['content']
        return default_value
This is how I use it:
from metadatareader import Metadata, Metadatareader
content = "YOUR TEXT CONTAINING URLS GOES HERE, LIKE google.com"
metadata = Metadatareader.get_metadata_from_url_in_text(content)
print(metadata)
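One design note: get_final_url and get_url_content shell out to the curl binary, so the code depends on curl being installed. If adding the third-party requests library is an option (my assumption, not part of the original answer), a minimal sketch of the same two helpers could be:

import requests

def get_final_url(url, timeout=5):
    # requests follows redirects by default; response.url is the final address
    return requests.get(url, timeout=timeout).url

def get_url_content(url, timeout=5):
    # verify=False mirrors curl's -k flag (skip certificate checks)
    return requests.get(url, timeout=timeout, verify=False).text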
Answer 2 (score: 0)
Are you asking about extracting the title and meta tags from an external web page? I'm a fan of mechanize and BeautifulSoup. An example that extracts the title follows.
from mechanize import Browser
from BeautifulSoup import BeautifulSoup

def get_title(url):
    br = Browser()
    r = br.open(url)
    soup = BeautifulSoup(r)
    return soup.find("title").text
To get the meta tags, I would use something like:

for meta in soup.findAll("meta"):
    # .get avoids a KeyError on meta tags that lack a name or content attribute
    print(meta.get('name'), meta.get('content'))
Of course, you will probably want to do something with them other than print them.
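For instance, a minimal sketch (my addition, not part of the original answer) that collects the tags into a dict for later lookups:

# collect every named meta tag into a dict keyed by name or property
meta_tags = {}
for meta in soup.findAll("meta"):
    key = meta.get('name') or meta.get('property')
    if key and meta.get('content'):
        meta_tags[key] = meta['content']
print(meta_tags.get('description'))
print(meta_tags.get('og:url'))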
Answer 3 (score: 0)
Feel free to fall back to your own default values when a page has no og:title, og:description, or og:image metadata :)
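For example, building on the Metadatareader from answer 1 (the fallback strings here are just placeholders):

metadata = Metadatareader.get_metadata_from_url_in_text(content)
title = metadata.title or "Untitled link"  # placeholder default
description = metadata.description or "No description available."
image = metadata.image or "/static/placeholder.png"  # hypothetical static asset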
More information on BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/