I have the following Python code:
content = webpage.content
soup = Soup(content, 'html.parser')
app_url = scheme + app_identity.get_default_version_hostname() + '/'
for link in soup.find_all(href = True):
    if scheme in link['href']:
        link['href'] = link['href'].replace(scheme, app_url)
        logging.info('@MirrorPage | Updated link: %s', link['href'])
    else:
        link['href'] = input_url + link['href'].strip('/')
        logging.info('@MirrorPage | Updated asset: %s', link['href'])
# https://stackoverflow.com/questions/15455148/find-after-replacewith-doesnt-work-using-beautifulsoup/19612218#19612218
#soup = Soup(soup.renderContents())
# https://stackoverflow.com/questions/14369447/how-to-save-back-changes-made-to-a-html-file-using-beautifulsoup-in-python
content = soup.prettify(soup.original_encoding)
and I render my HTML like this:
self.response.write(Environment().from_string(unicode(content, errors = 'ignore')).render())
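app_identity comes from Google App Engine, and jinja2 is used for templating/rendering. I have done my best to write the modified HTML back to the content variable so that the correct web page is rendered. How do I properly write back the changes I have made? I tried using replaceWith where appropriate, but that does not seem to work. Am I doing something fundamentally wrong?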
Answer 0: (score: 0)
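This function saves the HTML and returns it so that it can be reprocessed as needed.
I tested it on stackoverflow, and it saved the HTML with the replaced links/scheme.
I used {{description}} as the contents of template.html.
It returns the rendered HTML as a variable, which is then passed back into a bs4 object and printed.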
#!/usr/bin/python
# -*- coding: utf-8 -*-
import codecs
from xml.sax.saxutils import escape
import os
import jinja2
import requests
from bs4 import BeautifulSoup as bs4


def revise_links():
    url = 'https://stackoverflow.com/'
    template_name = 'template.html'
    file_name = 'replaced'
    scheme = 'stackoverflow'
    replace_with = 'mysite'
    r = requests.get(url)
    html_bytes = r.text
    soup = bs4(html_bytes, 'lxml')
    description_source = soup.findAll()
    for a in soup.findAll(href=True):
        if scheme in a['href']:
            a['href'] = a['href'].replace(scheme, replace_with)
            print a['href']
        else:
            a['href'] = url + a['href'].strip('/')

    # RENDER THE NEW HTML FILE *
    def render(tpl_path, context):
        """Render html file with new data. Looks for the file in the current path"""
        (path, filename) = os.path.split(tpl_path)
        return jinja2.Environment(loader=jinja2.FileSystemLoader(path or './')).get_template(filename).render(context)

    # HTML DATA
    context = {'description': description_source}
    # Render the result
    result = render(template_name, context)
    # open the html
    # with open(file_name + '.html', 'a', encoding='utf-8') as f:
    #     f.write(result)  # write result
    # OPEN THE NEW HTML FILE READY TO REVISE **********************
    # f1 = open(file_name + '.html', 'r', encoding='utf-8')
    # descript = f1.read()
    return result


content = revise_links()
soup = bs4(content, 'lxml')
print soup
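
For comparison, here is a minimal sketch of the same idea without the template round trip. It is not the code from this answer: it assumes Python 3 and the built-in 'html.parser', and the names rewrite_links and mysite.example are hypothetical. The point it illustrates is that assigning to tag['href'] changes the parsed tree in place, so serializing the tree afterwards (here with str(soup)) already contains the edits.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def rewrite_links(html, page_url, old_host, new_host):
    """Return html with every href made absolute and old_host swapped for new_host.

    Hypothetical helper for illustration; not part of the original answer.
    """
    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup.find_all(href=True):
        # urljoin resolves relative paths against page_url and leaves
        # already-absolute URLs unchanged.
        absolute = urljoin(page_url, tag['href'])
        # Assigning to tag['href'] mutates the parsed tree in place.
        tag['href'] = absolute.replace(old_host, new_host)
    # Serializing the tree after the loop includes the modified attributes.
    return str(soup)


sample = '<a href="/questions">Q</a><a href="https://stackoverflow.com/tags">T</a>'
result = rewrite_links(sample, 'https://stackoverflow.com/',
                       'stackoverflow.com', 'mysite.example')
print(result)
```

If the string is then passed through jinja2's from_string as in the question, any literal {{ or {% in the mirrored page would be interpreted as template syntax, so writing the string to the response directly may be simpler.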
Answer 1: (score: 0)
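Changing the permissions of the service account under the IMAP preferences on the Google App Project fixed writing the changes. However, the basic HTML does not render the whole page; that is, when rendering a site like Google, the JavaScript and styles do not seem to work. I can simply use self.response.write(soup) to render the HTML, but that does not solve this problem. I will address this in a separate question, since it concerns actually retrieving (or scraping) the specified website.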
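
One possible contributing factor, offered as an assumption rather than a diagnosis: the loop in the question only rewrites href attributes, so relative src values on script and img tags would resolve against the mirror's host instead of the original site. A minimal sketch of resolving such values with the standard library follows; the name absolutize and the list of skipped schemes are hypothetical choices for this example.

```python
from urllib.parse import urljoin, urlparse

# Schemes left untouched in this sketch (an illustrative choice).
SKIP_SCHEMES = ('data', 'javascript', 'mailto')


def absolutize(value, page_url):
    """Resolve an href/src attribute value against the page it came from."""
    # Fragment-only links and non-fetchable schemes are returned unchanged.
    if value.startswith('#') or urlparse(value).scheme in SKIP_SCHEMES:
        return value
    # urljoin handles root-relative, document-relative and
    # protocol-relative (//host/path) references.
    return urljoin(page_url, value)


page = 'https://example.com/docs/index.html'
print(absolutize('/static/app.js', page))
print(absolutize('style.css', page))
```

The same function could be applied to tags selected with soup.find_all(src=True). It would not help with content that the page builds at runtime in JavaScript, which is a separate retrieval problem.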