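In this case, I need to save a web page's source code as an HTML file. The page has many parts I don't need, though; I only want to save the source code of the article itself.
Code: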
from urllib.request import urlopen

page = urlopen('http://www.abcde.com')
page_content = page.read()
# page.read() returns bytes, so the file is opened in binary mode.
with open('page_content.html', 'wb') as f:
    f.write(page_content)
My code saves the entire source, but how can I save only the part I want?
Explanation:
<div itemscope itemtype="http://schema.org/MedicalWebPage">
.
.
.
</div>
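I need to save the source code inside this tag, markup included, not extract the sentences (text) from it.
The result I want is a file saved like this: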
<div itemscope itemtype="http://schema.org/MedicalWebPage">
<div class="col-md-12 col-xs-12" style="padding-left:10px;">
<h1 itemprop="name" class="page_article_title" title="Apple" id="mask">Apple</h1>
</div>
<!--Article Start-->
<section class="page_article_div" id="print">
<article itemprop="text" class="page_article_content">
<p>
<img alt="Apple" src="http://www.abcde.com/383741719.jpg" style="width: 300px; height: 200px;" /></p>
<p>
The apple tree (Malus pumila, commonly and erroneously called Malus domestica) is a deciduous tree in the rose family best known for its sweet, pomaceous fruit, the apple.</p>
<p>
It is cultivated worldwide as a fruit tree, and is the most widely grown species in the genus Malus.</p>
<p>
<strong><span style="color: #884499;">Appe is red</span></strong></p>
<ol>
<li>
Germanic paganism</li>
<li>
Greek mythology</li>
</ol>
<p style="text-align: right;">
【Jane】</p>
<p style="text-align: right;">
Credit : Wiki</p>
</article>
<div style="text-align:right;font-size:1.2em;"><a class="authorlink" href="http://www.abcde.com/web/online;url=http://61.66.117.1234/name=2017">2017</a></div>
<br />
<div style="text-align:right;font-size:1.2em;">【Thank you!】</div>
</section>
<!--Article End-->
</div>
Answer 0 (score: 1)
My own solution:
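from urllib.request import urlopen
from bs4 import BeautifulSoup

page = urlopen('http://www.abcde.com')
page_content = page.read()
soup = BeautifulSoup(page_content, "lxml")
# Collect the full markup (tags included) of every matching <div>.
# The name "fragments" avoids shadowing the built-in list type.
fragments = []
for tag in soup.select('div[itemtype="http://schema.org/MedicalWebPage"]'):
    fragments.append(str(tag))
# Join with a newline; joining with ', ' would write stray commas into the HTML.
html = '\n'.join(fragments)
#print(html)
#print(type(html))
with open('C:/html/try.html', 'w', encoding='UTF-8') as f:
    f.write(html)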
I'm a beginner, so I wanted to do this as simply as possible. This is my answer, and so far it works quite well :)
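If the page contains only one such div, the loop and the join are not needed. Below is a minimal sketch of that variant, assuming Beautiful Soup 4 is installed; the helper name extract_fragment and the inline sample page are made up for illustration, and the built-in "html.parser" is used only so the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Same selector as in the answer above.
SELECTOR = 'div[itemtype="http://schema.org/MedicalWebPage"]'


def extract_fragment(html, selector=SELECTOR):
    """Return the markup of the first element matching selector, or None."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.select_one(selector)
    # str(tag) serialises the element together with its tags and children.
    return None if tag is None else str(tag)


# Hypothetical stand-in for the downloaded page.
sample = (
    '<html><body><nav>menu</nav>'
    '<div itemscope itemtype="http://schema.org/MedicalWebPage">'
    '<h1 itemprop="name">Apple</h1><p>It is cultivated worldwide.</p>'
    '</div><footer>footer</footer></body></html>'
)

fragment = extract_fragment(sample)
print(fragment)
```

Returning None when nothing matches lets the caller decide whether to skip writing the file, rather than saving an empty one.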
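Answer 1 (score: 0)
You can search for a tag by its attributes, such as its class, tag name, or id, and save it in whatever format you want, as in the example below.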
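from bs4 import BeautifulSoup

# yoursavedfile is a placeholder for an already opened file object.
soup = BeautifulSoup(yoursavedfile.read(), 'html.parser')
# find_elements_by_class_name is a Selenium method; in Beautiful Soup use find_all.
tag_for_me = soup.find_all(class_='class_name_of_your_tag')
print(tag_for_me)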
tag_for_me will contain the code you need.
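Since the question asks for the source rather than the text, it may help to see how a found tag can be serialised. The following is a small sketch, assuming Beautiful Soup 4; the sample markup and the variable names are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the saved page.
html = '<div id="box" class="note"><p>Hello <b>world</b></p></div>'
soup = BeautifulSoup(html, "html.parser")

# Three ways to locate a tag: by class, by id, and by tag name.
by_class = soup.find_all(class_="note")  # list of matching tags
by_id = soup.find(id="box")              # first match, or None
by_name = soup.find("p")                 # first <p> tag

# Three ways to turn the same tag into a string.
outer = str(by_id)               # markup including the <div> itself
inner = by_id.decode_contents()  # markup of the children only
text = by_id.get_text()          # text only, all tags removed

print(outer)
print(inner)
print(text)
```

For the question's purpose, str(tag) is the form to write to the file; get_text() is what would throw the markup away.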
Answer 2 (score: 0)
You can use Beautiful Soup to get any part of the HTML source you need.
import requests
from bs4 import BeautifulSoup

target_class = "gb4"
target_text = "Web History"
r = requests.get("https://google.com")
soup = BeautifulSoup(r.text, "lxml")
# Print every element with the target class whose text matches exactly.
for elem in soup.find_all(attrs={"class": target_class}):
    if elem.text == target_text:
        print(elem)
Output:
<a class="gb4" href="http://www.google.com/history/optout?hl=en">Web History</a>
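The comparison elem.text == target_text is exact, so an element whose text has surrounding whitespace or line breaks would be skipped. Here is a possible variant that strips whitespace before comparing and returns the markup as strings; the function name find_markup and the sample HTML are assumptions for illustration, not part of the answer above.

```python
from bs4 import BeautifulSoup


def find_markup(html, target_class, target_text):
    """Return the markup of elements with the given class and stripped text."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        str(elem)
        for elem in soup.find_all(attrs={"class": target_class})
        # get_text(strip=True) drops leading and trailing whitespace.
        if elem.get_text(strip=True) == target_text
    ]


# Hypothetical page fragment with untidy whitespace in the first link.
sample = (
    '<a class="gb4" href="/history">\n  Web History </a>'
    '<a class="gb4" href="/settings">Settings</a>'
    '<a class="other" href="/x">Web History</a>'
)

matches = find_markup(sample, "gb4", "Web History")
for markup in matches:
    print(markup)
```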
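Answer 3 (score: 0)
Use BeautifulSoup to find the place in the HTML where you want to insert, then get the HTML you want to insert. Use insert() to add the new tag. Finally, overwrite the original file.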
from bs4 import BeautifulSoup
import requests

# Use Beautiful Soup to find the place you want to insert into.
# div_tag is the extracted div.
soup = BeautifulSoup("Your content here", 'lxml')
div_tag = soup.find('div', attrs={'id': 'itemscope'})
# e.g.
# div_tag = <div id=itemscope itemtype="http://schema.org/MedicalWebPage"></div>
res = requests.get('url to get content from')
soup1 = BeautifulSoup(res.text, 'lxml')
insert_data = soup1.find('your div/data to insert')
# This inserts the tag into div_tag. You can write the result back to your original page_content.html.
div_tag.insert(3, insert_data)
# div_tag now contains your desired output. Overwrite the original file with it.
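The steps above leave the final write to the reader. Below is a self-contained sketch of the whole round trip, assuming Beautiful Soup 4; the two inline HTML strings stand in for the original page and the fetched content, and the file is written to a temporary directory rather than a real page_content.html.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical stand-ins for the original page and the fetched content.
page = '<html><body><div id="itemscope"><p>first</p></div></body></html>'
source = '<section class="extra"><p>inserted</p></section>'

soup = BeautifulSoup(page, "html.parser")
div_tag = soup.find("div", attrs={"id": "itemscope"})

# Parse the other document and pick the tag to move across.
insert_data = BeautifulSoup(source, "html.parser").find("section")

# insert(position, tag) places the tag among div_tag's children;
# position 0 puts it before the existing <p>.
div_tag.insert(0, insert_data)

# Write the modified document back out as text.
path = os.path.join(tempfile.mkdtemp(), "page_content.html")
with open(path, "w", encoding="utf-8") as f:
    f.write(str(soup))

print(str(div_tag))
```

Writing str(soup) saves the whole modified page; writing str(div_tag) instead would save only the div and its contents.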