Python: scrape part of a page's source code and save it as HTML

Date: 2017-10-23 05:32:59

Tags: python html urllib

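I need to save the source code of a web page as an HTML file. The page has many sections I don't need, though; I only want to save the source code of the article itself.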

Code:

from urllib.request import urlopen

page = urlopen('http://www.abcde.com')
page_content = page.read()

with open('page_content.html', 'wb') as f:
    f.write(page_content)

My code saves the entire page source, but how can I save only the part I want?

Clarification:

<div itemscope itemtype="http://schema.org/MedicalWebPage">
.
.
.
</div>

I need to save the source code inside this tag, markup included, rather than extract the sentences from it.
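The difference between keeping the markup and extracting the text can be sketched as follows. This is a minimal illustration, assuming Beautiful Soup (`bs4`) is installed; the inline `html` string is a made-up stand-in for the real page, and the `nav` element is hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the downloaded page (assumption, not the real site).
html = """
<html><body>
<nav>Site menu</nav>
<div itemscope itemtype="http://schema.org/MedicalWebPage">
<h1 itemprop="name">Apple</h1>
<p>It is cultivated worldwide.</p>
</div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
tag = soup.find("div", attrs={"itemtype": "http://schema.org/MedicalWebPage"})

markup = str(tag)      # the element's HTML, tags included
text = tag.get_text()  # only the text content, tags stripped

print(markup)
print(text)
```

Note that `str(tag)` re-serializes the parsed tree, so the saved markup may differ slightly from the original bytes (attribute quoting, for instance) even though the structure is the same.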

The result I want is a saved file like this:

<div itemscope itemtype="http://schema.org/MedicalWebPage">

                    <div class="col-md-12 col-xs-12" style="padding-left:10px;">
                        <h1 itemprop="name" class="page_article_title" title="Apple" id="mask">Apple</h1>
                    </div>
                    <!--Article Start-->
                    <section class="page_article_div" id="print">
                        <article itemprop="text" class="page_article_content">
<p>
    <img alt="Apple" src="http://www.abcde.com/383741719.jpg" style="width: 300px; height: 200px;" /></p>
<p>
    The apple tree (Malus pumila, commonly and erroneously called Malus domestica) is a deciduous tree in the rose family best known for its sweet, pomaceous fruit, the apple.</p>
<p>
    It is cultivated worldwide as a fruit tree, and is the most widely grown species in the genus Malus.</p>
<p>
    <strong><span style="color: #884499;">Appe is red</span></strong></p>
<ol>
    <li>
        Germanic paganism</li>
    <li>
        Greek mythology</li>
</ol>
<p style="text-align: right;">
    【Jane】</p>
<p style="text-align: right;">
    Credit : Wiki</p>

                        </article>
                            <div style="text-align:right;font-size:1.2em;"><a class="authorlink" href="http://www.abcde.com/web/online;url=http://61.66.117.1234/name=2017">2017</a></div>
                        <br />                  
                        <div style="text-align:right;font-size:1.2em;">【Thank you!】</div>
                    </section>
                    <!--Article End-->
</div>

4 answers:

Answer 0 (score: 1)

My own solution:

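from urllib.request import urlopen
from bs4 import BeautifulSoup

page = urlopen('http://www.abcde.com')
page_content = page.read()
soup = BeautifulSoup(page_content, "lxml")
# Collect the markup of every matching <div> (avoid naming a variable "list")
fragments = []
for tag in soup.select('div[itemtype="http://schema.org/MedicalWebPage"]'):
    fragments.append(str(tag))
# Join with newlines; joining with ', ' would write stray commas into the HTML
html_out = '\n'.join(fragments)
with open('C:/html/try.html', 'w', encoding='UTF-8') as f:
    f.write(html_out)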

I'm a beginner, so I wanted to keep this as simple as possible. This is my answer, and so far it works quite well :)
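The same idea can be wrapped in a small helper that fails loudly when nothing matches. This is only a sketch: `save_fragment`, `SELECTOR`, the sample markup, and the temporary output path are hypothetical names chosen for illustration, and it uses the built-in `html.parser` so that `lxml` is not required.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

SELECTOR = 'div[itemtype="http://schema.org/MedicalWebPage"]'


def save_fragment(html, selector, out_path):
    """Write the first element matching selector to out_path; return its markup."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.select_one(selector)
    if tag is None:
        raise ValueError(f"no element matches {selector!r}")
    fragment = str(tag)
    Path(out_path).write_text(fragment, encoding="utf-8")
    return fragment


# Made-up page: only the middle <div> should end up in the file.
sample = (
    '<html><body><header>menu</header>'
    '<div itemscope itemtype="http://schema.org/MedicalWebPage">'
    '<h1>Apple</h1></div><footer>links</footer></body></html>'
)

out_file = Path(tempfile.mkdtemp()) / "article.html"
saved = save_fragment(sample, SELECTOR, out_file)
print(saved)
```

For a live page, the `html` argument could be the bytes returned by `urlopen(...).read()`, as in the question.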

Answer 1 (score: 0)

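You can search for the tag by one of its attributes, such as its class, tag name, or id, and then save it in whatever format you want, as in the example below.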

from bs4 import BeautifulSoup

# yoursavedfile is an open file object for the HTML you saved earlier
soup = BeautifulSoup(yoursavedfile.read(), 'html.parser')
tag_for_me = soup.find_all(class_='class_name_of_your_tag')
print(tag_for_me)

tag_for_me will contain the markup you need.
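As a rough illustration of the three lookups mentioned above (class, id, tag name), here is a sketch against a trimmed-down copy of the question's markup. The `html` string is an abbreviated assumption, not the full page.

```python
from bs4 import BeautifulSoup

# Abbreviated version of the article markup from the question.
html = """
<section class="page_article_div" id="print">
  <h1 class="page_article_title">Apple</h1>
  <article class="page_article_content"><p>It is cultivated worldwide.</p></article>
</section>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.find_all(class_="page_article_title")  # list of all matches
by_id = soup.find(id="print")                          # first match, or None
by_name = soup.find("article")                         # first <article>, or None

print(by_class, by_id.name, by_name.name)
```

`find_all` returns a list-like result even for a single match, while `find` returns one element or `None`, so check for `None` before calling `str()` on the result.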

Answer 2 (score: 0)

You can use Beautiful Soup to get any piece of HTML source you need.

import requests
from bs4 import BeautifulSoup

target_class = "gb4"
target_text = "Web History"
r = requests.get("https://google.com")
soup = BeautifulSoup(r.text, "lxml")

for elem in soup.find_all(attrs={"class":target_class}):
    if elem.text == target_text:
        print(elem)

Output:

<a class="gb4" href="http://www.google.com/history/optout?hl=en">Web History</a>
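One caveat with comparing `elem.text` exactly: markup like the question's article often has newlines and indentation inside the tags, in which case an exact comparison may not match. A possible workaround, sketched here with a made-up `credit` class, is to compare against `get_text(strip=True)`.

```python
from bs4 import BeautifulSoup

# The "credit" class is hypothetical; the whitespace mimics the question's markup.
html = '<p class="credit">\n    Credit : Wiki</p><p class="credit">Other</p>'
soup = BeautifulSoup(html, "html.parser")
target_text = "Credit : Wiki"

candidates = soup.find_all(attrs={"class": "credit"})

# Exact comparison sees the leading newline and spaces.
exact = [e for e in candidates if e.text == target_text]

# Stripping surrounding whitespace before comparing is more forgiving.
stripped = [e for e in candidates if e.get_text(strip=True) == target_text]

print(len(exact), len(stripped))
```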

Answer 3 (score: 0)

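Use BeautifulSoup to find the place where you want to insert, and to get the HTML you want to insert. Use insert() to add the new tag, then overwrite the original file.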

from bs4 import BeautifulSoup
import requests

# Use Beautiful Soup to find the place where you want to insert.
# div_tag is the extracted div.
soup = BeautifulSoup("Your content here", 'lxml')
div_tag = soup.find('div', attrs={'id': 'itemscope'})
# e.g.
# div_tag = <div id=itemscope itemtype="http://schema.org/MedicalWebPage">
# </div>

res = requests.get('url to get content from')
soup1 = BeautifulSoup(res.text, 'lxml')
insert_data = soup1.find('your div/data to insert')
# This inserts the tag into div_tag at position 3.
div_tag.insert(3, insert_data)
# div_tag now contains your desired output; write it over the original page_content.html.
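To see what insert() does without any network access, here is a small offline sketch. Both HTML strings and the `target` id are made up for illustration; position 0 is used so the new tag lands first inside the div.

```python
from bs4 import BeautifulSoup

# Two made-up documents: one to insert into, one to take content from.
page = BeautifulSoup('<div id="target"><p>old</p></div>', "html.parser")
other = BeautifulSoup('<article><p>new</p></article>', "html.parser")

div_tag = page.find("div", attrs={"id": "target"})
insert_data = other.find("article")

# Move the <article> into div_tag as its first child.
div_tag.insert(0, insert_data)

result = str(div_tag)
print(result)
```

Inserting a tag that already belongs to another tree moves it rather than copying it, so `other` no longer contains the `<article>` afterwards.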