Download only part of a web page, regardless of where it sits in the page source

Time: 2019-05-07 07:45:03

Tags: python web-scraping beautifulsoup

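I'm currently using a Python script to scrape selected data from a web page.

For context, it looks up the phonetics of some words in an online dictionary and also runs lookups on other, similar words (something like what Google's transliteration tool does). The problem is that each web page has to be downloaded in full before I can extract the data I want, which unfortunately sits near the end of the page source.

I was wondering whether it's possible to access specific elements of a web page without downloading all of it.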

Here is the code segment that currently does this:

    import requests
    from bs4 import BeautifulSoup

    # SuggestionJson and f (an open output file) come from earlier in the script
    for i in SuggestionJson['suggestions']:
        webpage = requests.get("https://www.vajehyab.com" + i['link'] + "&t=like")  # download whole webpage
        soup = BeautifulSoup(webpage.content, 'html.parser')
        phonetic = soup.find("div", {"id": "wordbox"}).section.header.h3.text.replace('/', '')  # extract data from div
        if phonetic != '':
            f.write(phonetic)  # save to file

What I have in mind is that it skips downloading elements like <head>, and skips all other <div> elements that don't match the ID I want. Is this possible?

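Edit: For example, say I have the following HTML (from ifconfig.me). I'd like the script to download only one part of this page (or at least something close to the target):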

<!DOCTYPE html>
<html lang="en">

<head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <meta http-equiv="content-style-type" content="text/css" />
    <meta http-equiv="content-script-type" content="text/javascript" />
    <meta http-equiv="content-language" content="en" />
    <meta http-equiv="pragma" content="no-cache" />
    <meta http-equiv="cache-control" content="no-cache" />
    <meta name="description" content="Get my IP Address" />
    <meta name="keywords" content="ip address ifconfig ifconfig.me" />
    <meta name="author" content="" />
    <link rel="shortcut icon" href="favicon.ico" />
    <link rel="canonical" href="https://ipinfo.io/">
    <title>What Is My IP Address? - ifconfig.me</title>
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <link href="/styles/style.css" rel="stylesheet" type="text/css">
</head>

<body>
    <div id="container" class="clearfix">
        <div id="header">
            <table>
                <tr>
                    <td>
                        <h1><a href="http://ifconfig.me">What Is My IP Address? - ifconfig.me</a></h1>
                    </td>
                    <td></td>
                </tr>
                <tr>
                    <td></td>
                    <td>
                        <div id="plungins">
                            <div class="plungin" id="button_facebook">
                                <div id="fb-root"></div>
                                <script src="http://connect.facebook.net/en_US/all.js#xfbml=1"></script>
                                <fb:like href="http://ifconfig.me/" send="false" layout="button_count" width="100"
                                    show_faces="true" font=""></fb:like>
                            </div>

                            <div class="plungin" id="button_twitter">
                                <a href="http://twitter.com/share" class="twitter-share-button"
                                    data-url="http://ifconfig.me/" data-text="What Is My IP Address? - ifconfig.me
           " data-count="horizontal"></a>
                                <script type="text/javascript" src="http://platform.twitter.com/widgets.js"></script>
                            </div>

                            <div class="plungin" id="button_plusone">
                                <!-- Place this tag where you want the +1 button to render -->
                                <g:plusone size="medium" href="http://ifconfig.me/"></g:plusone>
                                <!-- Place this render call where appropriate -->
                                <script type="text/javascript">
                                    (function () {
                                        var po = document.createElement('script');
                                        po.type = 'text/javascript';
                                        po.async = true;
                                        po.src = 'https://apis.google.com/js/plusone.js';
                                        var s = document.getElementsByTagName('script')[0];
                                        s.parentNode.insertBefore(po, s);
                                    })();
                                </script>
                            </div>
                        </div>
                    </td>
                </tr>
            </table>
        </div>
        <div id="info_area">
            <h2>Your Connection</h2>
            <table id="info_table" summary="info">
                <tr>
                    <td class="info_table_label">IP Address</td>
                    <td id="ip_address_cell"><strong id="ip_address">2.177.115.178</strong></td>
                </tr>
                <tr>
                    <td class="info_table_label">Remote Host</td>
                    <td>unavailable</td>
                </tr>
                <tr>
                    <td class="info_table_label">User Agent</td>
                    <td>Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap
                        Chromium/74.0.3729.131 Chrome/74.0.3729.131 Safari/537.36</td>
                </tr>
                <tr>
                    <td class="info_table_label">Port</td>
                    <td>33966</td>
                </tr>
                <tr>
                    <td class="info_table_label">Language</td>
                    <td>en-US,en;q=0.9,fa-IR;q=0.8,fa;q=0.7,de;q=0.6</td>
                </tr>
                <tr>
                    <td class="info_table_label">Referer</td>
                    <td></td>
                </tr>
                <tr>
                    <td class="info_table_label">Connection</td>
                    <td></td>
                </tr>
                <tr>
                    <td class="info_table_label">KeepAlive</td>
                    <td></td>
                </tr>
                <tr>
                    <td class="info_table_label">Method</td>
                    <td>GET</td>
                </tr>
                <tr>
                    <td class="info_table_label">Encoding</td>
                    <td>gzip, deflate, br</td>
                </tr>
                <tr>
                    <td class="info_table_label">MIME Type</td>
                    <td> text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3
                    </td>
                </tr>
                <tr>
                    <td class="info_table_label">Charset</td>
                    <td></td>
                </tr>
                <tr>
                    <td class="info_table_label">Via</td>
                    <td>1.1 google</td>
                </tr>
                <tr>
                    <td class="info_table_label">X-Forwarded-For</td>
                    <td>2.177.115.178, 216.239.34.21</td>
                </tr>
            </table>
        </div>
        <!--<div id="middle"></div>-->
        <div id="cli_wrap">
            <h2>Command Line Interface</h2>
            <table id="cli_table" summary="cli">
                <tr>
                    <td class="cli_command">$ curl ifconfig.me</td>
                    <td class="cli_arrow">&rArr;</td>
                    <td>2.177.115.178</td>
                </tr>
                <tr>
                    <td class="cli_command">$ curl ifconfig.me/ip</td>
                    <td class="cli_arrow">&rArr;</td>
                    <td>2.177.115.178</td>
                </tr>
                <tr>
                    <td class="cli_command">$ curl ifconfig.me/host</td>
                    <td class="cli_arrow">&rArr;</td>
                    <td>unavailable</td>
                </tr>
                <tr>
                    <td class="cli_command">$ curl ifconfig.me/ua</td>
                    <td class="cli_arrow">&rArr;</td>
                    <td>Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap
                        Chromium/74.0.3729.131 Chrome/74.0.3729.131 Safari/537.36</td>
                </tr>
                <tr>
                    <td class="cli_command">$ curl ifconfig.me/port</td>
                    <td class="cli_arrow">&rArr;</td>
                    <td>33966</td>
                </tr>
                <tr>
                    <td class="cli_command">$ curl ifconfig.me/lang</td>
                    <td class="cli_arrow">&rArr;</td>
                    <td>en-US,en;q=0.9,fa-IR;q=0.8,fa;q=0.7,de;q=0.6</td>
                </tr>
                <tr>
                    <td class="cli_command">$ curl ifconfig.me/keepalive</td>
                    <td class="cli_arrow">&rArr;</td>
                    <td></td>
                </tr>
                <tr>
                    <td class="cli_command">$ curl ifconfig.me/connection</td>
                    <td class="cli_arrow">&rArr;</td>
                    <td></td>
                </tr>
                <tr>
                    <td class="cli_command">$ curl ifconfig.me/encoding</td>
                    <td class="cli_arrow">&rArr;</td>
                    <td>gzip, deflate, br</td>
                </tr>
                <tr>
                    <td class="cli_command">$ curl ifconfig.me/mime</td>
                    <td class="cli_arrow">&rArr;</td>
                    <td>text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3
                    </td>
                </tr>
                <tr>
                    <td class="cli_command">$ curl ifconfig.me/charset</td>
                    <td class="cli_arrow">&rArr;</td>
                    <td></td>
                </tr>
                <tr>
                    <td class="cli_command">$ curl ifconfig.me/via</td>
                    <td class="cli_arrow">&rArr;</td>
                    <td>1.1 google</td>
                </tr>
                <tr>
                    <td class="cli_command">$ curl ifconfig.me/forwarded</td>
                    <td class="cli_arrow">&rArr;</td>
                    <td>2.177.115.178, 216.239.34.21</td>
                </tr>
                <tr>
                    <td class="cli_command">$ curl ifconfig.me/all</td>
                    <td class="cli_arrow">&rArr;</td>
                    <td>

                        ip_addr: 2.177.115.178
                        <br>

                        remote_host: unavailable
                        <br>

                        user_agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap
                        Chromium/74.0.3729.131 Chrome/74.0.3729.131 Safari/537.36
                        <br>

                        port: 33966
                        <br>

                        language: en-US,en;q=0.9,fa-IR;q=0.8,fa;q=0.7,de;q=0.6
                        <br>

                        referer:
                        <br>

                        connection:
                        <br>

                        keep_alive:
                        <br>

                        method: GET
                        <br>

                        encoding: gzip, deflate, br
                        <br>

                        mime:
                        text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3
                        <br>

                        charset:
                        <br>

                        via: 1.1 google
                        <br>

                        forwarded: 2.177.115.178, 216.239.34.21
                        <br>

                    </td>
                </tr>
                <tr>
                    <td class="cli_command">$ curl ifconfig.me/all.xml</td>
                    <td class="cli_arrow">&rArr;</td>
                    <td>&lt;info&gt;
                        &lt;ip_addr&gt;2.177.115.178&lt;/ip_addr&gt;
                        &lt;remote_host&gt;unavailable&lt;/remote_host&gt;
                        &lt;user_agent&gt;Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap
                        Chromium/74.0.3729.131 Chrome/74.0.3729.131 Safari/537.36&lt;/user_agent&gt;
                        &lt;port&gt;33966&lt;/port&gt;
                        &lt;language&gt;en-US,en;q=0.9,fa-IR;q=0.8,fa;q=0.7,de;q=0.6&lt;/language&gt;
                        &lt;referer&gt;&lt;/referer&gt;
                        &lt;connection&gt;&lt;/connection&gt;
                        &lt;keep_alive&gt;&lt;/keep_alive&gt;
                        &lt;method&gt;GET&lt;/method&gt;
                        &lt;encoding&gt;gzip, deflate, br&lt;/encoding&gt;
                        &lt;mime&gt;text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3&lt;/mime&gt;
                        &lt;charset&gt;&lt;/charset&gt;
                        &lt;via&gt;1.1 google&lt;/via&gt;
                        &lt;forwarded&gt;2.177.115.178, 216.239.34.21&lt;/forwarded&gt;
                        &lt;/info&gt;</td>
                </tr>
                <tr>
                    <td class="cli_command">$ curl ifconfig.me/all.json</td>
                    <td class="cli_arrow">&rArr;</td>
                    <td>{&quot;ip_addr&quot;:&quot;2.177.115.178&quot;,&quot;remote_host&quot;:&quot;unavailable&quot;,&quot;user_agent&quot;:&quot;Mozilla/5.0
                        (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap Chromium/74.0.3729.131
                        Chrome/74.0.3729.131
                        Safari/537.36&quot;,&quot;port&quot;:33966,&quot;language&quot;:&quot;en-US,en;q=0.9,fa-IR;q=0.8,fa;q=0.7,de;q=0.6&quot;,&quot;method&quot;:&quot;GET&quot;,&quot;encoding&quot;:&quot;gzip,
                        deflate,
                        br&quot;,&quot;mime&quot;:&quot;text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3&quot;,&quot;via&quot;:&quot;1.1
                        google&quot;,&quot;forwarded&quot;:&quot;2.177.115.178, 216.239.34.21&quot;}</td>
                </tr>
            </table>
        </div>
        <div id="footer">&copy; 2018 ifconfig.me</div>
    </div>
</body>

</html>

Edit 2: The web pages I'm working with also don't send a Content-Length header.

1 answer:

Answer 0 (score: 1)

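Not exactly the way you have in mind. You'd like to instruct the web server to skip certain content based on tags, and while that is theoretically possible, it doesn't happen on regular web pages. (Perhaps with some kind of API, but you're scraping regular web pages.)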

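There is something that goes in the direction you're looking for, though. It's called HTTP range requests: instead of asking for the complete file, you ask for a range of the file. For example, if you know the web page is about 100 KB but the tag you're looking for is in the last 3 KB, you can ask the web server to send you only the last 3 KB.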
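
To make the 100 KB / 3 KB example concrete, here is a minimal sketch of the two forms a Range header can take for it. The helper names are made up for illustration, and the 102400-byte page size is just the example figure from the paragraph above; a real page's size would have to be known or guessed.

```python
def explicit_range(total_size, tail_size):
    """Range header for the last tail_size bytes when the total size is known."""
    start = max(total_size - tail_size, 0)
    end = total_size - 1  # byte positions are zero-based and inclusive
    return {"Range": f"bytes={start}-{end}"}


def suffix_range(tail_size):
    """Range header for the last tail_size bytes when the total size is unknown."""
    return {"Range": f"bytes=-{tail_size}"}


KB = 1024
print(explicit_range(100 * KB, 3 * KB))  # {'Range': 'bytes=99328-102399'}
print(suffix_range(3 * KB))              # {'Range': 'bytes=-3072'}
```

The suffix form is probably the more useful one when the server sends no Content-Length header, since the client never needs to know the page's total size.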

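Whether that works depends on how the web server and the software behind it are set up. Python requests can send such a request by setting a Range header. If the page is dynamically generated, the web server will often not honor your range request and will send you the full page instead.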
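
A rough sketch of what such a request could look like with requests, and how to tell whether the server honored it. The function names, the 3072-byte default, and the 10-second timeout are illustrative choices, not part of the original answer. The `Accept-Encoding: identity` header is added on the assumption that a slice of a gzip-compressed body could not be decoded on its own.

```python
def range_outcome(status_code):
    """Classify the response to a range request by its HTTP status code."""
    if status_code == 206:
        return "partial"        # Partial Content: only the requested bytes were sent
    if status_code == 200:
        return "full"           # range ignored: the body is the whole page
    if status_code == 416:
        return "unsatisfiable"  # Range Not Satisfiable
    return "other"


def fetch_tail(url, tail_size=3072):
    """Request only the last tail_size bytes of url; return (outcome, body bytes)."""
    import requests  # third-party; imported here so range_outcome works without it

    headers = {
        "Range": f"bytes=-{tail_size}",
        "Accept-Encoding": "identity",  # ask for an uncompressed body
    }
    response = requests.get(url, headers=headers, timeout=10)
    return range_outcome(response.status_code), response.content


print(range_outcome(206))  # partial
print(range_outcome(200))  # full
```

If `fetch_tail` returns "full" for one of the dictionary URLs, the server ignored the range and sent the whole page, which is the likely result for a dynamically generated page.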

(If this works, I'm not sure whether BeautifulSoup will understand the piecemeal HTML you get back. There's a chance, though; it is very forgiving!)
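
How a parser copes with a fragment can be checked offline before any range request is made. The sketch below does not use BeautifulSoup itself. As a stand-in it uses the standard library's html.parser, which BeautifulSoup's 'html.parser' mode is built on, so BeautifulSoup's actual behavior may differ. The fragment is a hand-cut slice of the ifconfig.me markup from the question, cut mid-tag at both ends, and the class and function names are made up for illustration.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags with no closing tag


class TextById(HTMLParser):
    """Collect the text inside the first element whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__(convert_charrefs=True)
        self.target_id = target_id
        self.depth = 0      # > 0 while the parser is inside the target element
        self.done = False   # set once the target element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_text(fragment, target_id):
    """Return the text of the element with target_id, or '' if it is not found."""
    parser = TextById(target_id)
    parser.feed(fragment)
    parser.close()
    return "".join(parser.chunks).strip()


fragment = (
    'able_label">IP Address</td>\n'
    '<td id="ip_address_cell"><strong id="ip_address">2.177.115.178</strong></td>\n'
    '</tr>\n<tr>\n<td class="info_ta'
)
print(extract_text(fragment, "ip_address"))  # 2.177.115.178
```

The stray text and the unmatched closing tag at the start of the fragment are simply skipped, which suggests that a lenient parser can still find an element as long as the slice contains that element whole.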