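I am currently using a Python script to search web pages for selected data.
For context, it looks up the phonetics of some words in an online dictionary, and does the same for some other similar words (similar to what the Google transliterator does). The problem is that each web page has to be downloaded in full before I can extract the data I need (which, unfortunately, is near the end of the page source).
I was wondering whether it is possible to access specific elements of a web page without downloading all of the data.
This is the code I currently use to do this: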
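import requests
from bs4 import BeautifulSoup

# SuggestionJson and the open output file f are defined earlier in the script
for i in SuggestionJson['suggestions']:
    webpage = requests.get("https://www.vajehyab.com" + i['link'] + "&t=like")  # download the whole web page
    soup = BeautifulSoup(webpage.content, 'html.parser')
    phonetic = soup.find("div", {"id": "wordbox"}).section.header.h3.text.replace('/', '')  # extract the data from the div
    if phonetic != '':  # save to file
        f.write(phonetic)

What I have in mind is that it skips downloading elements such as <head>, and skips all the other <div> elements that do not match the ID I want.
Is this possible?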
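Edit: For example, say I have the following HTML (from ifconfig.me). I would like the script to download only the part of the page I am after (or at least something close to it):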
<!DOCTYPE html>
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta http-equiv="content-style-type" content="text/css" />
<meta http-equiv="content-script-type" content="text/javascript" />
<meta http-equiv="content-language" content="en" />
<meta http-equiv="pragma" content="no-cache" />
<meta http-equiv="cache-control" content="no-cache" />
<meta name="description" content="Get my IP Address" />
<meta name="keywords" content="ip address ifconfig ifconfig.me" />
<meta name="author" content="" />
<link rel="shortcut icon" href="favicon.ico" />
<link rel="canonical" href="https://ipinfo.io/">
<title>What Is My IP Address? - ifconfig.me</title>
<meta name="viewport" content="width=device-width, initial-scale=1">
<link href="/styles/style.css" rel="stylesheet" type="text/css">
</head>
<body>
<div id="container" class="clearfix">
<div id="header">
<table>
<tr>
<td>
<h1><a href="http://ifconfig.me">What Is My IP Address? - ifconfig.me</a></h1>
</td>
<td></td>
</tr>
<tr>
<td></td>
<td>
<div id="plungins">
<div class="plungin" id="button_facebook">
<div id="fb-root"></div>
<script src="http://connect.facebook.net/en_US/all.js#xfbml=1"></script>
<fb:like href="http://ifconfig.me/" send="false" layout="button_count" width="100"
show_faces="true" font=""></fb:like>
</div>
<div class="plungin" id="button_twitter">
<a href="http://twitter.com/share" class="twitter-share-button"
data-url="http://ifconfig.me/" data-text="What Is My IP Address? - ifconfig.me
" data-count="horizontal"></a>
<script type="text/javascript" src="http://platform.twitter.com/widgets.js"></script>
</div>
<div class="plungin" id="button_plusone">
<!-- Place this tag where you want the +1 button to render -->
<g:plusone size="medium" href="http://ifconfig.me/"></g:plusone>
<!-- Place this render call where appropriate -->
<script type="text/javascript">
(function () {
var po = document.createElement('script');
po.type = 'text/javascript';
po.async = true;
po.src = 'https://apis.google.com/js/plusone.js';
var s = document.getElementsByTagName('script')[0];
s.parentNode.insertBefore(po, s);
})();
</script>
</div>
</div>
</td>
</tr>
</table>
</div>
<div id="info_area">
<h2>Your Connection</h2>
<table id="info_table" summary="info">
<tr>
<td class="info_table_label">IP Address</td>
<td id="ip_address_cell"><strong id="ip_address">2.177.115.178</strong></td>
</tr>
<tr>
<td class="info_table_label">Remote Host</td>
<td>unavailable</td>
</tr>
<tr>
<td class="info_table_label">User Agent</td>
<td>Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap
Chromium/74.0.3729.131 Chrome/74.0.3729.131 Safari/537.36</td>
</tr>
<tr>
<td class="info_table_label">Port</td>
<td>33966</td>
</tr>
<tr>
<td class="info_table_label">Language</td>
<td>en-US,en;q=0.9,fa-IR;q=0.8,fa;q=0.7,de;q=0.6</td>
</tr>
<tr>
<td class="info_table_label">Referer</td>
<td></td>
</tr>
<tr>
<td class="info_table_label">Connection</td>
<td></td>
</tr>
<tr>
<td class="info_table_label">KeepAlive</td>
<td></td>
</tr>
<tr>
<td class="info_table_label">Method</td>
<td>GET</td>
</tr>
<tr>
<td class="info_table_label">Encoding</td>
<td>gzip, deflate, br</td>
</tr>
<tr>
<td class="info_table_label">MIME Type</td>
<td> text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3
</td>
</tr>
<tr>
<td class="info_table_label">Charset</td>
<td></td>
</tr>
<tr>
<td class="info_table_label">Via</td>
<td>1.1 google</td>
</tr>
<tr>
<td class="info_table_label">X-Forwarded-For</td>
<td>2.177.115.178, 216.239.34.21</td>
</tr>
</table>
</div>
<!--<div id="middle"></div>-->
<div id="cli_wrap">
<h2>Command Line Interface</h2>
<table id="cli_table" summary="cli">
<tr>
<td class="cli_command">$ curl ifconfig.me</td>
<td class="cli_arrow">⇒</td>
<td>2.177.115.178</td>
</tr>
<tr>
<td class="cli_command">$ curl ifconfig.me/ip</td>
<td class="cli_arrow">⇒</td>
<td>2.177.115.178</td>
</tr>
<tr>
<td class="cli_command">$ curl ifconfig.me/host</td>
<td class="cli_arrow">⇒</td>
<td>unavailable</td>
</tr>
<tr>
<td class="cli_command">$ curl ifconfig.me/ua</td>
<td class="cli_arrow">⇒</td>
<td>Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap
Chromium/74.0.3729.131 Chrome/74.0.3729.131 Safari/537.36</td>
</tr>
<tr>
<td class="cli_command">$ curl ifconfig.me/port</td>
<td class="cli_arrow">⇒</td>
<td>33966</td>
</tr>
<tr>
<td class="cli_command">$ curl ifconfig.me/lang</td>
<td class="cli_arrow">⇒</td>
<td>en-US,en;q=0.9,fa-IR;q=0.8,fa;q=0.7,de;q=0.6</td>
</tr>
<tr>
<td class="cli_command">$ curl ifconfig.me/keepalive</td>
<td class="cli_arrow">⇒</td>
<td></td>
</tr>
<tr>
<td class="cli_command">$ curl ifconfig.me/connection</td>
<td class="cli_arrow">⇒</td>
<td></td>
</tr>
<tr>
<td class="cli_command">$ curl ifconfig.me/encoding</td>
<td class="cli_arrow">⇒</td>
<td>gzip, deflate, br</td>
</tr>
<tr>
<td class="cli_command">$ curl ifconfig.me/mime</td>
<td class="cli_arrow">⇒</td>
<td>text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3
</td>
</tr>
<tr>
<td class="cli_command">$ curl ifconfig.me/charset</td>
<td class="cli_arrow">⇒</td>
<td></td>
</tr>
<tr>
<td class="cli_command">$ curl ifconfig.me/via</td>
<td class="cli_arrow">⇒</td>
<td>1.1 google</td>
</tr>
<tr>
<td class="cli_command">$ curl ifconfig.me/forwarded</td>
<td class="cli_arrow">⇒</td>
<td>2.177.115.178, 216.239.34.21</td>
</tr>
<tr>
<td class="cli_command">$ curl ifconfig.me/all</td>
<td class="cli_arrow">⇒</td>
<td>
ip_addr: 2.177.115.178
<br>
remote_host: unavailable
<br>
user_agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap
Chromium/74.0.3729.131 Chrome/74.0.3729.131 Safari/537.36
<br>
port: 33966
<br>
language: en-US,en;q=0.9,fa-IR;q=0.8,fa;q=0.7,de;q=0.6
<br>
referer:
<br>
connection:
<br>
keep_alive:
<br>
method: GET
<br>
encoding: gzip, deflate, br
<br>
mime:
text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3
<br>
charset:
<br>
via: 1.1 google
<br>
forwarded: 2.177.115.178, 216.239.34.21
<br>
</td>
</tr>
<tr>
<td class="cli_command">$ curl ifconfig.me/all.xml</td>
<td class="cli_arrow">⇒</td>
<td><info>
<ip_addr>2.177.115.178</ip_addr>
<remote_host>unavailable</remote_host>
<user_agent>Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap
Chromium/74.0.3729.131 Chrome/74.0.3729.131 Safari/537.36</user_agent>
<port>33966</port>
<language>en-US,en;q=0.9,fa-IR;q=0.8,fa;q=0.7,de;q=0.6</language>
<referer></referer>
<connection></connection>
<keep_alive></keep_alive>
<method>GET</method>
<encoding>gzip, deflate, br</encoding>
<mime>text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3</mime>
<charset></charset>
<via>1.1 google</via>
<forwarded>2.177.115.178, 216.239.34.21</forwarded>
</info></td>
</tr>
<tr>
<td class="cli_command">$ curl ifconfig.me/all.json</td>
<td class="cli_arrow">⇒</td>
<td>{"ip_addr":"2.177.115.178","remote_host":"unavailable","user_agent":"Mozilla/5.0
(X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap Chromium/74.0.3729.131
Chrome/74.0.3729.131
Safari/537.36","port":33966,"language":"en-US,en;q=0.9,fa-IR;q=0.8,fa;q=0.7,de;q=0.6","method":"GET","encoding":"gzip,
deflate,
br","mime":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3","via":"1.1
google","forwarded":"2.177.115.178, 216.239.34.21"}</td>
</tr>
</table>
</div>
<div id="footer">© 2018 ifconfig.me</div>
</div>
</body>
</html>
Edit 2: The web page I am working with also does not support the Content-Length header.
Answer 0 (score: 1)
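Not quite in the way you envision. You want to instruct the web server to skip certain content based on tags, and although that is possible in theory, it does not happen on regular web pages. (Perhaps with some kind of API, but you are scraping regular web pages.)
There is one thing worth looking at, though. There is something called HTTP range requests: instead of asking for the complete file, you ask for a range of the file. For example, if you know the web page is about 100 KB but the tag you are looking for is in the last 3 KB, you can ask the web server to send you only the last 3 KB.
Whether this works depends on how the web server and the software behind it are set up. With Python's requests library, this means sending a Range header with the request. If the page is generated dynamically, the web server will usually not honor your range request and will send you the whole page instead.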
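As a rough illustration of the range request described above, here is a minimal sketch. The helper names `tail_request_headers` and `fetch_tail` are made up for this example, and `get` is assumed to be `requests.get` (or anything with the same call shape). A `206 Partial Content` status means the server honored the range; a plain `200` means it ignored the header and sent the whole page.

```python
def tail_request_headers(last_n_bytes):
    """Build headers asking for only the final `last_n_bytes` bytes."""
    if last_n_bytes <= 0:
        raise ValueError("last_n_bytes must be positive")
    return {
        # A suffix range: "bytes=-N" means "the last N bytes".
        "Range": f"bytes=-{last_n_bytes}",
        # Ask for an uncompressed body so the range refers to the HTML
        # itself, not to a compressed stream that cannot be decoded
        # starting from the middle.
        "Accept-Encoding": "identity",
    }


def fetch_tail(get, url, last_n_bytes):
    """Request the tail of `url`; return (body_bytes, was_partial).

    `get` is assumed to behave like requests.get.
    """
    response = get(url, headers=tail_request_headers(last_n_bytes), timeout=10)
    response.raise_for_status()
    # 206: the server sent only the requested range.
    # 200: the server ignored the Range header and sent the full page.
    return response.content, response.status_code == 206


# Usage (needs the third-party `requests` package and network access;
# page_url stands for whichever page you are scraping):
#   import requests
#   body, was_partial = fetch_tail(requests.get, page_url, 3 * 1024)
```

Passing `get` in as an argument keeps the sketch easy to try against a stub before pointing it at a real site. If `was_partial` comes back `False`, nothing was saved and the body is simply the full page.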
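(If this does work, I am not sure whether BeautifulSoup can make sense of the fragmentary HTML you will get. But it may well do so, as it is very forgiving!)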
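To get a feel for how forgiving the parser is, here is a small sketch that feeds BeautifulSoup a made-up fragment of the kind a range request might return: it starts in the middle of a tag and contains closing tags whose openings were never downloaded. The fragment and the word in it are invented for illustration; only the `wordbox` structure is taken from the question's code.

```python
from bs4 import BeautifulSoup

# Hypothetical tail of a page: the byte range happened to start in the
# middle of an attribute, so the first tag is cut off.
fragment = (
    'ss="row"></td></tr></table>'
    '<div id="wordbox"><section><header><h3>/sample/</h3></header></section></div>'
    '</body></html>'
)

soup = BeautifulSoup(fragment, "html.parser")
box = soup.find("div", {"id": "wordbox"})

# find() returns None when the target was not inside the downloaded range,
# so check before walking down to the <h3>.
phonetic = box.section.header.h3.text.replace("/", "") if box is not None else None
```

The `None` check matters more here than with a full download: if the guessed range is too small, the target element may be missing or cut in half, and the script should then fall back to fetching the whole page.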