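Unable to parse any data from a web page

Time: 2017-06-02 11:06:32

Tags: python-3.x web-scraping

When I run my script I get neither results nor errors. I am trying to parse the names, or anything else that is easy to get, from that page. I can't figure out what is wrong with my script or how I should approach it. The element in which the data is stored is pasted below. Here is what I tried: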

import requests
from lxml import html

url="http://minitransat.geovoile.com/2015/"
headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36'}

def scrape_info(main_link):
    response = requests.get(main_link,headers=headers).text
    tree = html.fromstring(response)
    for item in tree.xpath("//div[starts-with(@id,'line')]/text()"):
        print(item)

scrape_info(url)

Here is the element:

<div id="line630" class="ARV" alt="1" rel="ffffff|4809850,-431719;3510,-301;-3430,810;2110,-3930;-140,-4500;-1320,-5390;-580,-5160;-810,-6650;-490,-6490;-1710,-7440;-3360,-5570;-3760,-5490;-2860,-5830;-2660,-6510;-2400,-5860;-2030,-5440;-1630,-5360;-990,-5850;-690,-5870;420,-5390;10,-300;330,-5040;570,-6740;300,-2250;640,-2450;-1990,-770;350,-2390;480,-2690;50,-150;760,-3020;1140,-2800;600,-2250;1580,-2620;1150,-1740;1650,-1820;1560,-2030;840,-370;270,120;-510,330;-700,-650;-20,-70;-480,-1840;100,-2580;-40,-4930;-330,-5910;-2190,-6340;-2750,-5670;-2760,-5380;-200,-330;-3250,-5610;-3630,-5670;-3950,-4620;-3340,-3220;-3880,-4950;-3150,-6790;-3290,-6370;-2850,-6790;-2530,-6290;-160,-400;-2550,-6390;-2110,-6460;-2530,-5710;-20,-1800;-170,-840;280,-1580;-590,-7500;-410,-6090;-330,-5530;-560,-5960;-1320,-6590;-1300,-6760;400,-7500;-480,-6800;-1370,-6760;-490,-7290;240,-7260;-3790,-1100;-4660,-960;-4770,-1540;-4780,-1570;-4580,-1330;-300,-60;-4890,-1360;-4960,-1350;-5510,-900;-5400,1000;-5710,400;-5530,1410;-340,90;-5750,630;-5810,650;-5700,930;-5850,550;-5760,-660;-11050,-9610;-6720,-6510;-8380,-6700;-10540,-6540;-9070,-7260;-8340,-5930;-7900,-3970;-6980,-5810;-390,-280;-6960,-5550;-7050,-5140;-8010,-2090;-8250,-4060;-8350,-4790;-8100,-4180;-7140,-3100;-7430,-4430;-7040,-4280;-6730,-4120;-6140,-4270;-6060,-3570;-6650,-4000;-6950,-4290;-6800,-4890;-6430,-3890;-6040,-3520;-6640,-3890;-6530,-3430;-6890,-3770;-6090,-3560;-6540,-4030;-7020,-4180;-6090,-3840;-6650,-4520;-6790,-3920;-6430,-3660;-5070,-2570;-6510,-4180;-7120,-4350;-6120,-3700;-4900,-3030;-4880,-2890;-6070,-4260;-7780,-5480;-7590,-4860;-7840,-5370;-6430,-4510;-6130,-2320;-5480,-1250;-5250,-1250;-5930,-1150;-6480,-1610;-6050,-1560;-5940,-2040;-6730,-2360;-7750,-2020;-8300,-2430;-8060,-2960;-7910,-3070;-7940,-4790;-8710,-6050;-8750,-5100;-8350,-4410;-8290,-5410;-7140,-4120;-7410,-4040;-6680,-5590;-6250,-4030;-6580,-4820;-6420,-4710;-7800,-4520;-7340,-6340;-6870,-6380;-6730,-5530;-6710,-4640;-6080,-5410;-5380,-5760;-
5590,-2020;-6360,-6070;-6280,-4920;-7070,-4770;-6310,-4870;-7100,-5360;-6280,-6170;-7100,150;-7950,1370;-9290,1280;-9160,1890;-8770,1770;-10060,3430;-9440,3090;-9600,3560;-9190,4120;-9280,3830;-8300,2170;-7860,2390;-7820,1780;-6760,2060;-7990,3810;-9100,4440;-8620,2500;-8940,4410;-8360,4660;-7990,4040;-8170,4920;-9310,6130;-9000,4360;-9000,5700;-8200,-3440;-9050,-6640;-9040,-6000;-9340,-7090;-7630,-4570;-15560,-7980;-7520,-5420;-7780,-6030;-8810,-5180;-7640,-5140;-6900,-6140;-6950,-6470;-7490,-5020;-7200,-4090;-6340,-6930;-320,-340;-6850,-6740;-5830,-5870;-4690,-950;-5680,3580;-5110,5170;-5650,5480;-6320,5250;-6310,6130;-5410,6020;-6030,-810;-6100,-3690;-6010,-3830;-6120,-4400;-6370,-4140;-6680,-4400;-5340,-5120;-5030,-5220;-5240,-5430;-4790,-590;-5230,4380;-4770,2020;-5090,2970;-5400,3690;-5010,4530;-5360,4470;-5920,4460;-5030,4080;-5020,5390;-5420,5140;-6570,5180;-6290,1540;-7120,-3840;-7060,-3340;-6240,-3170;-230,-140;-6750,-3470;-6470,-3150;-6200,-3030;-5420,-1880;-5190,-2990;-4920,-2940;-5270,-4450;-5420,-2490;-5050,-1510;-4900,-1720;-4910,-1820;-5400,-1350;-4920,-1470;-4970,-1430;-5520,-2650;-5140,-4030;-5450,-3200;-5150,-4510;-5550,-4280;-5370,-3610;-5520,-3580;-5570,-1860;-6040,-3100;-5440,-3950;-5970,-4380;-6660,-3750;-5650,-4060;-6080,-4030;-6400,-4180;-5540,-4390;-4650,-3900;-4460,-3890;-4770,-4990;-4290,-4290;-3950,-5050;-4060,-1530;-5440,1560;-5790,1970;-5460,1770;-5140,1540;-5260,1990;-5090,2120;-4660,2470;-5290,2750;-5260,2840;-5870,3820;-5490,3530;-5820,4760;-5650,4780;-6310,4240;-6520,3620;-7150,3760;-7530,3760;-6890,2420;-6970,2590;-6940,2710;-7160,2470;-6600,1160;-6240,-1300;-6390,-1170;-6980,-1550;-7720,-2050;-7980,-2640;-8560,-1830;-8520,-2310;-7600,-2390;-6840,-2540;-6890,-1850;-6550,-1430;-5690,-580;-4960,-630;-5210,630;-4770,1320;-4860,1610;-5050,1440;-5490,1660;-5760,1590;-6040,2070;-5890,2250;-5280,1430;-5440,880;-4790,680;-3250,0;-1590,-4980;-1110,-5390;-1930,-6600;-1480,-6630;-1960,-6960;-2480,-6940;-2750,-6180;-3740,-7150;-3750,-6120;-39
40,-6330;-4060,-6790;-4780,-7240;-5480,-6580;-5360,-6950;-5170,-7700;-4990,-7090;-2440,-3795;-2440,-3795;-4600,-7440;-4800,-7010;-4590,-6900;-5120,-6550;-7080,-390;-7740,760;-8000,40;-7990,-110;-9250,-100;-9330,-90;-6260,-2960;-2230,-3880;-1800,-3720;-2170,-4570;-1970,-4370;-2950,-1740;-3940,-750;-4090,-490;-2840,-2390;-2370,-3920;-1990,-3090;-3490,-270;-3580,70;-2730,-1770;-1790,-3740;-1920,-4270;-2020,-4560;-1530,-3640;670,-650;-50,-90;-410,-250;0,0;-10,0;0,-10;0,10;10,-10;-10,0;0,0;10,0;0,0;-10,10;10,0;0,0;-10,-10;0,10;10,0;0,0;-10,0;10,0;-10,0;0,-10;0,0;10,0;0,10;0,0;0,-10;0,0;0,10;0,0;0,-10;0,10;-10,-20,1680|">
  <q>9</q>
  <p>Nicolas D'estais</p>
  <p>Librairie Cheminant</p>
  <p>Arrivé le 27/09/2015 à 04:44:44 UTC</p>
  <p>En 7j 15h 14min 44s</p>
  <dfn class="hulls1">
    <dt>2015/09/27T14:27:00Z</dt>
    <dl>28.96470;-13.53840</dl>
    <dd>243</dd>
  </dfn>
</div>

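1 answer:

Answer 0 (score: 0)

The script itself is correct, but the data you receive from the given URL does not contain any div tag whose id starts with "line". In other words, you are searching for something that is not there. Here is the output from your URL: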

   <!doctype html>
  <html>
    <head>
      <meta charset="utf-8" />
      <title>Mini Transat 2015</title>
      <meta name="apple-mobile-web-app-capable" content="yes">
      <meta name="viewport" content="     initial-scale=1.0,width=device-width,user-scalable=0,minimum-scale=1.0,     maximum-scale=1.0">
      <link rel="stylesheet" type="text/css" href="/2015/_elements/default/css/     reset.css?v=63583546506" />
  <link rel="stylesheet" type="text/css" href="/2015/home/main.css?v=63580587082" />
  <link rel="stylesheet" type="text/css" href="/2015/_elements/custom/      home.css?v=63578166019" />
  <script type="text/javascript" language="javascript" src="/2015/_elements/plugins/      jquery/jquery.1.12.2.js?v=63595118555"></script>
  <script type="text/javascript" language="javascript" src="/2015/home/     main.js?v=63577584956"></script>

      <script>
        $(window).load(function () {
          GeovoileHomePage.chrono.start("{0}<i>j</i> {1}<i>h</i> {2}<i>min</i> {3}<i>     s</i>", "2015/10/31 14:10:00 GMT+0", "2015/10/31 14:10:00 GMT+0");
        });
      </script>
    </head>
    <body>
      <header>
        <a id="logo" href="http://www.minitransat-ilesdeguadeloupe.fr/" target="_blank      "></a>
        <div id="chrono"></div>
        <div id="menu"><tt><span></span></tt><nav>
  <ul id="legs"><li><a href="?leg=1" rel="1"><big>Étape 1</big><small>E 1</small></a>     </li><li class="on"><a href="?leg=2" rel="2"><big>Étape 2</big><small>E 2</small></a    ></li></ul>
  <ul class="pages" rel="1"><li rel="tracker_1"><a href="tracker/?leg=1">Cartographie     </a></li><li rel="mainboard_1"><a href="widgets/mainboard/?leg=1">Classements</a></     li><li rel="graphics_1"><a href="widgets/graphics/?leg=1">Graphiques</a></li><li rel    ="statistics_1"><a href="widgets/statistics/?leg=1">Statistiques</a></li><li rel="     general_1"><a href="widgets/general/?leg=1">Général</a></li></ul>
  <ul class="pages on" rel="2"><li rel="tracker_2"><a href="tracker/?leg=2">      Cartographie</a></li><li rel="mainboard_2"><a href="widgets/mainboard/?leg=2">      Classements</a></li><li rel="graphics_2"><a href="widgets/graphics/?leg=2">     Graphiques</a></li><li rel="statistics_2"><a href="widgets/statistics/?leg=2">    Statistiques</a></li><li rel="general_2"><a href="widgets/general/?leg=2">Général</a    ></li></ul></nav></div>
      </header>
      <main><iframe id="tracker_1" src="_blank.html"></iframe><iframe id="mainboard_1"       src="_blank.html"></iframe><iframe id="graphics_1" src="_blank.html"></iframe><      iframe id="statistics_1" src="_blank.html"></iframe><iframe id="general_1" src="      _blank.html"></iframe><iframe id="tracker_2" src="_blank.html"></iframe><iframe       id="mainboard_2" src="_blank.html"></iframe><iframe id="graphics_2" src="     _blank.html"></iframe><iframe id="statistics_2" src="_blank.html"></iframe><    iframe id="general_2" src="_blank.html"></iframe></main>
      <footer>Mises à jour :  01h30  UTC • 03h00  UTC • 05h00  UTC • 08h00  UTC •       11h00  UTC • 14h00  UTC • 17h00  UTC • 19h00  UTC • 19h30  UTC • 20h00  UTC •       23h00  UTC<div id="time">{3}h{4} UTC|0</div></footer>
      <div id="finistere"></div><div id="bretagne"></div><a id="materne" href="http://      www.materne.fr/" target="_blank" title="www.materne.fr"></a><div id="minitransat      ">Douarnenez ► Lanzarote ► Pointe-à-Pitre</div>
    </body>
    <!--TRL_OK-->
  </html>
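A quick way to confirm this without reading the whole response by eye is to summarize it programmatically. The following is a minimal sketch that uses only the standard library's `html.parser` rather than lxml; the names `ResponseSummary` and `summarize` are hypothetical helpers, and the sample string is a trimmed-down stand-in for the response above. It counts the `div` tags whose id starts with a given prefix and lists every `iframe` with its `src`.

```python
from html.parser import HTMLParser


class ResponseSummary(HTMLParser):
    """Collect a few facts about an HTML document (hypothetical helper)."""

    def __init__(self, id_prefix):
        super().__init__()
        self.id_prefix = id_prefix
        self.matching_div_ids = []  # ids of <div> tags that start with id_prefix
        self.iframe_sources = []    # (id, src) pair for every <iframe>

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # Attribute values can be None (valueless attributes), hence the "or".
        if tag == "div" and (attributes.get("id") or "").startswith(self.id_prefix):
            self.matching_div_ids.append(attributes["id"])
        elif tag == "iframe":
            self.iframe_sources.append((attributes.get("id"), attributes.get("src")))


def summarize(html_text, id_prefix="line"):
    """Return (matching div ids, iframe (id, src) pairs) for html_text."""
    parser = ResponseSummary(id_prefix)
    parser.feed(html_text)
    parser.close()
    return parser.matching_div_ids, parser.iframe_sources


# Trimmed-down sample modelled on the response shown above.
sample = """<!doctype html>
<html><body>
<main><iframe id="tracker_1" src="_blank.html"></iframe><iframe id="mainboard_1" src="_blank.html"></iframe></main>
<footer><div id="time">{3}h{4} UTC|0</div></footer>
<div id="minitransat">Douarnenez</div>
</body></html>"""

div_ids, iframes = summarize(sample)
print("matching divs:", div_ids)
print("iframes:", iframes)
```

To run it against the real page, pass `requests.get(url, headers=headers).text` to `summarize`. An empty list of matching divs together with iframes that all point at `_blank.html` would suggest (this is an assumption, not something verified here) that the line elements are filled in later by the page's JavaScript and are therefore absent from the raw response.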

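I tried editing your script to check whether the code actually works. Below is the modified code, which does produce some output:

    import requests
    from lxml import html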

    url="http://minitransat.geovoile.com/2015/"
    headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36'}

    def scrape_info(main_link):
        response = requests.get(main_link,headers=headers).text
        tree = html.fromstring(response)
        # Print the direct text of every div in the response
        print(tree.xpath("//div/text()"))

    scrape_info(url)

The result is a single short list, so there is no need to iterate over it. Here is the program's output:

['{3}h{4} UTC|0', 'Douarnenez ► Lanzarote ► Pointe-à-Pitre']

I hope this helps. In cases like this, I suggest inspecting the data you actually receive before writing the XPath.
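One more point is worth checking once a response that really contains the line elements is available. In the element from the question, the visible values sit in child tags (`<q>`, `<p>`), so selecting `text()` directly on the div would return only whitespace. The sketch below illustrates this on a shortened copy of that element, using the standard library's `xml.etree.ElementTree` instead of lxml; `extract_names` is a hypothetical helper, and treating the first `<p>` as the name is an assumption based on the pasted element. With lxml, an XPath such as `//div[starts-with(@id,'line')]/p[1]/text()` should select the same text.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the element from the question; the long rel attribute
# is omitted so the fragment stays readable and well-formed.
fragment = """<div id="line630" class="ARV" alt="1">
  <q>9</q>
  <p>Nicolas D'estais</p>
  <p>Librairie Cheminant</p>
  <p>Arrivé le 27/09/2015 à 04:44:44 UTC</p>
  <p>En 7j 15h 14min 44s</p>
</div>"""


def extract_names(root, id_prefix="line"):
    """Return (div id, text of first <p>) for divs whose id starts with id_prefix.

    Assumption: the first <p> child holds the name, as in the pasted element.
    """
    names = []
    for div in root.iter("div"):
        if not (div.get("id") or "").startswith(id_prefix):
            continue
        first_paragraph = div.find("p")
        if first_paragraph is not None:
            names.append((div.get("id"), first_paragraph.text))
    return names


root = ET.fromstring(fragment)
# The div's own text is just the whitespace before <q>, which is why
# text() taken directly on the div yields nothing useful.
print(repr(root.text))
print(extract_names(root))
```

Because lxml elements expose the same `iter`, `find`, `get`, and `text` interface, the helper should also work on a tree returned by `html.fromstring`.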