Strange XPath behavior. Python HTML parsing

Time: 2017-07-07 13:33:33

Tags: python html xpath

I have an HTML page, which I quote at the end of this question. I prepare the page like this:

    page=page.content
    res = html.fromstring(page)

I run an XPath query against it:

    list_of_names = res.xpath(u'//li/a/text()')

But it does not list the names under the div with id "rosterlists".

When I do this instead:

    list_of_names = res.xpath(u"//div[@id='rosterlists']/div/li/a/text()")

In the browser this gives me what I want: the list of names. But in Python I get:

    ['A', 'B', 'C', 'D', 'E', 'F']

What is wrong? Is the HTML broken? If so, how do I fix it?

The same XPath works for all the other pages (hundreds and thousands of them) on the same machine.

Please do not recommend Beautiful Soup to me. That module is not suitable for this project, under any circumstances.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head profile="http://www.w3.org/2005/10/profile">
<link rel="icon" type="image/ico" href="/favicon.ico" />
<title>Roster |  Primary Talent International</title>
<meta name="description" content="Roster -  Primary Talent International" />
<meta name="keywords" content="" />
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta http-equiv="content-language" content="en" />
<meta name="language" content="en" />
<meta name="robots" content="index, follow" />
<meta name="author" content="Inogen Web Design Nottingham" />

<link href="/css/style.css" rel="stylesheet" type="text/css" media="screen"/>
<link href="/css/scrollerstyle.css" rel="stylesheet" type="text/css" media="screen"/>

<link href="/css/roster.css" rel="stylesheet" type="text/css" media="screen"/>

<script type="text/javascript" src="/scripts/ddwindowlinks.js"></script>
<script type="text/javascript" src="/scripts/scroller-settings.js"></script>


<script type="text/javascript">

//<![CDATA[

var sglm=new Array();

sglm[0]='<a href="/news/jul2017#wolf-alice-visions-of-a-life-album">Wolf Alice: &#039;Visions Of A Life&#039; Album</a>';
sglm[1]='<a href="/news/jul2017#zombie-nation-new-video-for-knockout">Zombie Nation: New Video For &#039;Knockout&#039;</a>';
sglm[2]='<a href="/news/jul2017#sextile-albeit-living-review">Sextile: &#039;Albeit Living&#039; Review</a>';
sglm[3]='<a href="/news/jul2017#noisia-at-glastonbury-festival-2017">Noisia: At Glastonbury Festival 2017</a>';
sglm[4]='<a href="/news/jul2017#joe-ford-new-track-make-a-threat-ft-maluk">Joe Ford: New Track &#039;Make A Threat&#039; Ft. Maluk</a>';
sglm[5]='<a href="/news/jul2017#moscoman-obscure-cuts-on-xlr8r">Moscoman: &#039;Obscure Cuts&#039; On XLR8R</a>';
sglm[6]='<a href="/news/jul2017#james-welsh-new-thread-north-ep">James Welsh: New &#039;Thread/North&#039; EP</a>';
sglm[7]='<a href="/news/jul2017#steve-lamacq-going-deaf-for-a-living-tour">Steve Lamacq: &#039;Going Deaf For A Living&#039; Tour</a>';
sglm[8]='<a href="/news/jul2017#moscoman-remixes-cristobal-and-the-sea">Moscoman: Remixes Cristobal &amp; The Sea</a>';
sglm[9]='<a href="/news/jul2017#nadine-khouri-at-the-lexington-london">Nadine Khouri: At The Lexington, London</a>';
sglm[10]='<a href="/news/jul2017#faze-miyake-new-infamous-ep">Faze Miyake: New &#039;Infamous&#039; EP</a>';
sglm[11]='<a href="/news/jul2017#noisia-beyond-the-outer-edges-featured-in-skiddle">Noisia: &#039;Beyond The Outer Edges&#039; Featured In Skiddle</a>';

//]]>

</script>


<script type="text/javascript">

  var _gaq = _gaq || [];
  _gaq.push(['_setAccount', 'UA-17266356-2']);
  _gaq.push(['_trackPageview']);

  (function() {
    var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
    var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
  })();

</script>



</head>


<body onload="startbcscroll();">
<div id="wrapper">

<div id="masthead">

<a href="/"><img src="/images/primary-talent-international.png" alt="Primary Talent International" width="330" height="40" id="logo" /></a>



<form action="/search/" method="get">
<fieldset id="search">
<input type="text" name="find" />
<input type="submit" value="search" class="submit" />
</fieldset>
</form>


<div id="masthead-images">
<a href="/counterfeit/"><img src="/artists/counterfeit/images/127x127/1.jpg" width="127" height="127" alt="Counterfeit" title="Counterfeit" /></a>
<a href="/fredo/"><img src="/artists/fredo/images/127x127/1.jpg" width="127" height="127" alt="Fredo" title="Fredo" /></a>
<a href="/shabazz-palaces/"><img src="/artists/shabazz-palaces/images/127x127/1.jpg" width="127" height="127" alt="Shabazz Palaces" title="Shabazz Palaces" /></a>
<a href="/tom-demac/"><img src="/artists/tom-demac/images/127x127/1.jpg" width="127" height="127" alt="Tom Demac" title="Tom Demac" /></a>
<a href="/kwamz-and-flava/"><img src="/artists/kwamz-and-flava/images/127x127/1.jpg" width="127" height="127" alt="Kwamz & Flava" title="Kwamz & Flava" /></a>
<a href="/jelani-blackman/"><img src="/artists/jelani-blackman/images/127x127/1.jpg" width="127" height="127" alt="Jelani Blackman" title="Jelani Blackman" /></a>
<a href="/blinkie/"><img src="/artists/blinkie/images/127x127/1.jpg" width="127" height="127" alt="Blinkie" title="Blinkie" /></a>
</div>

<div id="ticker">
<script type="text/javascript" src="/scripts/scroller.js"></script>
</div>

<div id="menu">
<div id="topmenu">

<ul>

<li class="open"><a href="/roster/">LIVE ROSTER</a></li>
<li><a href="/dj-roster/">DJ ROSTER</a><li>

<li><a href="/news/">NEWS</a><li>
<li><a href="/on-tour/">ON TOUR</a></li>
<li><a href="/about-us/">ABOUT US</a></li>
<li><a href="/new-signings/">NEW SIGNINGS</a></li>

</ul>

</div></div>

</div>





<div id="content">

<div id="rosterlists">

<div>

<ul>
</ul>
</div>
<div>
</ul>
<li><a href="/sandy-alex-g/" />(Sandy) Alex G</a></li>
<li><a href="/andyouwillknowusbythetrailofdead/" />...And You Will Know Us By The Trail Of Dead</a></li>
<li><a href="/2shy/" />2Shy</a></li>
<li><a href="/808ink/" />808INK</a></li>
<li>&nbsp;</li>
<li class="initial"><a href="/roster/a/">A</a></li>
<li><a href="/a-tribe-called-red/" />A Tribe Called Red</a></li>
<li><a href="/abattoir-blues/" />Abattoir Blues</a></li>
<li><a href="/aeroplane/" />Aeroplane</a></li>
<li><a href="/agar-agar/" />Agar Agar</a></li>
<li><a href="/airways/" />Airways</a></li>
<li><a href="/alaskalaska/" />ALASKALASKA</a></li>
<li><a href="/alex-izenberg/" />Alex Izenberg</a></li>
<li><a href="/all-get-out/" />All Get Out</a></li>
<li><a href="/all-the-people/" />All The People</a></li>
<li><a href="/allison-weiss/" />Allison Weiss</a></li>
<li><a href="/alpines/" />Alpines</a></li>
<li><a href="/alt-j/" />Alt-J</a></li>
<li><a href="/alvvays/" />Alvvays</a></li>
<li><a href="/ama-lou/" />Ama Lou</a></li>
<li><a href="/amaroun/" />Amaroun</a></li>
<li><a href="/andrea/" />AndreaLo</a></li>
<li><a href="/andy-cooper/" />Andy Cooper (Ugly Duckling)</a></li>
<li><a href="/anna-calvi/" />Anna Calvi</a></li>
<li><a href="/anteros/" />Anteros</a></li>
<li><a href="/apes-and-horses/" />Apes & Horses</a></li>
<li><a href="/ara/" />ArA Harmonic</a></li>
<li><a href="/araabmuzik/" />Araabmuzik</a></li>
<li><a href="/archive/" />Archive</a></li>
<li><a href="/aristophanes/" />Aristophanes</a></li>
<li><a href="/ash-koosha/" />Ash Koosha</a></li>
<li><a href="/atlas-genius/" />Atlas Genius</a></li>
<li><a href="/augustines/" />Augustines</a></li>
</ul>
</div>
<div>
</ul>
</ul>
</div>
<div>
</ul>
<li><a href="/avelino/" />Avelino</a></li>
<li><a href="/awate/" />Awate</a></li>
<li><a href="/azad/" />Azad</a></li>
<li><a href="/azusena/" />Azusena</a></li>
<li>&nbsp;</li>
<li class="initial"><a href="/roster/b/">B</a></li>
<li><a href="/baba-naga/" />Baba Naga</a></li>
<li><a href="/babeheaven/" />Babeheaven</a></li>
<li><a href="/babyshambles/" />Babyshambles</a></li>
<li><a href="/bad-gyal/" />Bad Gyal</a></li>
<li><a href="/bad-kid/" />Bad Kid</a></li>
<li><a href="/bad-nerves/" />Bad Nerves</a></li>
<li><a href="/bad-pop/" />Bad Pop</a></li>
<li><a href="/bad-sounds/" />Bad Sounds</a></li>
<li><a href="/banners/" />Banners</a></li>
<li><a href="/basement-jaxx/" />Basement Jaxx</a></li>
<li><a href="/bash-and-pop/" />Bash & Pop</a></li>
<li><a href="/bay/" />BAY</a></li>
<li><a href="/bayside/" />Bayside</a></li>
<li><a href="/be-charlotte/" />Be Charlotte</a></li>
<li><a href="/beach-baby/" />Beach Baby</a></li>
<li><a href="/beach-slang/" />Beach Slang</a></li>
<li><a href="/beardyman/" />Beardyman</a></li>
<li><a href="/bellevue-days/" />Bellevue Days</a></li>
<li><a href="/ben-hobbs/" />Ben Hobbs</a></li>
<li><a href="/ben-khan/" />Ben Khan</a></li>
<li><a href="/ben-watt/" />Ben Watt</a></li>
<li><a href="/benny-mails/" />Benny Mails</a></li>
<li><a href="/bettens/" />Bettens</a></li>
<li><a href="/big-ups/" />Big Ups</a></li>
<li><a href="/bipolar-sunshine/" />Bipolar Sunshine</a></li>
<li><a href="/birds-of-tokyo/" />Birds Of Tokyo</a></li>
<li><a href="/blaenavon/" />Blaenavon</a></li>
</ul>
</div>
<div>
</ul>
</ul>
</div>
<div>
</ul>
<li><a href="/bloodhound-gang/" />Bloodhound Gang</a></li>
<li><a href="/bloxx/" />BLOXX</a></li>
<li><a href="/blue-daisy/" />Blue Daisy</a></li>
<li><a href="/bowling-for-soup/" />Bowling For Soup</a></li>
<li><a href="/boys-noize/" />Boys Noize</a></li>
<li><a href="/broadway-sounds/" />Broadway Sounds</a></li>
<li><a href="/brooke-candy/" />Brooke Candy</a></li>
<li><a href="/bryde/" />Bryde</a></li>
<li><a href="/buraka-som-sistema/" />Buraka Som Sistema</a></li>
<li>&nbsp;</li>
<li class="initial"><a href="/roster/c/">C</a></li>
<li><a href="/cadet/" />Cadet</a></li>
<li><a href="/cant-swim/" />Can't Swim</a></li>
<li><a href="/candy-hearts/" />Candy Hearts</a></li>
<li><a href="/cardiknox/" />Cardiknox</a></li>
<li><a href="/carmody/" />Carmody</a></li>
<li><a href="/catfish-and-the-bottlemen/" />Catfish and the Bottlemen</a></li>
<li><a href="/cattle-and-cane/" />Cattle &amp; Cane</a></li>
<li><a href="/central-cee/" />Central Cee</a></li>
<li><a href="/cerrone/" />Cerrone</a></li>
<li><a href="/chairlift/" />Chairlift</a></li>
<li><a href="/champs/" />Champs</a></li>
<li><a href="/charlotte-oc/" />Charlotte OC</a></li>
<li><a href="/charly-bliss/" />Charly Bliss</a></li>
<li><a href="/children-collide/" />Children Collide</a></li>
<li><a href="/cigarettes-after-sex/" />Cigarettes After Sex</a></li>
<li><a href="/circawaves/" />Circa Waves</a></li>
<li><a href="/clairy-browne/" />Clairy Browne</a></li>
<li><a href="/clean-spill/" />Clean Spill</a></li>
<li><a href="/coco/" />Coco</a></li>
<li><a href="/cold-specks/" />Cold Specks</a></li>
<li><a href="/cole/" />Cole</a></li>
<li><a href="/connan-mockasin/" />Connan Mockasin</a></li>
</ul>
</div>
<div>
</ul>
</ul>
</div>
<div>
</ul>
<li><a href="/cosmo-pyke/" />Cosmo Pyke</a></li>
<li><a href="/count-counsellor/" />Count Counsellor</a></li>
<li><a href="/counterfeit/" />Counterfeit</a></li>
<li><a href="/crossfaith/" />Crossfaith</a></li>
<li><a href="/crows/" />Crows</a></li>
<li><a href="/cuckoolander/" />CuckooLander</a></li>
<li>&nbsp;</li>
<li class="initial"><a href="/roster/d/">D</a></li>
<li><a href="/d-double-e/" />D Double E</a></li>
<li><a href="/damfunk/" />D&#257;M-FunK</a></li>
<li><a href="/daft-punk/" />Daft Punk</a></li>
<li><a href="/daisy-victoria/" />Daisy Victoria</a></li>
<li><a href="/daniel-og/" />Daniel OG</a></li>
<li><a href="/darkstar/" />Darkstar</a></li>
<li><a href="/darlia/" />Darlia</a></li>
<li><a href="/dave/" />Dave</a></li>
<li><a href="/day-wave/" />Day Wave</a></li>
<li><a href="/decade/" />Decade</a></li>
<li><a href="/delta-rae/" />Delta Rae</a></li>
<li><a href="/denzel-himself/" />Denzel Himself</a></li>
<li><a href="/desert-planes/" />Desert Planes</a></li>
<li><a href="/digable-planets/" />Digable Planets </a></li>
<li><a href="/digitalism/" />Digitalism</a></li>
<li><a href="/DIIV/" />DIIV</a></li>
<li><a href="/dilly-dally/" />Dilly Dally</a></li>
<li><a href="/diztortion/" />Diztortion</a></li>
<li><a href="/dizzee-rascal/" />Dizzee Rascal</a></li>
<li><a href="/dj-cassidy/" />DJ Cassidy</a></li>
<li><a href="/django-django/" />Django Django</a></li>
<li><a href="/dmas/" />DMA's</a></li>
<li><a href="/dominique-young-unique/" />Dominique Young Unique</a></li>
<li><a href="/dropkick-murphys/" />Dropkick Murphys</a></li>
<li><a href="/drowners/" />Drowners</a></li>
</ul>
</div>
<div>
</ul>
</ul>
</div>
<div>
</ul>
<li><a href="/dub-pistols/" />Dub Pistols</a></li>
<li><a href="/dutch-mob/" />Dutch Mob</a></li>
<li><a href="/zappa-plays-zappa/" />Dweezil Zappa</a></li>
<li>&nbsp;</li>
<li class="initial"><a href="/roster/e/">E</a></li>
<li><a href="/eat-fast/" />Eat Fast</a></li>
<li><a href="/eera/" />EERA</a></li>
<li><a href="/emily-capell/" />Emily Capell</a></li>
<li><a href="/emmy-the-great/" />Emmy The Great</a></li>
<li><a href="/eprom/" />Eprom</a></li>
<li><a href="/esther-joy/" />Esther Joy</a></li>
<li><a href="/etienne-de-crecy-presents-super-discount/" />Etienne de Cr&#233;cy Presents Super Discount 3</a></li>
<li>&nbsp;</li>
<li class="initial"><a href="/roster/f/">f</a></li>
<li><a href="/franskild-live/" />f r a n s k i l d (Live)</a></li>
<li><a href="/fang-night/" />Fang Night</a></li>
<li><a href="/fangclub/" />Fangclub</a></li>
<li><a href="/felix-riebl/" />Felix Riebl</a></li>
<li><a href="/fine-print/" />Fine Print</a></li>
<li><a href="/first-hate/" />First Hate</a></li>
<li><a href="/forever-came-calling/" />Forever Came Calling</a></li>
<li><a href="/fours/" />FOURS</a></li>
<li><a href="/foxygen/" />Foxygen</a></li>
<li><a href="/freak/" />FREAK</a></li>
<li><a href="/fredo/" />Fredo</a></li>
<li><a href="/fun-lovin-criminals/" />Fun Lovin' Criminals</a></li>

</ul>

</div>

<div class="clear">&nbsp;</div>

</div>





<div id="alphabetmenu">
<ul>


<li class="active"><a href="/roster/">#<a></li>
<li class="active"><a href="/roster/a/">A</a></li>
<li class="active"><a href="/roster/b/">B</a></li>
<li class="active"><a href="/roster/c/">C</a></li>
<li class="active"><a href="/roster/d/">D</a></li>
<li class="active"><a href="/roster/e/">E</a></li>
<li class="active"><a href="/roster/f/">F</a></li>
<li class="active"><a href="/roster/g/">G</a></li>
<li><a href="/roster/h/">H</a></li>
<li><a href="/roster/i/">I</a></li>
<li><a href="/roster/j/">J</a></li>
<li><a href="/roster/k/">K</a></li>
<li><a href="/roster/l/">L</a></li>
<li><a href="/roster/m/">M</a></li>
<li><a href="/roster/n/">N</a></li>
<li><a href="/roster/o/">O</a></li>
<li><a href="/roster/p/">P</a></li>
<li><a href="/roster/q/">Q</a></li>
<li><a href="/roster/r/">R</a></li>
<li><a href="/roster/s/">S</a></li>
<li><a href="/roster/t/">T</a></li>
<li><a href="/roster/u/">U</a></li>
<li><a href="/roster/v/">V</a></li>
<li><a href="/roster/w/">W</a></li>
<li><a href="/roster/x/">X</a></li>
<li><a href="/roster/y/">Y</a></li>
<li><a href="/roster/z/">Z</a></li>

</ul>

</div>
<div class="clear">&nbsp;</div>

<div id="agentlist">

<h1>Contact</h1>

<ul>
<li><a href="http://decked-out.co.uk/alessia-avallone/">Alessia Avallone</a></li>
<li><a href="/andy-duggan/">Andy Duggan</a></li>
<li><a href="/andy-woolliscroft/">Andy Woolliscroft</a></li>
<li><a href="/ben-winchester/">Ben Winchester</a></li>
<li><a href="/charlie-renton/">Charlie Renton</a></li>
<li><a href="/chris-smyth/">Chris Smyth</a></li>
<li><a href="/cils-fyne-williams/">Cils Fyne-Williams</a></li>
<li><a href="/claire-reilly/">Claire Reilly</a></li>
<li><a href="/craig-dsouza/">Craig D'Souza</a></li>
<li><a href="/dave-chumbley/">Dave Chumbley</a></li>
<li><a href="/ed-sellers/">Ed Sellers</a></li>
<li><a href="/eileen-mulligan/">Eileen Mulligan</a></li>
<li><a href="/ellen-trickey/">Ellen Trickey</a></li>
<li><a href="http://decked-out.co.uk/faye-adams/">Faye Adams</a></li>
<li><a href="/francesco-caccamo/">Francesco Caccamo</a></li>
<li><a href="/jack-herron/">Jack Herron</a></li>
<li><a href="/kata-farkas/">Kata Farkas</a></li>
<li><a href="http://decked-out.co.uk/laetitia-descouens/">Laetitia Descouens</a></li>
<li><a href="http://decked-out.co.uk/lucinda-runham/">Lucinda Runham</a></li>
<li><a href="/martin-hopewell/">Martin Hopewell</a></li>
<li><a href="/martin-mackay/">Martin Mackay</a></li>
<li><a href="http://decked-out.co.uk/martje-kremers/">Martje Kremers</a></li>
<li><a href="/matt-bates/">Matt Bates</a></li>
<li><a href="/matt-pickering-copley/">Matt Pickering-Copley</a></li>
<li><a href="/moshope-osinibe/">Moshope Osinibi </a></li>
<li><a href="/nick-holroyd/">Nick Holroyd</a></li>
<li><a href="/nick-reddick/">Nick Reddick</a></li>
<li><a href="/paul-mcqueen/">Paul McQueen</a></li>
<li><a href="/peter-elliott/">Peter Elliott</a></li>
<li><a href="/sally-gavaghan/">Sally Gavaghan</a></li>
<li><a href="/scarlet-millar/">Scarlet Millar</a></li>
<li><a href="/serena-parsons/">Serena Parsons</a></li>
<li><a href="/stacey-owen/">Stacey Owen</a></li>
<li><a href="/steve-backman/">Steve Backman</a></li>
<li><a href="/tabbie-burleton/">Tabbie Burleton</a></li>
<li><a href="/tom-permaul-baker/">Tom Permaul-Baker</a></li>
<li><a href="/tracey-roper/">Tracey Roper</a></li>
<li><a href="/wesley-doogan/">Wesley Doogan</a></li>
<li><a href="/will-marshall/">Will Marshall</a></li>
</ul>

</div>



</div>



<div class="clear">&nbsp;</div>


<div id="footer">
<ul>
<li>&copy; 2017 Primary Talent International</li><li>|<a href="/tncs-of-use/">Terms &amp; Conditions of Use</a></li><li>| <a href="/privacy/">Privacy Policy</a></li><li>|<a href="/terms-of-business/">Terms of Business</a></li><li>|<a href="https://primarytalent.com">Contact Blog</a></li>
</ul>
</div>


</div>

</body>
</html>

1 answer:

Answer 0 (score: 1)

This snippet works. The text_content() method returns the clean text of an element, including the text of any elements it contains.

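    from lxml import html
    import requests

    # Fetch the roster page and parse it with lxml's HTML parser.
    req = requests.get('http://primarytalent.com/roster/')
    tree = html.fromstring(req.content)
    # text_content() on each <li> collects all of the text inside it.
    list_of_names = [li.text_content() for li in tree.xpath("//*[@id='rosterlists']/div/li")]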
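
A possible explanation for the ['A', 'B', ...] result: in the quoted page most roster links are written as <a href="..." />Name</a>. lxml's HTML parser (libxml2) appears to treat the "/>" as closing the element immediately, so the name ends up after the empty <a> rather than inside it, and li/a/text() only matches the letter headings, which use a normal <a>...</a>. A browser ignores the "/>" on an <a> tag, which would explain the difference. The sketch below is a minimal illustration under that assumption. SAMPLE is a hypothetical fragment modelled on the quoted page, and roster_names is a helper name invented here, not part of lxml. It reads the text from each <li>, so it should not depend on where the parser puts the name, and it also skips the letter headings and the &nbsp; spacer items.

```python
from lxml import html

# Hypothetical fragment modelled on the roster markup quoted in the question,
# including the stray </ul> and the <a ... />Name</a> links.
SAMPLE = """
<div id="rosterlists">
<div>
</ul>
<li><a href="/2shy/" />2Shy</a></li>
<li>&nbsp;</li>
<li class="initial"><a href="/roster/a/">A</a></li>
<li><a href="/aeroplane/" />Aeroplane</a></li>
</ul>
</div>
</div>
"""


def roster_names(tree):
    """Return artist names from the roster list (illustrative helper)."""
    names = []
    for li in tree.xpath("//div[@id='rosterlists']/div/li"):
        if li.get("class") == "initial":
            continue  # skip the letter headings (A, B, C, ...)
        # text_content() gathers all text inside the <li>;
        # strip() also removes the non-breaking space of the spacer items.
        text = li.text_content().strip()
        if text:
            names.append(text)
    return names


tree = html.fromstring(SAMPLE)

# Diagnostic: shows whether the name is stored inside the link (.text)
# or after it (.tail) with the parser version in use.
first_link = tree.xpath("//div[@id='rosterlists']/div/li/a")[0]
print(repr(first_link.text), repr(first_link.tail))

print(roster_names(tree))
```

The printed text/tail pair of the first link shows where your lxml build puts the name. The exact values may vary with the libxml2 version, which is why the helper works from the <li> instead of the <a>.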