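How can I write a loop when there is no container or group to select for the items I need (normally each item sits in its own group)? I want to parse the title, date and author from the pasted elements. The three values I'm after don't belong to any particular group or container, so I can't find the right way to loop over them.
Here are the elements: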
html = '''
<div class="view-content">
<p class="text-large experts-more-h">
<a href="/publications/commentary/we-have-no-idea-universal-preschool-actually-helps-kids">We Have No Idea if Universal Preschool Actually Helps Kids</a>
</p>
<p class="text-sans">
By David J. Armor. Washington Post. <span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="2014-10-21T09:34:00-04:00">October 21, 2014</span>.
</p>
<p class="text-large experts-more-h">
<a href="/publications/commentary/last-parent-resistance-collective-standardized-tests">At Last, Parent Resistance to Collective Standardized Tests</a>
</p>
<p class="text-sans">
By Nat Hentoff. Cato.org. <span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="2014-01-15T09:57:00-05:00">January 15, 2014</span>.
</p>
<p class="text-sans">
By Darcy Ann Olsen and Eric Olsen. Cato.org. <span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="1999-04-15T00:00:00-04:00">April 15, 1999</span>.
</p>
<p class="text-large experts-more-h">
<a href="/publications/commentary/day-care-parents-versus-professional-advocates-0">Day Care: Parents versus Professional Advocates</a>
</p>
<p class="text-sans">
By Darcy Ann Olsen. Cato.org. <span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="1998-06-01T00:00:00-04:00">June 1, 1998</span>.
</p>
</div>
'''
If you run my script, you can see that it scrapes only the first result:
from lxml.html import fromstring
tree = fromstring(html)
post= tree.cssselect(".text-large a")[0].text
date = tree.cssselect(".date-display-single")[0].text
author = tree.cssselect(".text-sans")[0].text.strip()
print(post+'\n', date+'\n', author)
Result:
We Have No Idea if Universal Preschool Actually Helps Kids
October 21, 2014
By David J. Armor. Washington Post.
If you run this script, you will see that it is able to parse all the results I'm after:
from lxml.html import fromstring
tree = fromstring(html)
count = tree.cssselect(".text-large a")
for item in range(len(count)):
    post= tree.cssselect(".text-large a")[item].text
    date = tree.cssselect(".date-display-single")[item].text
    author = tree.cssselect(".text-sans")[item].text.strip()
    print(post+'\n', date+'\n', author)
Result:
We Have No Idea if Universal Preschool Actually Helps Kids
October 21, 2014
By David J. Armor. Washington Post.
At Last, Parent Resistance to Collective Standardized Tests
January 15, 2014
By Nat Hentoff. Cato.org.
Day Care: Parents versus Professional Advocates
April 15, 1999
By Darcy Ann Olsen and Eric Olsen. Cato.org.
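However, what I did in the second script isn't pythonic at all, and it gives wrong results if any data is missing. So how can I select a group or container to loop over and parse all of them? Thanks in advance.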
Answer 0 (score: 1)
If one of the text nodes (post, date, author) is missing, tree.cssselect(selector)[index].text should return a NoneType object, which you cannot handle as a string. To avoid this, you can fall back to a default value:
post = tree.cssselect(".text-large a")[item].text or " "
You can also try the following XPath solution:
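A minimal sketch of what such an XPath approach could look like. This is an assumption, not necessarily the answerer's code. It treats each byline paragraph (class text-sans) as the unit to loop over and looks up the title relative to it, so a missing title or date yields None instead of shifting the other values. The SAMPLE markup, parse_entries and first_or_none are hypothetical names introduced for illustration only.

```python
from lxml.html import fromstring

# Trimmed, hypothetical copy of the question's markup: the second byline
# has no title before it and the third byline has no date.
SAMPLE = '''
<div class="view-content">
<p class="text-large experts-more-h"><a href="/a">First title</a></p>
<p class="text-sans">By Author One. Site A. <span class="date-display-single">October 21, 2014</span>.</p>
<p class="text-sans">By Author Two. Site B. <span class="date-display-single">April 15, 1999</span>.</p>
<p class="text-large experts-more-h"><a href="/c">Third title</a></p>
<p class="text-sans">By Author Three. Site B.</p>
</div>
'''


def first_or_none(nodes):
    """Return the first XPath result, or None when nothing matched."""
    return nodes[0] if nodes else None


def parse_entries(markup):
    tree = fromstring(markup)
    entries = []
    # Each byline paragraph acts as the "container" for one entry.
    for byline in tree.xpath('.//p[contains(@class, "text-sans")]'):
        # Nearest preceding <p> sibling, kept only if it is a title paragraph.
        title = byline.xpath(
            'preceding-sibling::p[1][contains(@class, "text-large")]/a/text()'
        )
        date = byline.xpath(
            './span[contains(@class, "date-display-single")]/text()'
        )
        entries.append({
            "title": first_or_none(title),
            "date": first_or_none(date),
            # .text is None when the paragraph starts with a child element.
            "author": (byline.text or "").strip(),
        })
    return entries


for entry in parse_entries(SAMPLE):
    print(entry)
```

Because every lookup is relative to one byline, an entry with a missing piece cannot pull in a value that belongs to its neighbour, which is the failure mode of indexing three separate lists by position.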