Unable to scrape specific items from some elements

Time: 2017-09-06 12:52:03

Tags: python python-3.x web-scraping css-selectors

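How should I loop when there is no container or group, common to each set of items, that I can select in order to parse the required items? I want to parse the text, date, and author from the pasted elements. The three results I'm after don't belong to any particular group or container, so I can't find the right way to build a loop over them.

Here are the elements: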

html = '''
<div class="view-content">            
  <p class="text-large experts-more-h">   
  <a href="/publications/commentary/we-have-no-idea-universal-preschool-actually-helps-kids">We Have No Idea if Universal Preschool Actually Helps Kids</a>
  </p>
  <p class="text-sans">    
  By David J. Armor. Washington Post. <span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="2014-10-21T09:34:00-04:00">October 21, 2014</span>.
  </p>        
  <p class="text-large experts-more-h">   
  <a href="/publications/commentary/last-parent-resistance-collective-standardized-tests">At Last, Parent Resistance to Collective Standardized Tests</a>
  </p>
  <p class="text-sans">    
  By Nat Hentoff. Cato.org. <span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="2014-01-15T09:57:00-05:00">January 15, 2014</span>.
  </p>  
  <p class="text-sans">    
  By Darcy Ann Olsen and Eric Olsen. Cato.org. <span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="1999-04-15T00:00:00-04:00">April 15, 1999</span>.
  </p>       
  <p class="text-large experts-more-h">   
  <a href="/publications/commentary/day-care-parents-versus-professional-advocates-0">Day Care: Parents versus Professional Advocates</a>
  </p>
  <p class="text-sans">   
  By Darcy Ann Olsen. Cato.org. <span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="1998-06-01T00:00:00-04:00">June 1, 1998</span>.
  </p>  
</div>
'''

If you run my script, you can see that only the first result is scraped:

from lxml.html import fromstring

tree = fromstring(html)
post= tree.cssselect(".text-large a")[0].text
date = tree.cssselect(".date-display-single")[0].text
author = tree.cssselect(".text-sans")[0].text.strip()
print(post+'\n', date+'\n', author)

Result:

We Have No Idea if Universal Preschool Actually Helps Kids
 October 21, 2014
 By David J. Armor. Washington Post.

If you run this script, you will see that it is able to parse all the results I'm after:

from lxml.html import fromstring

tree = fromstring(html)
count = tree.cssselect(".text-large a")

for item in range(len(count)):
    post= tree.cssselect(".text-large a")[item].text
    date = tree.cssselect(".date-display-single")[item].text
    author = tree.cssselect(".text-sans")[item].text.strip()
    print(post+'\n', date+'\n', author)

Result:

We Have No Idea if Universal Preschool Actually Helps Kids
 October 21, 2014
 By David J. Armor. Washington Post.
At Last, Parent Resistance to Collective Standardized Tests
 January 15, 2014
 By Nat Hentoff. Cato.org.
Day Care: Parents versus Professional Advocates
 April 15, 1999
 By Darcy Ann Olsen and Eric Olsen. Cato.org.

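However, what I did in the second script is not Pythonic at all, and it gives wrong results if any data is missing. So how can I select a group or container, loop over it, and parse all of them? Thanks in advance.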

1 Answer:

Answer 0: (score: 1)

If one of the text nodes (post, date, author) is missing, tree.cssselect(selector)[index].text returns None, a NoneType object that you cannot handle as a string. To avoid this, you can implement

post = tree.cssselect(".text-large a")[item].text or " "

You can also try an XPath solution.
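One possible XPath-based approach, offered as a sketch rather than as the code this answer originally contained, is to treat each `text-sans` paragraph as the anchor of an entry and look up its title relative to it, instead of indexing three independent lists. The helper name `parse_entries` is hypothetical, the markup is an abbreviated copy of the question's snippet (`href="#"` is a placeholder), and the sketch assumes a title, when present, is the `<p>` immediately preceding its byline:

```python
from lxml.html import fromstring

# Abbreviated copy of the question's markup (attributes trimmed, href is a placeholder).
html = '''
<div class="view-content">
  <p class="text-large experts-more-h"><a href="#">We Have No Idea if Universal Preschool Actually Helps Kids</a></p>
  <p class="text-sans">By David J. Armor. Washington Post. <span class="date-display-single">October 21, 2014</span>.</p>
  <p class="text-large experts-more-h"><a href="#">At Last, Parent Resistance to Collective Standardized Tests</a></p>
  <p class="text-sans">By Nat Hentoff. Cato.org. <span class="date-display-single">January 15, 2014</span>.</p>
  <p class="text-sans">By Darcy Ann Olsen and Eric Olsen. Cato.org. <span class="date-display-single">April 15, 1999</span>.</p>
  <p class="text-large experts-more-h"><a href="#">Day Care: Parents versus Professional Advocates</a></p>
  <p class="text-sans">By Darcy Ann Olsen. Cato.org. <span class="date-display-single">June 1, 1998</span>.</p>
</div>
'''


def parse_entries(markup):
    """Hypothetical helper: build one record per byline paragraph."""
    tree = fromstring(markup)
    entries = []
    # contains() is a substring match on the class attribute, which is
    # enough for this markup but looser than a real CSS class selector.
    for byline in tree.xpath('.//p[contains(@class, "text-sans")]'):
        # Nearest preceding <p>, kept only if it is a title paragraph.
        title = byline.xpath(
            'preceding-sibling::p[1][contains(@class, "text-large")]/a/text()'
        )
        # string() yields "" when the span is absent, so no None handling is needed.
        date = byline.xpath(
            'string(span[contains(@class, "date-display-single")])'
        )
        entries.append({
            "post": title[0].strip() if title else "",
            "date": date.strip(),
            "author": (byline.text or "").strip(),
        })
    return entries


entries = parse_entries(html)
for entry in entries:
    print(entry["post"], entry["date"], entry["author"], sep="\n ")
```

Because every field is looked up relative to its own paragraph, a missing title yields an empty string for that entry instead of shifting the later results, and the `or ""` guard covers a paragraph with no leading text.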