Question

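I'm a first-time poster trying to pick up some Python skills; please be kind to me :-)

Although I'm not a complete stranger to programming concepts (I've been messing around with PHP before), the transition to Python has turned out to be somewhat difficult for me. I guess this mostly has to do with the fact that I lack most, if not all, basic understanding of common "design patterns" (?) and such.

With that said, here is the problem. Part of my current project involves writing a simple scraper using Beautiful Soup. The data to be processed has a structure somewhat similar to the one laid out below.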

<table>
    <tr>
        <td class="date">2011-01-01</td>
    </tr>
    <tr class="item">
        <td class="headline">Headline</td>
        <td class="link"><a href="#">Link</a></td>
    </tr>
    <tr class="item">
        <td class="headline">Headline</td>
        <td class="link"><a href="#">Link</a></td>
    </tr>
    <tr>
        <td class="date">2011-01-02</td>
    </tr>
    <tr class="item">
        <td class="headline">Headline</td>
        <td class="link"><a href="#">Link</a></td>
    </tr>
    <tr class="item">
        <td class="headline">Headline</td>
        <td class="link"><a href="#">Link</a></td>
    </tr>
</table>

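The main issue is that I can't get my head around how to 1) keep track of the current date (tr -> td class="date") while 2) looping over the items in the subsequent tr elements (tr class="item" -> td class="headline" and tr class="item" -> td class="link") and 3) storing the processed data in an array.

Additionally, all data will be inserted into a database, where each entry must contain the following information:

Date
Headline
Link

Do note that crud:ing the database is not part of the problem; I only mention it to better illustrate what I'm trying to accomplish here :-)

Now, there are many different ways to skin a cat. So while a solution to the problem at hand is indeed very welcome, I'd be extremely grateful if someone would care to elaborate on the actual logic and strategy you would use to "attack" this kind of problem :-)

Last but not least, sorry for such a noobish question.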

Answer 1

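The basic problem is that this table is marked up for looks, not for semantic structure. Properly done, each date and its related items should share a parent. Unfortunately, they don't, so we'll have to make do.

The basic strategy is to iterate through each row in the table:

if the first table cell has class 'date', we get the date value and update last_seen_date
otherwise, we extract a headline and a link, then save (last_seen_date, headline, link) to the database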

from bs4 import BeautifulSoup  # Beautiful Soup 4; the original answer used the older BeautifulSoup 3 module

fname = r'c:\mydir\beautifulSoup.html'
with open(fname, 'r') as f:
    soup = BeautifulSoup(f, 'html.parser')

items = []
last_seen_date = None
for el in soup.find_all('tr'):
    daterow = el.find('td', {'class': 'date'})
    if daterow is None:     # not a date row - get headline and link
        headline = el.find('td', {'class': 'headline'}).text
        link = el.find('a').get('href')
        items.append((last_seen_date, headline, link))
    else:                   # date row - remember the new date
        last_seen_date = daterow.text
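
The loop above only collects tuples in a list; the "save to the database" step is left open. As a minimal sketch of that step, assuming SQLite is acceptable and using a hypothetical table named entries with hypothetical column names (none of these come from the question), the collected rows could be stored like this:

```python
import sqlite3

# Hypothetical sample of what `items` might hold after the scraping loop.
items = [
    ("2011-01-01", "Headline", "#"),
    ("2011-01-02", "Headline", "#"),
]

def save_items(conn, rows):
    """Insert (date, headline, link) tuples.

    The table name and column names are assumptions for illustration.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries "
        "(entry_date TEXT, headline TEXT, link TEXT)"
    )
    # Placeholders keep scraped text from being interpreted as SQL.
    conn.executemany(
        "INSERT INTO entries (entry_date, headline, link) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
save_items(conn, items)
stored = conn.execute(
    "SELECT entry_date, headline, link FROM entries ORDER BY entry_date"
).fetchall()
print(stored)
```

Collecting everything first and inserting with a single executemany call keeps the scraping logic and the storage logic separate, which makes each easier to test on its own.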

Answer 2

You can use ElementTree, which is included in the Python standard library.

http://docs.python.org/library/xml.etree.elementtree.html

from xml.etree.ElementTree import ElementTree

tree = ElementTree()
tree.parse('page.xhtml')  # This is the XHTML provided in the OP
root = tree.getroot()     # Returns the root "table" element
print(root.tag)           # "table"
for eachTableRow in root:
    # Iterating over root yields its child <tr> elements
    # (getchildren() was removed in Python 3.9),
    # so we loop over them and check their attributes
    if 'class' in eachTableRow.attrib:
        # Good to go. Now we know to look for the headline and link
        pass
    else:
        # Okay, so look for the date
        pass

This should be enough to help you get started on this problem.
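
For completeness, here is one way the two pass branches might be filled in. This is only a sketch: it assumes the page is well-formed XML (as the sample in the question is), parses a trimmed copy of that sample from a string instead of page.xhtml, and the names extract_items and current_date are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question, trimmed to one item per date.
XHTML = """
<table>
    <tr><td class="date">2011-01-01</td></tr>
    <tr class="item">
        <td class="headline">Headline</td>
        <td class="link"><a href="#">Link</a></td>
    </tr>
    <tr><td class="date">2011-01-02</td></tr>
    <tr class="item">
        <td class="headline">Headline</td>
        <td class="link"><a href="#">Link</a></td>
    </tr>
</table>
"""

def extract_items(xhtml):
    """Return a list of (date, headline, link) tuples from the table markup."""
    root = ET.fromstring(xhtml)
    items = []
    current_date = None  # most recent date row seen so far
    for row in root.iter("tr"):
        if row.get("class") == "item":
            # Item row: pair its headline and link with the remembered date.
            headline = row.find("td[@class='headline']").text
            link = row.find("td[@class='link']/a").get("href")
            items.append((current_date, headline, link))
        else:
            # Otherwise look for a date cell and remember its value.
            date_cell = row.find("td[@class='date']")
            if date_cell is not None:
                current_date = date_cell.text
    return items

items = extract_items(XHTML)
print(items)
```

Note that ElementTree is a strict XML parser, so real-world HTML that is not well-formed will raise a parse error; in that case a lenient parser such as Beautiful Soup is the safer choice.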

How to loop through an HTML table dataset in Python
