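How to retrieve a list of date strings using BeautifulSoup in Python

Time: 2017-05-02 20:49:14

Tags: python beautifulsoup

I would like some help writing a small piece of code with BeautifulSoup. I want to accomplish two things.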

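First, I want to retrieve only the dates found in the code below, for example with findAll, and store them in a list. My problem is that there are several <td> tags (for example <td>02/02/2011</td>), so I don't know how to retrieve the date strings specifically.

Second, after retrieving the list of dates, I will have two variables: one holding a single date and the other holding the list of all dates. For example, date_lst_single = "<td>01/04/2011</td>".

Finally, I want to create some kind of find condition. I will rely on datetime for it, but I only need the logic behind it. Whatever date I am looking up, I only want to find the date that comes just before it in date_lst_all = ["LIST OF ALL DATES"]. For example, for date_lst = <td>02/02/2011</td>, the previous date in the code below is the following one, which I then want to retrieve and store in a new variable.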

<td>01/04/2011</td>

Update:

In case this helps anyone else, the final part of the code I asked for is below. I incorporated the help I received in this post.

 <td>10/24/2011</td>,
 <td><span class="match">11</span> - <span class="match">12</span> - <span class="match">17</span> - 31 - 33</td>,
 <span class="match">11</span>,
 <span class="match">12</span>,
 <span class="match">17</span>,
 <td>8,210</td>,
 <td>$10.50</td>,
 <tr><td>06/15/2011</td><td>1 - 7 - <span class="match">11</span> - <span class="match">12</span> - <span class="match">17</span></td><td>15,369</td><td>$6.50</td></tr>,
 <td>06/15/2011</td>,
 <td>1 - 7 - <span class="match">11</span> - <span class="match">12</span> - <span class="match">17</span></td>,
 <span class="match">11</span>,
 <span class="match">12</span>,
 <span class="match">17</span>,
 <td>15,369</td>,
 <td>$6.50</td>,
 <tr class="alt"><td>01/04/2011</td><td><span class="match">11</span> - <span class="match">12</span> - <span class="match">20</span> - 21 - 27</td><td>10,752</td><td>$15.00</td></tr>,
 <td>01/04/2011</td>,
 <td><span class="match">11</span> - <span class="match">12</span> - <span class="match">20</span> - 21 - 27</td>,
 <span class="match">11</span>,
 <span class="match">12</span>,
 <span class="match">20</span>,
 <td>10,752</td>,
 <td>$15.00</td>,
 <tr><td>09/24/2009</td><td>2 - 3 - <span class="match">11</span> - <span class="match">12</span> - <span class="match">17</span></td><td>11,406</td><td>$7.50</td></tr>,
 <td>09/24/2009</td>,
 <td>2 - 3 - <span class="match">11</span> - <span class="match">12</span> - <span class="match">17</span></td>,
 <span class="match">11</span>,
 <span class="match">12</span>,
 <span class="match">17</span>,
 <td>11,406</td>,
 <td>$7.50</td>,
 <tr class="alt"><td>08/08/2009</td><td><span class="match">12</span> - <span class="match">20</span> - 26 - 28 - <span class="match">30</span></td><td>10,267</td><td>$11.00</td></tr>,
 <td>08/08/2009</td>,
 <td><span class="match">12</span> - <span class="match">20</span> - 26 - 28 - <span class="match">30</span></td>,
 <span class="match">12</span>,
 <span class="match">20</span>,
 <span class="match">30</span>,
 <td>10,267</td>,
 <td>$11.00</td>,
 <tr><td>05/05/2009</td><td>8 - <span class="match">11</span> - <span class="match">12</span> - <span class="match">20</span> - 26</td><td>11,260</td><td>$8.00</td></tr>,
 <td>05/05/2009</td>,
 <td>8 - <span class="match">11</span> - <span class="match">12</span> - <span class="match">20</span> - 26</td>,
 <span class="match">11</span>,
 <span class="match">12</span>,
 <span class="match">20</span>,
 <td>11,260</td>,
 <td>$8.00</td>,
 <tr class="alt"><td>04/07/2009</td><td>10 - <span class="match">11</span> - <span class="match">12</span> - 16 - <span class="match">17</span></td><td>11,163</td><td>$8.50</td></tr>,
 <td>04/07/2009</td>,
 <td>10 - <span class="match">11</span> - <span class="match">12</span> - 16 - <span class="match">17</span></td>,
 <span class="match">11</span>,
 <span class="match">12</span>,
 <span class="match">17</span>,
 <td>11,163</td>,
 <td>$8.50</td>,
 <tr><td>01/31/2009</td><td>3 - <span class="match">17</span> - <span class="match">20</span> - <span class="match">30</span> - 34</td><td>10,086</td><td>$11.50</td></tr>,
 <td>01/31/2009</td>,
 <td>3 - <span class="match">17</span> - <span class="match">20</span> - <span class="match">30</span> - 34</td>,
 <span class="match">17</span>,
 <span class="match">20</span>,
 <span class="match">30</span>,
 <td>10,086</td>,
 <td>$11.50</td>,
 <tr class="alt"><td>08/06/2008</td><td>4 - <span class="match">11</span> - <span class="match">12</span> - <span class="match">30</span> - 32</td><td>9,497</td><td>$11.00</td></tr>,
 <td>08/06/2008</td>,
 <td>4 - <span class="match">11</span> - <span class="match">12</span> - <span class="match">30</span> - 32</td>,
 <span class="match">11</span>,
 <span class="match">12</span>,
 <span class="match">30</span>,
 <td>9,497</td>,
 <td>$11.00</td>,

2 Answers:

Answer 0: (score: 0)

You can recover the dates from the td tags this way.

>>> HTML = open('trial.htm').read()
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(HTML, 'lxml')
>>> import re
>>> def isDate(tag):
...     if tag.name=='td':
...         return bool(re.match(r'\d\d/\d\d/\d{4}', tag.text))
...     else:
...         return False
... 
>>> tds = soup.find_all(isDate)
>>> tds
[<td>10/24/2011</td>, <td>06/15/2011</td>, <td>06/15/2011</td>, <td>01/04/2011</td>, <td>01/04/2011</td>, <td>09/24/2009</td>, <td>09/24/2009</td>, <td>08/08/2009</td>, <td>08/08/2009</td>, <td>05/05/2009</td>, <td>05/05/2009</td>, <td>04/07/2009</td>, <td>04/07/2009</td>, <td>01/31/2009</td>, <td>01/31/2009</td>, <td>08/06/2008</td>, <td>08/06/2008</td>]
>>> dates = [_.text for _ in tds]
>>> dates
['10/24/2011', '06/15/2011', '06/15/2011', '01/04/2011', '01/04/2011', '09/24/2009', '09/24/2009', '08/08/2009', '08/08/2009', '05/05/2009', '05/05/2009', '04/07/2009', '04/07/2009', '01/31/2009', '01/31/2009', '08/06/2008', '08/06/2008']
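
The list above contains each date more than once. As a minimal sketch, assuming the `dates` list looks like the output shown (shortened here), the repeats can be dropped and the strings converted to date objects so that they compare chronologically. The names `unique_dates` and `parsed_dates` are only illustrative.

```python
from datetime import datetime

# Shortened sample of the output above; in practice this is the `dates` list.
dates = ['10/24/2011', '06/15/2011', '06/15/2011', '01/04/2011', '01/04/2011']

# dict.fromkeys keeps the first occurrence of each string and preserves
# insertion order (guaranteed for dicts since Python 3.7).
unique_dates = list(dict.fromkeys(dates))

# Parse each MM/DD/YYYY string into a date object.
parsed_dates = [datetime.strptime(d, '%m/%d/%Y').date() for d in unique_dates]

print(unique_dates)                 # ['10/24/2011', '06/15/2011', '01/04/2011']
print(parsed_dates[0].isoformat())  # 2011-10-24
```

Comparing the parsed objects rather than the raw strings matters here, because MM/DD/YYYY strings do not sort in chronological order.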

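I'm not entirely sure what the rest of your question means. However, I suspect you can gather what you need from this.

Answer 1: (score: 0)

You can parse your file with lxml (or BeautifulSoup with lxml), then select every <td> with an XPath expression: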

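from lxml import etree

# path is the location of your HTML file; HTMLParser tolerates markup that is not valid XML
tree = etree.parse(path, etree.HTMLParser())
td_nodes = tree.xpath('//td')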

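Then you can define a function that identifies the nodes holding a valid date:

import datetime

def is_date(node):
    # Return the parsed datetime, or None if the node's text is not a date.
    try:
        return datetime.datetime.strptime(node.text, '%m/%d/%Y')
    except (TypeError, ValueError):
        # TypeError: node.text is None (e.g. the td starts with a child tag)
        return None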

You can use this function to find the date attached to each td node:

nodes_with_date = {node: is_date(node) for node in td_nodes}

The list of td nodes that contain a date can be extracted like this:

date_list = [node for node, date in nodes_with_date.items()
             if date]

The list of td nodes dated before a given date (given_date is a datetime) is:

date_list = [node for node, date in nodes_with_date.items()
             if date and date < given_date]
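
To get only the single date immediately before the given one, as the question asks, one option is to take the latest of the earlier dates. The following is a minimal sketch that works on plain date strings rather than lxml nodes so it can run on its own. The helper `previous_date` and the constant `DATE_FORMAT` are hypothetical names, and the sample list is taken from the dates in the question.

```python
from datetime import datetime

DATE_FORMAT = '%m/%d/%Y'

def previous_date(all_dates, target):
    """Return the latest date string in all_dates strictly before target, or None."""
    target_dt = datetime.strptime(target, DATE_FORMAT)
    # Keep only the dates that fall before the target.
    earlier = [d for d in all_dates
               if datetime.strptime(d, DATE_FORMAT) < target_dt]
    if not earlier:
        return None
    # The closest earlier date is the chronologically largest of them.
    return max(earlier, key=lambda d: datetime.strptime(d, DATE_FORMAT))

date_lst_all = ['10/24/2011', '06/15/2011', '01/04/2011', '09/24/2009', '08/08/2009']

print(previous_date(date_lst_all, '02/02/2011'))  # 01/04/2011
print(previous_date(date_lst_all, '01/04/2011'))  # 09/24/2009
print(previous_date(date_lst_all, '01/01/2000'))  # None
```

The target does not have to appear in the list, and the list does not have to be sorted, since every entry is compared as a datetime.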