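I would like some help writing a small piece of code with BeautifulSoup. I want to accomplish two things.
First, I want to retrieve only the dates found in the code below, e.g. <td>02/02/2011</td>, and store them in a list. My problem is that there are several td tags, and I do not know how to retrieve specifically the ones that contain date strings.
Second, after retrieving the list of dates, I will have two variables: one holding a single date and one holding the list of all dates. For example, date_lst_single = "<td>01/04/2011</td>" and date_lst_all = ["LIST OF ALL DATES"].
Finally, I want to create some kind of find condition. I will use datetime for it, but I only need the logic behind it. Whatever date is held in date_lst = <td>02/02/2011</td>, I want to find only the date that comes immediately before it. For example, in the code below the date preceding this one is <td>01/04/2011</td>; I then want to retrieve that date and store it in a new variable.
Update: In case this helps anyone else, the final part of the code I was asking for incorporates the help I received in this post.
The code in question: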
<td>10/24/2011</td>,
<td><span class="match">11</span> - <span class="match">12</span> - <span class="match">17</span> - 31 - 33</td>,
<span class="match">11</span>,
<span class="match">12</span>,
<span class="match">17</span>,
<td>8,210</td>,
<td>$10.50</td>,
<tr><td>06/15/2011</td><td>1 - 7 - <span class="match">11</span> - <span class="match">12</span> - <span class="match">17</span></td><td>15,369</td><td>$6.50</td></tr>,
<td>06/15/2011</td>,
<td>1 - 7 - <span class="match">11</span> - <span class="match">12</span> - <span class="match">17</span></td>,
<span class="match">11</span>,
<span class="match">12</span>,
<span class="match">17</span>,
<td>15,369</td>,
<td>$6.50</td>,
<tr class="alt"><td>01/04/2011</td><td><span class="match">11</span> - <span class="match">12</span> - <span class="match">20</span> - 21 - 27</td><td>10,752</td><td>$15.00</td></tr>,
<td>01/04/2011</td>,
<td><span class="match">11</span> - <span class="match">12</span> - <span class="match">20</span> - 21 - 27</td>,
<span class="match">11</span>,
<span class="match">12</span>,
<span class="match">20</span>,
<td>10,752</td>,
<td>$15.00</td>,
<tr><td>09/24/2009</td><td>2 - 3 - <span class="match">11</span> - <span class="match">12</span> - <span class="match">17</span></td><td>11,406</td><td>$7.50</td></tr>,
<td>09/24/2009</td>,
<td>2 - 3 - <span class="match">11</span> - <span class="match">12</span> - <span class="match">17</span></td>,
<span class="match">11</span>,
<span class="match">12</span>,
<span class="match">17</span>,
<td>11,406</td>,
<td>$7.50</td>,
<tr class="alt"><td>08/08/2009</td><td><span class="match">12</span> - <span class="match">20</span> - 26 - 28 - <span class="match">30</span></td><td>10,267</td><td>$11.00</td></tr>,
<td>08/08/2009</td>,
<td><span class="match">12</span> - <span class="match">20</span> - 26 - 28 - <span class="match">30</span></td>,
<span class="match">12</span>,
<span class="match">20</span>,
<span class="match">30</span>,
<td>10,267</td>,
<td>$11.00</td>,
<tr><td>05/05/2009</td><td>8 - <span class="match">11</span> - <span class="match">12</span> - <span class="match">20</span> - 26</td><td>11,260</td><td>$8.00</td></tr>,
<td>05/05/2009</td>,
<td>8 - <span class="match">11</span> - <span class="match">12</span> - <span class="match">20</span> - 26</td>,
<span class="match">11</span>,
<span class="match">12</span>,
<span class="match">20</span>,
<td>11,260</td>,
<td>$8.00</td>,
<tr class="alt"><td>04/07/2009</td><td>10 - <span class="match">11</span> - <span class="match">12</span> - 16 - <span class="match">17</span></td><td>11,163</td><td>$8.50</td></tr>,
<td>04/07/2009</td>,
<td>10 - <span class="match">11</span> - <span class="match">12</span> - 16 - <span class="match">17</span></td>,
<span class="match">11</span>,
<span class="match">12</span>,
<span class="match">17</span>,
<td>11,163</td>,
<td>$8.50</td>,
<tr><td>01/31/2009</td><td>3 - <span class="match">17</span> - <span class="match">20</span> - <span class="match">30</span> - 34</td><td>10,086</td><td>$11.50</td></tr>,
<td>01/31/2009</td>,
<td>3 - <span class="match">17</span> - <span class="match">20</span> - <span class="match">30</span> - 34</td>,
<span class="match">17</span>,
<span class="match">20</span>,
<span class="match">30</span>,
<td>10,086</td>,
<td>$11.50</td>,
<tr class="alt"><td>08/06/2008</td><td>4 - <span class="match">11</span> - <span class="match">12</span> - <span class="match">30</span> - 32</td><td>9,497</td><td>$11.00</td></tr>,
<td>08/06/2008</td>,
<td>4 - <span class="match">11</span> - <span class="match">12</span> - <span class="match">30</span> - 32</td>,
<span class="match">11</span>,
<span class="match">12</span>,
<span class="match">30</span>,
<td>9,497</td>,
<td>$11.00</td>,
Answer 0 (score: 0)
You can recover the dates from the td elements in your code this way.
>>> HTML = open('trial.htm').read()
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(HTML, 'lxml')
>>> import re
>>> def isDate(tag):
...     if tag.name == 'td':
...         return bool(re.match(r'\d\d/\d\d/\d\d', tag.text))
...     else:
...         return False
...
>>> tds = soup.find_all(isDate)
>>> tds
[<td>10/24/2011</td>, <td>06/15/2011</td>, <td>06/15/2011</td>, <td>01/04/2011</td>, <td>01/04/2011</td>, <td>09/24/2009</td>, <td>09/24/2009</td>, <td>08/08/2009</td>, <td>08/08/2009</td>, <td>05/05/2009</td>, <td>05/05/2009</td>, <td>04/07/2009</td>, <td>04/07/2009</td>, <td>01/31/2009</td>, <td>01/31/2009</td>, <td>08/06/2008</td>, <td>08/06/2008</td>]
>>> dates = [_.text for _ in tds]
>>> dates
['10/24/2011', '06/15/2011', '06/15/2011', '01/04/2011', '01/04/2011', '09/24/2009', '09/24/2009', '08/08/2009', '08/08/2009', '05/05/2009', '05/05/2009', '04/07/2009', '04/07/2009', '01/31/2009', '01/31/2009', '08/06/2008', '08/06/2008']
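The output above lists most dates twice. If repeats are unwanted, one possible follow-up step is sketched below. It assumes the strings are already extracted (the list here is a shortened copy of the output above), and the names date_pattern and unique_dates are illustrative, not part of the original answer.

```python
import re

# Shortened copy of the date strings produced by the session above.
dates = ['10/24/2011', '06/15/2011', '06/15/2011', '01/04/2011', '01/04/2011']

# Keep only strings that are exactly MM/DD/YYYY, then drop repeats.
# dict.fromkeys keeps the first occurrence of each key in order (Python 3.7+).
date_pattern = re.compile(r'\d{2}/\d{2}/\d{4}')
unique_dates = list(dict.fromkeys(d for d in dates if date_pattern.fullmatch(d)))

print(unique_dates)
```

Using fullmatch instead of match rejects strings that merely start with something date-like.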
I am not entirely sure what the rest of your question means, but I suspect you can gather what you need from this.
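For the "date right before a given date" part of the question, a minimal sketch of the logic might look like the following. It assumes the dates are plain MM/DD/YYYY strings with the td markup already stripped; the helper previous_date is a hypothetical name, and date_lst_all and date_lst_single reuse the variable names from the question with sample values.

```python
from datetime import datetime

DATE_FORMAT = '%m/%d/%Y'

def previous_date(date_strings, given):
    """Return the latest date string strictly before `given`, or None."""
    target = datetime.strptime(given, DATE_FORMAT)
    # Keep only the dates that fall before the target date.
    earlier = [d for d in date_strings
               if datetime.strptime(d, DATE_FORMAT) < target]
    if not earlier:
        return None
    # Of the earlier dates, the most recent one is the closest predecessor.
    return max(earlier, key=lambda d: datetime.strptime(d, DATE_FORMAT))

date_lst_all = ['10/24/2011', '06/15/2011', '01/04/2011', '09/24/2009']
date_lst_single = '02/02/2011'
print(previous_date(date_lst_all, date_lst_single))
```

Comparing parsed datetime objects rather than the raw strings matters here, because MM/DD/YYYY strings do not sort chronologically.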
Answer 1 (score: 0)
You can use lxml (or BeautifulSoup with lxml) to parse your file, then use an XPath expression to select all <td> nodes:
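from lxml import etree
# path is the location of your HTML file; HTMLParser tolerates markup that is not well-formed XML
tree = etree.parse(path, parser=etree.HTMLParser())
td_nodes = tree.xpath('//td')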
You can then define a function that identifies nodes containing a valid date:
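import datetime

def is_date(node):
    try:
        return datetime.datetime.strptime(node.text, '%m/%d/%Y')
    except (TypeError, ValueError):
        # TypeError covers nodes whose text is None, e.g. a <td> that starts with a child element
        return None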
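To see how this kind of validation behaves on the different cell contents in the table, here is a small standalone sketch. It works on plain strings instead of nodes, and parse_date is a hypothetical helper name, not something from the answer.

```python
import datetime

def parse_date(text, fmt='%m/%d/%Y'):
    """Return a datetime for a valid date string, otherwise None."""
    try:
        return datetime.datetime.strptime(text, fmt)
    except (TypeError, ValueError):
        # TypeError: text is None (a cell with no leading text).
        # ValueError: text does not match the expected format.
        return None

print(parse_date('01/04/2011'))  # a date cell
print(parse_date('10,752'))      # a count cell
print(parse_date('$15.00'))      # a prize cell
print(parse_date(None))          # a cell whose text is missing
```

Only the first call yields a datetime; the other cell types fall through to None, which is what lets the later filtering steps ignore them.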
You can use this function to find the date attached to each td node:
nodes_with_date = {node: is_date(node) for node in td_nodes}
The list of td nodes that contain a date can then be extracted like this:
date_list = [node for node, date in nodes_with_date.items()
             if date]
The list of td nodes dated before a given date (given_date, a datetime.datetime) is:
date_list = [node for node, date in nodes_with_date.items()
             if date and date < given_date]
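If only the single node closest to the given date is wanted, rather than every earlier node, one option is to take the maximum of the filtered dates. The sketch below is an illustration under stated assumptions: it uses the standard library's xml.etree.ElementTree as a stand-in for lxml so that it runs without extra packages, SAMPLE is a made-up well-formed excerpt of the table, and the names earlier and previous_node are hypothetical.

```python
import datetime
import xml.etree.ElementTree as ET

# A well-formed stand-in for a few of the table rows in the question.
SAMPLE = """<table>
<tr><td>06/15/2011</td><td>15,369</td></tr>
<tr><td>01/04/2011</td><td>10,752</td></tr>
<tr><td>09/24/2009</td><td>11,406</td></tr>
</table>"""

def is_date(node):
    try:
        return datetime.datetime.strptime(node.text, '%m/%d/%Y')
    except (TypeError, ValueError):
        return None

root = ET.fromstring(SAMPLE)
nodes_with_date = {node: is_date(node) for node in root.iter('td')}

given_date = datetime.datetime(2011, 2, 2)
# Map each node dated before given_date to its parsed date.
earlier = {node: date for node, date in nodes_with_date.items()
           if date and date < given_date}

# The node with the largest earlier date is the immediate predecessor.
previous_node = max(earlier, key=earlier.get) if earlier else None
print(previous_node.text if previous_node is not None else None)
```

The explicit "is not None" check is deliberate: an element without children is falsy in ElementTree, so a plain truth test on the node would be misleading.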