I am fetching a value from a URL.
import urllib2
response = urllib2.urlopen('url')
response.read()
It returns a very long string, so I am only including the part relevant to my question here.
STRING TYPE OUTPUT:
'<p>Dear Customer,</p>
<p>This notice serves as proof of delivery for the shipment listed below.</p>
<dl class="outHozFixed clearfix"><label>Weight:</label></dt><dd>18.00 lbs</dd>
<dt><label>Shipped/Billed On:</label></dt><dd>09/11/2015</dd>
<dt><label>Delivered On:</label></dt><dd>09/14/2015 11:07 A.M.</dd>
<dt><label for="">Signed By:</label></dt><dd>Odedra</dd></dt>
<dt><label>Left At:</label></dt>
<dd>Office</dd></dl><p>Thank you for giving us this opportunity to serve you.</p>'
Question:
How do I get the date (09/14/2015 11:07 A.M.) assigned to Delivered On?
Answer 0 (score: 6)
You can start by using Beautiful Soup or some other HTML parser. It might look like this:
from bs4 import BeautifulSoup
import urllib2
response = urllib2.urlopen('url')
html = response.read()
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly for consistent results
datestr = soup.find("label", text="Delivered On:").find_parent("dt").find_next_sibling("dd").string
Once you have the date string, you can use strptime to convert it to a datetime object if you need one.
import datetime
# %p expects "AM"/"PM", so strip the periods from "A.M." before parsing
date = datetime.datetime.strptime(datestr.replace(".", ""), "%m/%d/%Y %I:%M %p")
Remember: you generally should not find yourself parsing HTML or XML with regular expressions...
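If installing a third-party parser is not an option, the same label-then-value lookup can be sketched with the standard library. This is only a minimal illustration, assuming Python 3 (where the module is named `html.parser`); the class name `LabelValueParser` and the shortened `sample` markup are made up for this example and are not part of the answer above.

```python
from html.parser import HTMLParser


class LabelValueParser(HTMLParser):
    """Collect {label text: <dd> text} pairs (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.pairs = {}
        self._inside = None   # "label", "dd", or None
        self._label = None    # most recent label still waiting for its <dd>

    def handle_starttag(self, tag, attrs):
        if tag in ("label", "dd"):
            self._inside = tag

    def handle_endtag(self, tag):
        if tag == self._inside:
            self._inside = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # ignore whitespace between tags
        if self._inside == "label":
            self._label = text.rstrip(":")
        elif self._inside == "dd" and self._label is not None:
            self.pairs[self._label] = text
            self._label = None


# Shortened copy of the markup from the question, stray </dt> included.
sample = (
    '<dl class="outHozFixed clearfix"><label>Weight:</label></dt><dd>18.00 lbs</dd>\n'
    '<dt><label>Delivered On:</label></dt><dd>09/14/2015 11:07 A.M.</dd>\n'
    '<dt><label>Left At:</label></dt>\n'
    '<dd>Office</dd></dl>'
)
parser = LabelValueParser()
parser.feed(sample)
parser.close()
print(parser.pairs["Delivered On"])  # 09/14/2015 11:07 A.M.
```

Because the parser only reacts to `label` and `dd` events, the unbalanced `</dt>` tags in the sample do not get in its way.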
Answer 1 (score: 1)
Try this code:
import re
text = '''<p>Dear Customer,</p>
<p>This notice serves as proof of delivery for the shipment listed below.</p>
<dl class="outHozFixed clearfix"><label>Weight:</label></dt>
<dd>18.00 lbs</dd>
<dt><label>Shipped/Billed On:</label></dt>
<dd>09/11/2015</dd>
<dt><label>Delivered On:</label></dt><dd>09/14/2015 11:07 A.M.</dd>
<dt><label for="">Signed By:</label></dt><dd>Odedra</dd></dt>
<dt><label>Left At:</label></dt>
<dd>Office</dd></dl><p>Thank you for giving us this opportunity to serve you.</p>'''
re.findall(r'<dt><label>Delivered On:<\/label><\/dt><dd>([0-9\.\/\s:APM]+)', text)
Output:
['09/14/2015 11:07 A.M.']
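Note that this pattern only matches when `</dt>` and `<dd>` sit directly next to each other. A slightly more tolerant variant might look like the sketch below. The line break inserted between `</dt>` and `<dd>` in `sample` is an assumption made to demonstrate the whitespace case, and the name `DELIVERED_RE` is hypothetical.

```python
import re

# Markup adapted from the question; the newline before <dd> is an assumption.
sample = """<dt><label>Shipped/Billed On:</label></dt><dd>09/11/2015</dd>
<dt><label>Delivered On:</label></dt>
<dd>09/14/2015 11:07 A.M.</dd>
<dt><label for="">Signed By:</label></dt><dd>Odedra</dd>"""

# \s* tolerates whitespace between tags; [^<]+ captures up to the next tag.
DELIVERED_RE = re.compile(r"Delivered On:</label>\s*</dt>\s*<dd>\s*([^<]+)")

match = DELIVERED_RE.search(sample)
delivered = match.group(1).strip() if match else None
print(delivered)  # 09/14/2015 11:07 A.M.
```

Capturing with `[^<]+` instead of a fixed character class also avoids having to list every character the timestamp might contain.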
Answer 2 (score: 1)
Based on that output alone, I would use re and re.search. Build a regular expression that finds the date and time, like this:
import re
output = '''<p>Dear Customer,</p>
<p>This notice serves as proof of delivery for the shipment listed below.</p>
<dl class="outHozFixed clearfix"><label>Weight:</label></dt><dd>18.00 lbs</dd>
<dt><label>Shipped/Billed On:</label></dt><dd>09/11/2015</dd>
<dt><label>Delivered On:</label></dt><dd>09/14/2015 11:07 A.M.</dd>
<dt><label for="">Signed By:</label></dt><dd>Odedra</dd></dt>
<dt><label>Left At:</label></dt>
<dd>Office</dd></dl><p>Thank you for giving us this opportunity to serve you.</p>'''
pattern = r'\d{2}/\d{2}/\d{4} \d{1,2}:\d{2} [AP]\.M\.'  # raw string; [AP] instead of [A|P]
result = re.search(pattern, output).group(0)  # search the `output` string defined above
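Two things are worth keeping in mind with this approach: `re.search` returns `None` when nothing matches, and the result is still a plain string. A rough sketch that handles both, using named groups, could look like this. The function name `find_timestamp` and the constant `TIMESTAMP_RE` are hypothetical, and allowing one-digit months and days is an assumption beyond the pattern above.

```python
import re
from datetime import datetime

TIMESTAMP_RE = re.compile(
    r"(?P<month>\d{1,2})/(?P<day>\d{1,2})/(?P<year>\d{4}) "
    r"(?P<hour>\d{1,2}):(?P<minute>\d{2}) (?P<half>[AP])\.M\."
)


def find_timestamp(html):
    """Return the first 'M/D/YYYY H:MM A.M.|P.M.' timestamp as a datetime, or None."""
    match = TIMESTAMP_RE.search(html)
    if match is None:
        return None
    hour = int(match.group("hour")) % 12  # maps 12 to 0 before the P.M. offset
    if match.group("half") == "P":
        hour += 12
    return datetime(
        int(match.group("year")),
        int(match.group("month")),
        int(match.group("day")),
        hour,
        int(match.group("minute")),
    )


print(find_timestamp("<dd>09/14/2015 11:07 A.M.</dd>"))  # 2015-09-14 11:07:00
```

As with the original pattern, this picks up the first timestamp anywhere in the page, so it relies on "Delivered On" being the only field that carries a time.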
Answer 3 (score: 1)
If you don't like regular expressions or third-party libraries, you can always fall back on the old-school hard-coded solution:
start_index = input_text.index("Delivered On:")+len("Delivered On:</label></dt><dd>")
stop_index = start_index + 21  # 21 == len("09/14/2015 11:07 A.M.")
text_date = input_text[start_index:stop_index]
For the one-line case:
{{1}}
After all, any solution to your problem is just a different kind of hard-coding :(
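The fixed width of 21 characters breaks as soon as the day or month loses its leading zero. One way around that, sketched below, is to slice up to the closing tag instead. The helper name `extract_between` is hypothetical, and the one-line `sample` is a shortened stand-in for the real page.

```python
def extract_between(text, start_marker, end_marker):
    """Return the text between start_marker and the next end_marker, or None."""
    start = text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    end = text.find(end_marker, start)  # look for the end only after the start
    if end == -1:
        return None
    return text[start:end].strip()


sample = '<dt><label>Delivered On:</label></dt><dd>12/4/2015 11:07 A.M.</dd>'
print(extract_between(sample, "Delivered On:</label></dt><dd>", "</dd>"))
# 12/4/2015 11:07 A.M.
```

Using `str.find` rather than `str.index` lets the helper return `None` for a missing marker instead of raising `ValueError`.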
Answer 4 (score: 1)
Try this code:
import re
a = """<p>Dear Customer,</p><p>This notice serves as proof of delivery for the shipment listed below.</p><dl class="outHozFixed clearfix"><label>Weight:</label></dt><dd>18.00 lbs</dd><dt><label>Shipped/Billed On:</label></dt><dd>09/11/2015</dd><dt><label>Delivered On:</label></dt><dd>12/4/2015 11:07 A.M.</dd><dt><label for="">Signed By:</label></dt><dd>Odedra</dd></dt><dt><label>Left At:</label></dt><dd>Office</dd></dl><p>Thank you for giving us this opportunity to serve you.</p>"""
data = re.search('Delivered On:</label></dt><dd>(.*)$',a)
if data and data.group(1)[:1].isdigit():
    # keep only the first 20 characters, i.e. the timestamp itself
    print(data.group(1)[:20])