>>> text = '<a data-lecture-id="47"\n data-modal-iframe="https://class.coursera.org/neuralnets-2012-001/lecture/view?lecture_id=47"\n href="https://class.coursera.org/neuralnets-2012-001/lecture/47"\n data-modal=".course-modal-frame"\n rel="lecture-link"\n class="lecture-link">\nAnother diversion: The softmax output function [7 min]</a>'
>>> import re
>>> re.findall(r'data-lecture-id="(\d+)"|(.*)</a>', text)
[('47', ''), ('', 'Another diversion: The softmax output function [7 min]')]
How can I extract the data in this form instead:
['47', 'Another diversion: The softmax output function [7 min]']
I think there should be a smarter regular expression for this.
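Answer 0 (score: 2)
You can use itertools to flatten the tuples and drop the empty strings. The code is Python 2; in Python 3, ifilter is replaced by the built-in filter and print is a function.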
import re
from itertools import chain, ifilter
raw_found = re.findall(r'data-lecture-id="(\d+)"|(.*)</a>', text)
# simple
found = [x for x in chain(*raw_found) if x]
# or faster
found = [x for x in ifilter(None, chain(*raw_found))]
# or more compact, also just as fast
found = list(ifilter(None, chain(*raw_found)))
print found
Output:
['47', 'Another diversion: The softmax output function [7 min]']
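For readers on Python 3, the same flatten-and-filter idea can be sketched as below. This is only an adaptation of the answer above, not a different method: `chain.from_iterable` stands in for `chain(*raw_found)`, and the list comprehension drops the empty strings that the alternation in the pattern leaves behind.

```python
import re
from itertools import chain

# Same sample markup as in the question.
text = ('<a data-lecture-id="47"\n'
        ' data-modal-iframe="https://class.coursera.org/neuralnets-2012-001/lecture/view?lecture_id=47"\n'
        ' href="https://class.coursera.org/neuralnets-2012-001/lecture/47"\n'
        ' data-modal=".course-modal-frame"\n'
        ' rel="lecture-link"\n'
        ' class="lecture-link">\n'
        'Another diversion: The softmax output function [7 min]</a>')

# Each match is a 2-tuple in which only one group is non-empty.
raw_found = re.findall(r'data-lecture-id="(\d+)"|(.*)</a>', text)

# Flatten the tuples and keep only the non-empty strings.
found = [x for x in chain.from_iterable(raw_found) if x]
print(found)
```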
Answer 1 (score: 2)
It is not recommended to parse HTML with regular expressions. You can try the xml.dom.minidom module:
from xml.dom.minidom import parseString
xml = parseString('<a data-lecture-id="47"\n data-modal-iframe="https://class.coursera.org/neuralnets-2012-001/lecture/view?lecture_id=47"\n href="https://class.coursera.org/neuralnets-2012-001/lecture/47"\n data-modal=".course-modal-frame"\n rel="lecture-link"\n class="lecture-link">\nAnother diversion: The softmax output function [7 min]</a>')
anchor = xml.getElementsByTagName("a")[0]
print anchor.getAttribute("data-lecture-id"), anchor.childNodes[0].data
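A minimal Python 3 sketch of the same minidom approach, producing the flat list the question asks for. It assumes the fragment is well-formed XML, as the sample is; `parseString` raises an error on markup that is not, so this may not carry over to arbitrary real-world HTML. The `strip()` call removes the newline that precedes the link text.

```python
from xml.dom.minidom import parseString

# Same sample markup as in the question.
text = ('<a data-lecture-id="47"\n'
        ' data-modal-iframe="https://class.coursera.org/neuralnets-2012-001/lecture/view?lecture_id=47"\n'
        ' href="https://class.coursera.org/neuralnets-2012-001/lecture/47"\n'
        ' data-modal=".course-modal-frame"\n'
        ' rel="lecture-link"\n'
        ' class="lecture-link">\n'
        'Another diversion: The softmax output function [7 min]</a>')

# Parse the fragment and take the first <a> element.
anchor = parseString(text).getElementsByTagName("a")[0]

# Attribute value plus the text node, with surrounding whitespace removed.
found = [anchor.getAttribute("data-lecture-id"), anchor.firstChild.data.strip()]
print(found)
```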
Answer 2 (score: 0)
I found a solution myself:
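>>> re.findall(r'data-lecture-id="(\d+)"[\s\S]+>([\s\S]+)</a>', text)
[('47', '\nAnother diversion: The softmax output function [7 min]')]
This looks better, but the result still has to be iterated over to get a simple flat list, and the title keeps its leading newline.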
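One possible way to get the flat list directly for a single link is `re.search` with two groups, sketched below in Python 3. The pattern is a variation on the one above and rests on assumptions: no attribute value contains `>`, and the link text contains no nested tags. The `\s*` on either side of the second group trims the surrounding whitespace.

```python
import re

# Same sample markup as in the question.
text = ('<a data-lecture-id="47"\n'
        ' data-modal-iframe="https://class.coursera.org/neuralnets-2012-001/lecture/view?lecture_id=47"\n'
        ' href="https://class.coursera.org/neuralnets-2012-001/lecture/47"\n'
        ' data-modal=".course-modal-frame"\n'
        ' rel="lecture-link"\n'
        ' class="lecture-link">\n'
        'Another diversion: The softmax output function [7 min]</a>')

# [^>]* skips the remaining attributes (newlines included) up to the end of
# the opening tag; (.*?) with re.DOTALL lazily captures the link text.
pattern = r'data-lecture-id="(\d+)"[^>]*>\s*(.*?)\s*</a>'

match = re.search(pattern, text, re.DOTALL)
found = list(match.groups()) if match else []
print(found)
```

With several links on a page, `re.findall` with the same pattern would return one `(id, title)` tuple per link, which is usually the more convenient shape anyway.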