I have to create a dictionary of class values from the input HTML.
Input:
<div>
<p id="quarter-line-below1" class="firstpara-rw"><span class="dropcap-image-rw print-exclude-rw"><img alt="2014" src="243864_20.png"/></span><span class="dropcap-rw">2014 </span>has had some .............</p>
<p id="firstpara1" class="firstpara-rw"><span class="dropcap-image-rw print-exclude-rw"><img alt="O" src="243864_69.png"/></span><span class="dropcap-rw">O</span>f course ...........</p>
<p class="test1-image-rw print-exclude-rw" id="ornament1-orn"><img src="243865_18.png" /></p>
<p class="test1-rw" id="ornament1">aA bB cC dD eE fF gG hH iI</p>
<p class="test2-image-rw print-exclude-rw" id="ornament1-orn"><img src="243865_18.png" /></p>
<p class="test22-rw" id="ornament1">aA bB cC dD eE fF gG hH iI</p>
</div>
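My algorithm is:
Parse the content with lxml.
Use xpath to get all tags whose class contains -image-rw.
Build a dictionary whose key is the target -image-rw class value and whose value is that class value with -image removed.
Code: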
import lxml.html as PARSER
import time
start_time = time.time()
root = PARSER.fromstring(content)
target_tags = root.xpath("//*[contains(@class, '-image-rw')]")
valid_class = {}
#- Validation.
for i in target_tags:
    target_class = [j.strip() for j in i.attrib["class"].split() if "-image-rw" in j][0].strip()
    target_class_next = target_class.replace("-image-rw", "-rw")
    try:
        for j in i.getnext().attrib["class"].split():
            print j
            if j.strip()==target_class_next:
                valid_class[target_class] = target_class_next
                break
    except KeyError:
        print "Class value missing. ", i
print "Time:-", time.time() - start_time
print "Result:-", valid_class
Output:
Time:- 0.000622987747192
Result:- {'test1-image-rw': 'test1-rw', 'dropcap-image-rw': 'dropcap-rw'}
Is there a more Pythonic and optimized way to get the above result?
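
One possible direction, offered as a rough Python 3 sketch rather than a definitive answer: walk each parent's children in adjacent pairs with zip, and look up the expected twin class in a set built from the following sibling. The names pair_image_classes and SUFFIX are hypothetical, introduced only for this illustration. To keep the example runnable without third-party packages it parses the sample with the standard library's xml.etree.ElementTree instead of lxml, which works here only because the sample input happens to be well-formed XML. It also differs slightly from the loop above: it matches classes that end with -image-rw (rather than contain it) and checks every such class on a tag, not just the first.

```python
import xml.etree.ElementTree as ET

SUFFIX = "-image-rw"  # hypothetical constant name, used only in this sketch


def pair_image_classes(root):
    """Map each '*-image-rw' class to its '*-rw' twin when the twin
    appears on the immediately following sibling element."""
    pairs = {}
    for parent in root.iter():
        children = list(parent)
        # Walk adjacent siblings: (first, second), (second, third), ...
        for current, following in zip(children, children[1:]):
            next_classes = set(following.get("class", "").split())
            for name in current.get("class", "").split():
                if name.endswith(SUFFIX):
                    twin = name[: -len(SUFFIX)] + "-rw"
                    if twin in next_classes:
                        pairs[name] = twin
    return pairs


# The sample input from the question, unchanged.
content = """<div>
<p id="quarter-line-below1" class="firstpara-rw"><span class="dropcap-image-rw print-exclude-rw"><img alt="2014" src="243864_20.png"/></span><span class="dropcap-rw">2014 </span>has had some .............</p>
<p id="firstpara1" class="firstpara-rw"><span class="dropcap-image-rw print-exclude-rw"><img alt="O" src="243864_69.png"/></span><span class="dropcap-rw">O</span>f course ...........</p>
<p class="test1-image-rw print-exclude-rw" id="ornament1-orn"><img src="243865_18.png" /></p>
<p class="test1-rw" id="ornament1">aA bB cC dD eE fF gG hH iI</p>
<p class="test2-image-rw print-exclude-rw" id="ornament1-orn"><img src="243865_18.png" /></p>
<p class="test22-rw" id="ornament1">aA bB cC dD eE fF gG hH iI</p>
</div>"""

result = pair_image_classes(ET.fromstring(content))
print("Result:-", result)
```

Because the function only relies on iterating elements and calling get("class"), passing PARSER.fromstring(content) from lxml in place of ET.fromstring(content) should presumably behave the same way, though that is an assumption and was not checked here. A tag with no following sibling or no class attribute is simply skipped, so no try/except is needed.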