Match neighbor tags by class value

Time: 2015-05-25 10:38:38

Tags: python html xpath lxml

I have to create a dictionary of class values from the input HTML.

Input

  <div>
     <p id="quarter-line-below1" class="firstpara-rw"><span class="dropcap-image-rw print-exclude-rw"><img alt="2014" src="243864_20.png"/></span><span class="dropcap-rw">2014 </span>has had some .............</p>
     <p id="firstpara1" class="firstpara-rw"><span class="dropcap-image-rw print-exclude-rw"><img alt="O" src="243864_69.png"/></span><span class="dropcap-rw">O</span>f course ...........</p>
     <p class="test1-image-rw print-exclude-rw" id="ornament1-orn"><img src="243865_18.png" /></p>
     <p class="test1-rw" id="ornament1">aA bB cC dD eE fF gG hH iI</p>
     <p class="test2-image-rw print-exclude-rw" id="ornament1-orn"><img src="243865_18.png" /></p>
     <p class="test22-rw" id="ornament1">aA bB cC dD eE fF gG hH iI</p>
 </div>

My algorithm is

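  1. Parse the content with lxml.
  2. Use XPath to get all tags whose class value contains -image-rw.
  3. Iterate over every tag from step 2.
  4. Take the target class value that contains -image-rw; its counterpart is the same value with -image removed.
  5. Get the tag that follows the target tag.
  6. Check whether the counterpart value is present in the next tag's class.
  7. If so, add the pair to the dictionary.

Code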

    import time

    import lxml.html as PARSER

    # `content` is expected to hold the input HTML shown above.
    start_time = time.time()
    root = PARSER.fromstring(content)
    target_tags = root.xpath("//*[contains(@class, '-image-rw')]")
    valid_class = {}
    # Validation: compare each '-image-rw' class with the classes of the next tag.
    for i in target_tags:
        target_class = [j for j in i.attrib["class"].split() if "-image-rw" in j][0]
        target_class_next = target_class.replace("-image-rw", "-rw")
        next_tag = i.getnext()
        if next_tag is None:
            # Last child of its parent: there is no neighbor to check.
            continue
        try:
            for j in next_tag.attrib["class"].split():
                if j == target_class_next:
                    valid_class[target_class] = target_class_next
                    break
        except KeyError:
            print("Class value missing. ", i)

    print("Time:-", time.time() - start_time)
    print("Result:-", valid_class)
    

Output

    Time:- 0.000622987747192
    Result:- {'test1-image-rw': 'test1-rw', 'dropcap-image-rw': 'dropcap-rw'}

Is there a more Pythonic and optimized way to get the above result?
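
One possible direction, offered only as a sketch: rather than selecting the -image-rw tags and calling getnext() on each, walk every parent once and pair each child with the sibling right after it using zip. The function name match_neighbor_classes and the SUFFIX constant are hypothetical names introduced here. The demo parses a shortened, well-formed copy of the input with the standard library's xml.etree.ElementTree so that it runs without extra packages; the assumption is that a root returned by lxml.html.fromstring(content) can be passed in the same way, since the function only relies on iter(), child iteration, and get(). Unlike the code above, it checks every class token containing -image-rw, not just the first one.

```python
import xml.etree.ElementTree as ET

SUFFIX = "-image-rw"  # hypothetical constant name, used for illustration only


def match_neighbor_classes(root):
    """Map each '*-image-rw' class to the '*-rw' class found on the next sibling."""
    matches = {}
    for parent in root.iter():
        children = list(parent)
        # Pair every child with the sibling that immediately follows it.
        for tag, next_tag in zip(children, children[1:]):
            next_classes = set(next_tag.get("class", "").split())
            for name in tag.get("class", "").split():
                if SUFFIX not in name:
                    continue
                partner = name.replace(SUFFIX, "-rw")
                if partner in next_classes:
                    matches[name] = partner
    return matches


# Shortened copy of the input from the question (assumed to be well-formed).
sample = """<div>
  <p class="firstpara-rw"><span class="dropcap-image-rw print-exclude-rw"><img alt="O" src="243864_69.png"/></span><span class="dropcap-rw">O</span>f course</p>
  <p class="test1-image-rw print-exclude-rw"><img src="243865_18.png"/></p>
  <p class="test1-rw">aA bB cC</p>
  <p class="test2-image-rw print-exclude-rw"><img src="243865_18.png"/></p>
  <p class="test22-rw">aA bB cC</p>
</div>"""

result = match_neighbor_classes(ET.fromstring(sample))
print(result)
```

A missing class attribute is treated as an empty string, and a last child simply has no pair, so neither case needs a try/except. Whether this is actually faster than the XPath version would have to be measured on real documents.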

0 answers:

No answers