Some problems with Python regular expressions

Date: 2014-08-15 13:45:59

Tags: python regex string

Given this source string:

string ="""
html,, 
    head,,        profile http://gmpg.org/xfn/11 ,,
                  lang en-US ,,

        title,,   Some markright page.
        ,,title
    ,,head
"""

...it has to be parsed into HTML:

<html>
<head profile="http://gmpg.org/xfn/11" lang="en-US">
<title>Some markright page</title>
</head>

I would like to parse it in a single re.findall pass, like this:

tagList = re.findall( 
    r'\s*([A-Z]?[a-z]+[0-9]?,,){1}'   # Opening tag - has to be one
    r'(.* ,,)*'                       # Attributes - could be more than one
    r'(.*)?'                          # Content - could be one
    r'(\s+,,[a-z]+[0-9]?)?'           # Ending tag - could be one
    , string )#, flags=re.S )  # can't make any use of DOTALL flag   

# Print every captured group of every match, numbered from 1.
for t in tagList:
    for n, s in enumerate(t, 1):
        print("String group No:%d ->  %s" % (n, s.strip()))
    print("_" * 10)

...but I only get:

String group No:1 ->  html,,
String group No:2 ->  
String group No:3 ->  
String group No:4 ->  
__________
String group No:1 ->  head,,
String group No:2 ->  profile http://gmpg.org/xfn/11 ,,
String group No:3 ->  
String group No:4 ->  
__________
String group No:1 ->  title,,
String group No:2 ->  
String group No:3 ->  Some markright page.
String group No:4 ->  ,,title

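Please keep in mind that I am building my own parser, and the problem described above is only one feature of this markup superset, so please help if you can and want to. Thanks.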

1 Answer:

Answer 0: (score: 1)

This is how I would do it:

#!/usr/bin/python
import re

pat = re.compile(r'''
    (?P<open> \b [^\W_]+ ) ,, |
    ,, (?P<close> [^\W_]+ ) \b |
    (?P<attrName> \S+ ) [ ] (?P<attrValue> [^,\n]+ ) [ ] ,, |
    (?P<textContent> [^,\s] (?: [^,] | , (?!,) )*? ) \s* (?=[^\W_]*,,)''',
      re.X)

txt = '''html,, 
    head,,        profile http://gmpg.org/xfn/11 ,,
                  lang en-US ,,

        title,,   Some markright page.
        ,,title
    ,,head'''

result = ''
opened = False
for m in pat.finditer(txt):
    if m.group('attrName'):
        result += ' ' + m.group('attrName') + '="' + m.group('attrValue') + '"'
    else:
        if opened:
            opened = False
            result += '>'
        if m.group('open'):
            result += '<' + m.group('open')
            opened = True
        elif m.group('close'):
            result += '</' + m.group('close') + '>'
        else:
            result += m.group('textContent')
print(result)  # parenthesized form works on both Python 2 and Python 3
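
For comparison, here is a small sketch of why a single re.findall call with a repeated group such as `(.* ,,)*` is unlikely to return every attribute. Two things seem to be at play: without re.S the dot stops at a newline, so the repetition cannot continue onto the next line, and even when a group does repeat, only its last repetition is reported. The one-line attribute string and the variable names below are made up for illustration and are not part of the question.

```python
import re

# Hypothetical one-line attribute list, used only for this demonstration.
attrs = 'profile http://gmpg.org/xfn/11 ,, lang en-US ,,'

# Without re.S, '.' does not match a newline, so '.*' stops at the line end.
first_line = re.match(r'.*', 'a\nb').group()

# A quantified capturing group keeps only its final repetition.
last_only = re.match(r'(\S+ \S+ ,, ?)*', attrs).group(1)

# Matching one attribute at a time and collecting the matches keeps them all.
pairs = re.findall(r'(\S+) (\S+) ,,', attrs)

print(first_line)
print(last_only)
print(pairs)
```

This is the reason the approach above walks through the input with finditer and handles one token per match instead of trying to capture a whole tag in one go.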

Note: I assume that text content is always enclosed between tags.
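
To make that assumption concrete, here is a rough sketch that lists the tokens the pattern produces for two small inputs. The helper name `tokens` and both sample strings are my own and only serve as an illustration. The first input has text after the last closing tag, which the pattern skips because the text alternative requires a following `,,`. The second shows a related edge case: text that sits on the same line as its closing tag and looks like `word value ,,` is picked up by the attribute alternative instead.

```python
import re

# Same pattern as in the answer above, repeated so the snippet runs on its own.
pat = re.compile(r'''
    (?P<open> \b [^\W_]+ ) ,, |
    ,, (?P<close> [^\W_]+ ) \b |
    (?P<attrName> \S+ ) [ ] (?P<attrValue> [^,\n]+ ) [ ] ,, |
    (?P<textContent> [^,\s] (?: [^,] | , (?!,) )*? ) \s* (?=[^\W_]*,,)''',
    re.X)

def tokens(txt):
    """Hypothetical helper: list what each match of the pattern captured."""
    out = []
    for m in pat.finditer(txt):
        if m.group('attrName'):
            out.append(('attr', m.group('attrName'), m.group('attrValue')))
        else:
            # For the other alternatives exactly one named group matched.
            out.append((m.lastgroup, m.group(m.lastgroup)))
    return out

# Text after the last closing tag is not followed by ',,' and is skipped.
trailing = tokens('p,, hello ,,p trailing text')

# Text on the same line as its closing tag is read as an attribute.
same_line = tokens('title,, Some page ,,title')

print(trailing)
print(same_line)
```

So the pattern is best suited to input laid out like the sample, with each closing tag on its own line and no stray text outside the tags.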