I have this source string:
string ="""
html,,
head,, profile http://gmpg.org/xfn/11 ,,
lang en-US ,,
title,, Some markright page.
,,title
,,head
"""
...which has to be parsed into HTML:
<html>
<head profile="http://gmpg.org/xfn/11" lang="en-US">
<title>Some markright page</title>
</head>
I would like to parse it in a single re.findall pass, like this:
import re

tagList = re.findall(
    r'\s*([A-Z]?[a-z]+[0-9]?,,){1}'  # Opening tag - has to be one
    r'(.* ,,)*'                      # Attributes - could be more than one
    r'(.*)?'                         # Content - could be one
    r'(\s+,,[a-z]+[0-9]?)?'          # Ending tag - could be one
    , string )#, flags=re.S )  # can't make any use of DOTALL flag
for t in tagList:
    # Print each captured group of the match, numbered from 1
    for n, s in enumerate(t, 1):
        print("String group No:" + str(n) + " -> ", s.strip())
    print("_" * 10)
...but I only get:
String group No:1 -> html,,
String group No:2 ->
String group No:3 ->
String group No:4 ->
__________
String group No:1 -> head,,
String group No:2 -> profile http://gmpg.org/xfn/11 ,,
String group No:3 ->
String group No:4 ->
__________
String group No:1 -> title,,
String group No:2 ->
String group No:3 -> Some markright page.
String group No:4 -> ,,title
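Please keep in mind that I have to write my own parser, and the problem described above is only one piece of this markup superset, so please help if you can and want to. Thanks.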
Answer 0 (score: 1)
This is how I would do it:
#!/usr/bin/python
import re

# One alternative per token kind; re.X lets the pattern span several lines.
pat = re.compile(r'''
    (?P<open> \b [^\W_]+ ) ,, |
    ,, (?P<close> [^\W_]+ ) \b |
    (?P<attrName> \S+ ) [ ] (?P<attrValue> [^,\n]+ ) [ ] ,, |
    (?P<textContent> [^,\s] (?: [^,] | , (?!,) )*? ) \s* (?=[^\W_]*,,)''',
    re.X)

txt = '''html,,
head,, profile http://gmpg.org/xfn/11 ,,
lang en-US ,,
title,, Some markright page.
,,title
,,head'''

result = ''
opened = False  # True while an opening tag still lacks its closing '>'
for m in pat.finditer(txt):
    if m.group('attrName'):
        # Attributes are appended inside the still-open tag
        result += ' ' + m.group('attrName') + '="' + m.group('attrValue') + '"'
    else:
        if opened:
            # Any other token ends the attribute list of the pending tag
            opened = False
            result += '>'
        if m.group('open'):
            result += '<' + m.group('open')
            opened = True
        elif m.group('close'):
            result += '</' + m.group('close') + '>'
        else:
            result += m.group('textContent')
print(result)
Note: I am assuming that text content is always enclosed between tags.
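
For reuse, the same loop could be wrapped in a function. The following is only a minimal Python 3 sketch of that idea: the names `TOKEN` and `markright_to_html` are hypothetical (they do not appear in the answer), and the pattern is copied unchanged from the code above.

```python
import re

# Same token pattern as in the answer; TOKEN is a hypothetical name.
TOKEN = re.compile(r'''
    (?P<open> \b [^\W_]+ ) ,, |
    ,, (?P<close> [^\W_]+ ) \b |
    (?P<attrName> \S+ ) [ ] (?P<attrValue> [^,\n]+ ) [ ] ,, |
    (?P<textContent> [^,\s] (?: [^,] | , (?!,) )*? ) \s* (?=[^\W_]*,,)''',
    re.X)


def markright_to_html(source):
    """Hypothetical helper: convert markright text to an HTML string."""
    parts = []
    opened = False  # True while an opening tag still lacks its '>'
    for m in TOKEN.finditer(source):
        if m.group('attrName'):
            # Attribute tokens go inside the pending opening tag
            parts.append(' %s="%s"' % (m.group('attrName'), m.group('attrValue')))
            continue
        if opened:
            # Any non-attribute token closes the pending opening tag
            parts.append('>')
            opened = False
        if m.group('open'):
            parts.append('<' + m.group('open'))
            opened = True
        elif m.group('close'):
            parts.append('</' + m.group('close') + '>')
        else:
            parts.append(m.group('textContent'))
    return ''.join(parts)


sample = 'title,, Some markright page.\n,,title'
print(markright_to_html(sample))  # <title>Some markright page.</title>
```

The assumption from the note still applies here: text that is not followed by a `,,` token (for instance the trailing `hello` in `p,, hello`) matches none of the alternatives, so it is silently skipped and the pending tag is left without its `>`.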