With pyparsing code like the following I can achieve the opposite, that is, remove the script elements:
from pyparsing import Suppress, replaceWith, makeHTMLTags, SkipTo
#...
removeText = replaceWith("")
scriptOpen, scriptClose = makeHTMLTags("script")
scriptBody = scriptOpen + SkipTo(scriptClose) + scriptClose
scriptBody.setParseAction(removeText)
data = (scriptBody).transformString(data)
How do I keep the contents of the "table" tag instead?
Update 0:
I tried this:
# keep only the tables
tableOpen, tableClose = makeHTMLTags("table")
tableBody = tableOpen + SkipTo(tableClose) + tableClose
f = replaceWith(tableBody)
tableBody.setParseAction(f)
data = (tableBody).transformString(data)
print(data)
and I get something like this...
garbages
<input type="hidden" name="cassstx" value="en_US:frontend"></form></td></tr></table></span></td></tr></table>
{<"table"> SkipTo:(</"table">) </"table">}
<div id="asbnav" style="padding-bottom: 10px;">{<"table"> SkipTo:(</"table">) </"table">}
</div>
even more garbages
Update 2:
Thanks, Martelli. What I needed is:
from pyparsing import Suppress, replaceWith, makeHTMLTags, SkipTo
#...
data = 'before<script>ciao<table>buh</table>bye</script>after'
tableOpen, tableClose = makeHTMLTags("table")
tableBody = tableOpen + SkipTo(tableClose) + tableClose
thetable = (tableBody).searchString(data)[0][2]
print(thetable)
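Answer 0 (score: 1)
You can first extract the table (much as you now extract the script, but without the removal, of course ;-), obtaining a thetable string; then you extract the script and use replaceWith(thetable) instead of replaceWith(''). Alternatively, you could prepare a more elaborate parse action, but the simple two-phase approach looks more straightforward to me. For example (preserving specifically the contents of the table, not the table tags):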
from pyparsing import Suppress, replaceWith, makeHTMLTags, SkipTo
#...
data = 'before<script>ciao<table>buh</table>bye</script>after'
tableOpen, tableClose = makeHTMLTags("table")
tableBody = tableOpen + SkipTo(tableClose) + tableClose
thetable = (tableBody).searchString(data)[0][2]
removeText = replaceWith(thetable)
scriptOpen, scriptClose = makeHTMLTags("script")
scriptBody = scriptOpen + SkipTo(scriptClose) + scriptClose
scriptBody.setParseAction(removeText)
data = (scriptBody).transformString(data)
print(data)
This prints beforebuhafter (what is outside the script tag, with the contents of the table tag sandwiched in between), hopefully "as desired".