How can I sanitize a block of text in Python 3 without external modules?

Time: 2018-10-16 19:54:22

Tags: python python-3.x sanitization input-sanitization

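I was recently set a hacking challenge, and I could not properly strip the tags from a block of text in Python 3 without breaking the text.

Two example inputs were provided (below), and the challenge was to sanitize them into safe, plain blocks of text. The time for completing the challenge has passed, but I am puzzled at how I got something so simple so wrong. Any help on how I should have done it would be appreciated.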

Test input 1

It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. <script>
var y=window.prompt("Hello")
window.alert(y)
</script>Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage.

Test input 2

In-text references or citations are used to acknowledge the work or ideas of others. They are placed next to the text that you have paraphrased or quoted, enabling the reader to differentiate between your writing and other people’s work.  The full details of your in-text references, <script language="JavaScript">
document.write("Page. Last update:" + document.lastModified); </script>When quoting directly from the source include the page number if available and place quotation marks around the quote, e.g. 
The World Health Organisation defines driver distraction ‘as when some kind of triggering event external to the driver results in the driver shifting attention away from the driving task’.

Suggested output for test 1

It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage.

Suggested output for test 2

  In-text references or citations are used to acknowledge the work or ideas of others. They are placed next to the text that you have paraphrased or quoted, enabling the reader to differentiate between your writing and other people’s work. The full details of your in-text references, When quoting directly from the source include the page number if available and place quotation marks around the quote, e.g. The World Health Organisation defines driver distraction ‘as when some kind of triggering event external to the driver results in the driver shifting attention away from the driving task’.

Thanks!

Edit (using @YakovDan's sanitization method): Code:

def sanitize(inp_str):
    ignore_flag = False
    close_tag_count = 0

    out_str = ""
    for c in inp_str:
        if not ignore_flag:
            if c == '<':
                close_tag_count = 2
                ignore_flag = True
            else:
                out_str += c
        else:
            if c == '>':
                close_tag_count -= 1

            if close_tag_count == 0:
                ignore_flag = False

    return out_str

inp = input()
print(sanitize(inp))

Input:

It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. <script>
 var y=window.prompt("Hello")
 window.alert(y)
 </script>Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage.

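Output:

It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy.

The output should be:

It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage.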

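2 answers:

Answer 0: (score: 0)

In general, regex is the wrong tool for parsing HTML tags (see here), but it will work for this job because the tags are simple. If the input is irregular (tags without closing tags, and so on), it will fail.

That said, for these two examples you can use this regex: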

<.*?>.*?<\s*?\/.*?>

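Implemented in Python:

import re
# Placeholder string: substitute one of the long test inputs here.
s = "some text <script>alert(1)</script>more text"
# A raw string keeps the backslashes intact; re.DOTALL lets '.' match newlines,
# so script blocks that span several lines are removed as well.
r = re.sub(r'<.*?>.*?<\s*?\/.*?>', '', s, flags=re.DOTALL)
print(r)

With either test input as s, this gives the expected result (too long to copy here!).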
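As a minimal sketch of the same idea, the substitution can be wrapped in a function and checked against a short string. The function name `strip_tag_pairs` and the sample text are assumptions made for illustration and are not part of the answer above; the pattern is the one already shown.

```python
import re

# Same pattern as above, compiled once. DOTALL lets '.' match newlines,
# so a tag pair that spans several lines is removed as a unit.
TAG_PAIR = re.compile(r'<.*?>.*?<\s*?\/.*?>', flags=re.DOTALL)


def strip_tag_pairs(text):
    """Remove each opening tag, its content, and the next closing tag."""
    return TAG_PAIR.sub('', text)


# Hypothetical sample input, much shorter than the test inputs.
sample = 'before <script>\nvar y = 1\n</script>after'
print(strip_tag_pairs(sample))
```

Note that a tag with no closing tag anywhere after it, such as a lone `<br>`, does not match the pattern and is left in place, which is the failure mode the answer warns about.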

Answer 1: (score: 0)

Here is one way to do it without regex.

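def sanitize(inp_str):
    ignore_flag = False   # True while skipping a tag pair and its content
    close_tag_count = 0   # number of '>' still expected before copying resumes

    out_str = ""
    for c in inp_str:
        if not ignore_flag:
            if c == '<':
                # Start skipping: expect one '>' for the opening tag
                # and one for the closing tag.
                close_tag_count = 2
                ignore_flag = True
            else:
                out_str += c
        else:
            if c == '>':
                close_tag_count -= 1

            if close_tag_count == 0:
                ignore_flag = False

    return out_str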

This should do it (as long as the assumptions about the tags hold).
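One likely reason the attempt in the question's edit stops after the first paragraph is that `input()` returns a single line, so the function never sees anything past the line ending in `<script>`. A minimal sketch, assuming the text arrives on a file-like stream such as standard input, reads everything at once instead. The name `read_and_sanitize` and its `stream` parameter are made up for this illustration, and `sanitize` here is a compact restatement of the function above under the same assumption that every opening tag is followed by exactly one closing tag.

```python
import io


def sanitize(inp_str):
    """Drop everything from a '<' up to the second '>' that follows it."""
    out = []
    remaining = 0  # number of '>' still to see before copying resumes
    for c in inp_str:
        if remaining == 0:
            if c == '<':
                remaining = 2  # one '>' for the opening tag, one for the closing tag
            else:
                out.append(c)
        elif c == '>':
            remaining -= 1
    return ''.join(out)


def read_and_sanitize(stream):
    """Read the whole stream, newlines included, then sanitize it."""
    return sanitize(stream.read())


# Simulated multi-line input; pass sys.stdin (after `import sys`) to read real input.
demo = io.StringIO('start <script>\nvar y = 1\n</script>end\n')
print(read_and_sanitize(demo))
```

Feeding only the first line, as `input()` would, leaves the function waiting for closing characters that never arrive, so the output ends where the tag begins.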