
Question

I have a very large data set (a Stack Overflow data dump) that is entirely in a raw, sanitized state.

For example:  &lt;/p&gt;

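Is there an established way to convert the above, and similar content, back to its original form so that it is easier to read and use? A Python script or function call, perhaps?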

Answer 1

This is the solution I had to use to get everything working. Note that the HTML parser did not handle everything in my data set.

    #!/usr/bin/python3

    import html
    import sys

    # Number of lines to accumulate in the buffer before writing
    BUFFER_SIZE_LINES = 1024

    # A few HTML reserved characters that were not cleaned up by the first
    # unescape pass on this data set. Named ENTITY_REPLACEMENTS so that the
    # built-in dict is not shadowed.
    ENTITY_REPLACEMENTS = {
        '&quot;': '"',
        '&apos;': "'",
        '&amp;': '&',
        '&lt;': '<',
        '&gt;': '>',
    }

    # Process the file line by line, writing the output in buffered chunks
    def ProcessLargeTextFile(fileIn, fileOut):
        with open(fileIn, "r") as r, open(fileOut, "w") as w:
            buff = []
            for lineIn in r:
                # html.unescape replaces HTMLParser.unescape, which was
                # removed in Python 3.9
                lineOut = html.unescape(lineIn)
                for key, value in ENTITY_REPLACEMENTS.items():
                    lineOut = lineOut.replace(key, value)

                # lineIn already ends with its newline, so none is appended
                buff.append(lineOut)

                if len(buff) >= BUFFER_SIZE_LINES:
                    w.writelines(buff)
                    buff = []

            # Write whatever is left in the buffer
            w.writelines(buff)


    # Now run
    if __name__ == "__main__":
        ProcessLargeTextFile(sys.argv[1], sys.argv[2])
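
One possible reason a single unescape pass leaves entities behind is that the dump was escaped more than once (for example `&amp;lt;` instead of `&lt;`). That is an assumption about the data, not something confirmed here. Under that assumption, a minimal sketch is to repeat `html.unescape` until the text stops changing. The names `unescape_fully` and `max_passes` are hypothetical and used only for illustration.

```python
import html


def unescape_fully(text, max_passes=5):
    """Apply html.unescape repeatedly until the text stops changing.

    unescape_fully and max_passes are hypothetical names for this sketch.
    max_passes bounds the number of unescape calls.
    """
    for _ in range(max_passes):
        decoded = html.unescape(text)
        if decoded == text:
            # Nothing left to decode
            break
        text = decoded
    return text


# A doubly escaped closing tag and a singly escaped one
print(unescape_fully("&amp;lt;/p&amp;gt;"))
print(unescape_fully("&lt;/p&gt;"))
```

Be aware that repeated unescaping also decodes entities that were meant to appear literally in a post (for instance, text that discusses `&lt;` itself), so it is worth checking a sample of the output before running it over the whole dump.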

What is the best way to convert sanitized data?
