I have a very large data set (a Stack Overflow data dump) that is entirely in raw, sanitized form.
For example: &lt;/p&gt;
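Is there an established way to convert the above, and similar content, back to its original form so that it is easier to read and work with? A Python script or function call, by any chance?
Answer 0 (score: 0)
Here is the solution I had to use to get everything working. Note that the HTML parser did not handle everything in my data set.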
import html
import sys

# Number of lines to put into a buffer before writing
BUFFER_SIZE_LINES = 1024

# A few HTML reserved characters that the parser did not clean up
ENTITIES = {
    '&quot;': '"',
    '&apos;': "'",
    '&amp;': '&',
    '&lt;': '<',
    '&gt;': '>',
}

# Process the file
def ProcessLargeTextFile(fileIn, fileOut):
    with open(fileIn, "r") as r, open(fileOut, "w") as w:
        buff = []
        for lineIn in r:
            # html.unescape replaces HTMLParser.unescape, which was removed in Python 3.9
            lineOut = html.unescape(lineIn)
            for key, value in ENTITIES.items():
                lineOut = lineOut.replace(key, value)
            # lineIn already ends with a newline, so none is added here
            buff.append(lineOut)
            if len(buff) >= BUFFER_SIZE_LINES:
                w.writelines(buff)
                buff = []
        # Flush whatever is left in the buffer
        w.writelines(buff)

# Now run
if __name__ == "__main__":
    ProcessLargeTextFile(sys.argv[1], sys.argv[2])
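
One possible reason a single unescape pass leaves entities behind is that parts of the data were escaped twice (for example `&amp;lt;` instead of `&lt;`). That is only an assumption about the dump, not something confirmed above. A minimal sketch of handling that case with the standard library's `html.unescape` follows; `unescape_fully` and `max_passes` are hypothetical names introduced here for illustration.

```python
import html

def unescape_fully(text, max_passes=5):
    """Apply html.unescape repeatedly until the text stops changing.

    max_passes is a hypothetical safety limit, not part of any library API.
    """
    for _ in range(max_passes):
        decoded = html.unescape(text)
        if decoded == text:
            break
        text = decoded
    return text

print(html.unescape("&lt;/p&gt;"))           # single-escaped input: </p>
print(html.unescape("&amp;lt;/p&amp;gt;"))   # double-escaped, one pass: &lt;/p&gt;
print(unescape_fully("&amp;lt;/p&amp;gt;"))  # double-escaped, repeated passes: </p>
```

Repeated decoding can over-decode text that is meant to contain a literal entity (for example a post that discusses `&lt;` itself), so it is probably worth checking a sample of the output before running it over the whole dump.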