Question

I made a mysqldump of a large database, about 300MB. It went wrong, though: it did not escape any of the quotes contained in <o:p>...</o:p> tags. Here is a sample:

...Text here\' escaped correctly, <o:p> But text in here isn't. </o:p> Out here all\'s well again...

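Is it possible to write a script (preferably in Python, but I'll take anything!) that can scan for and fix these errors automatically? There are quite a lot of them, and Notepad++ can't handle a file of that size very well...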

Answer 1

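If the "lines" your file is divided into are of reasonable length, and it contains no binary sequences that "reading as text" would break, you can use fileinput's handy "make believe I'm rewriting a file in place" feature: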

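   import re
   import sys
   import fileinput

   # Non-greedy match of one <o:p>...</o:p> section within a single line.
   tagre = re.compile(r"<o:p>.*?</o:p>")

   def sub(mo):
     # Escape every single quote inside the matched section.
     return mo.group().replace("'", r"\'")

   # With inplace=True, whatever is written to stdout replaces the file's contents.
   for line in fileinput.input('thefilename', inplace=True):
     sys.stdout.write(tagre.sub(sub, line))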
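
To make the effect of the substitution concrete, here is a small sketch that applies the same pattern and callback to the sample line from the question. It is written for Python 3; the variable names `sample` and `fixed` are only illustrative.

```python
import re

# Same pattern and callback as above.
tagre = re.compile(r"<o:p>.*?</o:p>")

def sub(mo):
    # Escape every single quote inside the matched <o:p>...</o:p> section.
    return mo.group().replace("'", r"\'")

# The sample line from the question (raw string, so \' stays backslash + quote).
sample = r"Text here\' escaped correctly, <o:p> But text in here isn't. </o:p> Out here all\'s well again"

fixed = tagre.sub(sub, sample)
print(fixed)
# Text here\' escaped correctly, <o:p> But text in here isn\'t. </o:p> Out here all\'s well again
```

Only the quote inside the tag pair changes; the already escaped quotes outside it are left alone.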

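If those assumptions do not hold, you have to simulate the "rewriting in place" yourself, for example (oversimplified...). Note that the file is opened in binary mode here, so under Python 3 tagre and sub would have to work on bytes rather than str: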

   with open('thefilename', 'rb') as inf:
     with open('fixed', 'wb') as ouf:
       while True:
         b = inf.read(1024*1024)
         if not b: break
         ouf.write(tagre.sub(sub, b))

Then move 'fixed' to take the place of 'thefilename' (either in code or manually), if you need that filename to remain after the fix.
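
If you want to do that move in code, one possible sketch follows. It assumes Python 3.3 or later (for `os.replace`); the helper name `replace_original` and the `.bak` suffix are made up for illustration. Keeping a backup of the original dump is a precaution, not something the fix itself requires.

```python
import os

def replace_original(fixed_path, original_path, backup_suffix=".bak"):
    """Keep a backup of the original, then move the fixed file into its place."""
    backup_path = original_path + backup_suffix
    # Rename the defective original out of the way first...
    os.replace(original_path, backup_path)
    # ...then give the repaired file the original name.
    os.replace(fixed_path, original_path)
    return backup_path
```

A call such as `replace_original('fixed', 'thefilename')` would leave the repaired data in 'thefilename' and the old data in 'thefilename.bak'.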

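This is oversimplified because one of the crucial <o:p> ... </o:p> sections might end up split between two consecutive megabyte "blocks" and therefore not be identified. (In the first example I assume that each such section is always fully contained within a "line"; if that is not the case, you should not use that code but the following one anyway.) Fixing this requires more complicated code...: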

   with open('thefilename', 'rb') as inf:
     with open('fixed', 'wb') as ouf:
       while True:
         b = getblock(inf)
         if not b: break
         ouf.write(tagre.sub(sub, b))

with, e.g.:

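   # Prefixes of an opening tag that a block could end with if the tag was cut in two.
   partsofastartag = b'<', b'<o', b'<o:', b'<o:p'
   def getblock(inf):
     b = b''
     while True:
       newb = inf.read(1024 * 1024)
       if not newb: return b          # end of file: return whatever is left
       b += newb
       # Block ends in the middle of an opening tag: read more.
       if any(b.endswith(p) for p in partsofastartag):
         continue
       # Some section is still open (or its closing tag was cut in two): read more.
       if b.count(b'<o:p>') != b.count(b'</o:p>'):
         continue
       return b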

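As you can see, this is pretty delicate code, so, untested as it is, I can't know that it is correct for your problem. In particular, can there be cases of an <o:p> that is not matched by a closing </o:p>, or vice versa? If so, a call to getblock could end up returning the whole file in quite a costly way, and even the RE matching and substitution might backfire (the latter would also happen if some of the single quotes within such tags are already properly escaped, but not all of them).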
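
If it turns out that some quotes inside the tags are already escaped, one way to avoid escaping them twice is to touch only quotes that are not preceded by a backslash. The following is a rough Python 3 sketch working on bytes, not a tested fix for your data. The names `TAG_RE`, `UNESCAPED_QUOTE_RE` and `fix` are invented here, and `re.DOTALL` is an addition so that sections spanning newlines also match. It assumes a quote preceded by a backslash is already escaped, which is wrong for a quote that follows an escaped backslash (`\\'`).

```python
import re

# One <o:p>...</o:p> section; DOTALL lets '.' match newlines inside a block.
TAG_RE = re.compile(rb"<o:p>.*?</o:p>", re.DOTALL)

# A single quote that is NOT immediately preceded by a backslash.
UNESCAPED_QUOTE_RE = re.compile(rb"(?<!\\)'")

def escape_unescaped_quotes(match):
    # Replace each unescaped quote in the section with backslash + quote.
    return UNESCAPED_QUOTE_RE.sub(lambda m: b"\\'", match.group())

def fix(data):
    """Escape unescaped single quotes inside every <o:p>...</o:p> section."""
    return TAG_RE.sub(escape_unescaped_quotes, data)
```

Because quotes that are already escaped are skipped, running `fix` twice over the same data gives the same result as running it once.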

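If you have at least a GB or so of memory available, avoiding at least the delicate issue of block splitting is feasible, since everything should fit in memory, which makes the code much simpler: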

   with open('thefilename', 'rb') as inf:
     with open('fixed', 'wb') as ouf:
       b = inf.read()
       ouf.write(tagre.sub(sub, b))

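However, the other issues mentioned above (possibly unbalanced opening/closing tags, etc.) may remain. Only you can study your existing defective data and see whether it lends itself to such a reasonably simple approach to fixing!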
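
As a starting point for studying the data, you could count the opening and closing tags without loading the whole file. The sketch below (Python 3) is one way to do that. `count_in_stream` is a hypothetical helper, and it is only correct for needles that cannot overlap themselves, which holds for `<o:p>` and `</o:p>`.

```python
import io

def count_in_stream(stream, needle, chunk_size=1024 * 1024):
    """Count occurrences of `needle` (bytes) in a binary stream, chunk by chunk.

    The last len(needle) - 1 bytes of each step are carried over, so a needle
    cut in two by a chunk boundary is still found. Assumes the needle cannot
    overlap itself.
    """
    total = 0
    tail = b""
    keep = len(needle) - 1
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return total
        data = tail + chunk
        total += data.count(needle)
        tail = data[-keep:] if keep else b""

# Tiny demonstration with an in-memory "file" and a deliberately small chunk size.
demo = b"a<o:p>b</o:p>c<o:p>d"
opens = count_in_stream(io.BytesIO(demo), b"<o:p>", chunk_size=4)
closes = count_in_stream(io.BytesIO(demo), b"</o:p>", chunk_size=4)
print(opens, closes)  # 2 1
```

On the real dump you would open 'thefilename' in 'rb' mode once per tag. If the two counts differ, the tags are unbalanced and the simple approaches above need more care.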

Escaping quotes contained within certain HTML tags
