Question

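I am using Python to convert a Word file into a text string. In the resulting string, the bullet points from the Word file show up as the character . How can I remove it in Python so that I am left with a text string without these boxes ()?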

# -*- coding: utf-8 -*-
from docx import Document

document = Document(file_to_read)

text_string = ''
for paragraph in document.paragraphs:
    text_string += paragraph.text + "\n"

print text_string

The output looks like this:

 Computer Science fundamentals in data structures.

 Computer Science fundamentals in algorithm design, problem solving, and complexity analysis

Answer 1

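Your attempt does not try to remove the character at all. You can use the replace method to replace characters in a string, and you can delete a character by replacing it with the empty string.

The only difficulty is representing 0xF0B7 correctly in source code, and the right way depends on whether paragraph.text holds plain byte strings or unicode strings (I would suggest using python3 to avoid unicode problems). I am assuming they are unicode strings, in which case you write the code point as u"\uF0B7" (if they are byte strings, the representation depends on the encoding).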
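As a quick check of that notation, here is a minimal sketch in Python 3 syntax. The sample string is made up for illustration, and the extra strip() call is an addition that also removes the space left behind by the bullet.

```python
# U+F0B7 is the code point named above; it lies in the Unicode private-use area.
bullet = "\uf0b7"             # escape form, no need to paste the raw character
assert bullet == chr(0xF0B7)  # the same character built from its integer value

# Hypothetical sample line shaped like the output in the question.
sample = bullet + " Computer Science fundamentals in data structures."

# Replacing the bullet with the empty string deletes it.
cleaned = sample.replace(bullet, "").strip()
print(cleaned)
```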

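Apart from that, your code has another issue: the way you build text_string is probably not ideal. An alternative way to build a string from pieces is to put the pieces in a list and then join them with "".join(l).

Putting this together (assuming paragraph.text is a unicode string):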

from docx import Document

document = Document(file_to_read)

# Paragraph objects have no replace(); call it on their .text attribute.
text_string = u"\n".join([p.text.replace(u"\uF0B7", u"")
                          for p in document.paragraphs])

print(text_string)

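If you use python3, you can drop the u prefixes in front of the strings, since all strings in python3 are unicode (python 3.3 and later still accept the prefix). Also note that when you print, you must make sure the output encoding supports every character in the document (which may be the reason you wanted to remove the bullets in the first place).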
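For Python 3, the same idea might look like the sketch below. The helper name join_without_bullets is hypothetical, and it takes plain strings so that it runs without a .docx file; with python-docx you would presumably pass (p.text for p in document.paragraphs). The strip() call and the ASCII-safe print are additions, not part of the code above.

```python
BULLET = "\uf0b7"  # private-use code point reported in the question


def join_without_bullets(paragraph_texts):
    """Join paragraph texts with newlines, dropping the bullet character."""
    return "\n".join(t.replace(BULLET, "").strip() for t in paragraph_texts)


# Made-up sample data standing in for the paragraph texts of a document.
texts = [
    "\uf0b7 Computer Science fundamentals in data structures.",
    "Plain paragraph.",
]
text_string = join_without_bullets(texts)

# Escape anything non-ASCII so printing works even on a limited console encoding.
print(text_string.encode("ascii", errors="backslashreplace").decode("ascii"))
```

Building the pieces with a generator and a single join avoids the repeated string concatenation from the question.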

Answer 2

If you only want English (ASCII) characters, you can do this:

text_string = text_string.encode('ascii', errors='ignore').decode('ascii')  # drops every non-ASCII character

I think the best solution, though, is to determine exactly which bytes are causing the problem and replace those.
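One way to find the offending characters is sketched below, assuming Python 3, where the string holds characters rather than bytes. The function name non_ascii_chars and the sample string are hypothetical.

```python
def non_ascii_chars(text):
    """Return sorted 'U+XXXX' labels for each distinct non-ASCII character."""
    return sorted({"U+%04X" % ord(ch) for ch in text if ord(ch) > 127})


# Made-up sample containing the bullet and one legitimate accented letter.
sample = "\uf0b7 Computer Science fundamentals in caf\u00e9 design"
found = non_ascii_chars(sample)
print(found)
```

Once the list is known, you can remove only the unwanted code points with replace; the ASCII-only filter would also drop characters such as the é in this sample.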

The line # -*- coding: utf-8 -*- specifies the encoding of the source file, not the encoding of your strings.
