Strange symbols/encoding appearing in text files output from Python

Date: 2014-01-08 21:42:11

Tags: python parsing encoding

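I'm running into a frustrating problem when writing text files from Python. The files look perfectly normal when opened in a text editor, but I upload them to QDA Miner, a data analysis suite, and once they are uploaded this is what the text looks like: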

. 

"This problem really needs to be focused in a way that is particular to its cultural dynamics and tending in the industry,"

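As you can see, many of these strange (“) symbols appear throughout the text. The text my Python script originally parsed was an RTF file, which I converted to plain text using OS X's built-in text editor.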

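Is there an easy way to remove these symbols? I'm parsing single 100+ MB text files and splitting them into thousands of individual articles, so I need a way to convert them in bulk; otherwise it would be nearly impossible. I should also mention that the text in these files was originally copied from web pages.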

Here is some of the relevant code from the script I wrote:

test1 = filedialog.askopenfile()
newFolder = ((str(test1)[25:])[:-32])
folderCreate(newFolder)
masterFileName = newFolder+"/"+"MASTER_FILE"
masterOutput = open(masterFileName,"w")
edit = test1.readlines()
for i,line in enumerate(edit):
    for j in line.split():
        if j in ["Author","Author:"]:
            try:
                outputFileName = "-".join(edit[i-2].lower().title().split())+".txt"
                output = open(newFolder+"/"+outputFileName,"w") # create a file named after the article; backslash changed to forward slash for Windows
                print("File created - ","-".join(edit[i-2].lower().title().split()))
                counter2 = counter2+1
            except:
                print("Filename error.")
                counter = counter+1
                pass


            #Count number of words in each article
            wordCount = 0
            for word in edit[i+1].split():
                wordCount+=1
            fileList.append((outputFileName,str(wordCount)))

            #Now write to file
            output.write(edit[i-2])
            output.write("\n")
            author = line
            output.write(author) # write article author
            output.write("\n")
            output.write("\n")
            content = edit[i+1]
            output.write(content) # write article content
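
One possible approach to the bulk conversion asked about above, sketched under assumptions rather than confirmed against QDA Miner: if the strange symbols come from typographic punctuation (curly quotes, dashes, ellipses) copied from web pages, they can be replaced with plain ASCII equivalents while each file is read and written with an explicit encoding. The names `PUNCTUATION_MAP`, `to_ascii_punctuation`, and `convert_file` are hypothetical, and the mapping covers only a few common characters.

```python
# Assumption: the problem characters are typographic punctuation.
# Extend this mapping if other characters show up in the output.
PUNCTUATION_MAP = str.maketrans({
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # horizontal ellipsis
    "\u00a0": " ",    # no-break space
})


def to_ascii_punctuation(text):
    """Replace common typographic punctuation with ASCII equivalents."""
    return text.translate(PUNCTUATION_MAP)


def convert_file(src_path, dst_path, src_encoding="utf-8"):
    """Copy src_path to dst_path line by line, normalizing punctuation.

    Both files are opened with an explicit encoding instead of the
    platform default, and the source is streamed so a 100+ MB file
    does not have to be held in memory at once.
    """
    with open(src_path, "r", encoding=src_encoding) as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(to_ascii_punctuation(line))
```

The same `to_ascii_punctuation` call could be applied to each string before `output.write(...)` in the script above. If the plain-text export was not saved as UTF-8, a different `src_encoding` (for example `"mac_roman"`) may be needed; which encoding the editor actually used is a guess here and should be checked on a sample file.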

Thanks

0 answers:

No answers