Question

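I want to convert my input data to CSV so that I can use WEKA for a data mining process. I don't know why some fields go missing in my program's output. I think the problem is in the second half of the program: when I write the output to the new file, some separators are lost, and WEKA can't process the file.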

Here are the code and the input file.

Python:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

f = open('datos_terr.csv', 'rb')
fout = open('salida.csv', 'w')
lines = f.readlines()
first = lines[0].strip("\r\n")
fout.write(lines[0] + "\n")
for line in lines[1:]:
    """
    Tab characters separate the values. I insert NULL between consecutive tabs for unknown fields, then split on tabs.
    I wrap strings in double quotes so that WEKA accepts them and add the separator. I strip the trailing tabs,
    replace the remaining ones with commas, write the line to the output file, and close both files.
    """
    line = line.strip("\r\n")
    line = line.replace("'", "")
    line = line.replace("\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t", "\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\t")
    line = line.replace("\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t", "\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\t")
    line = line.replace("\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t", "\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\t")
    line = line.replace("\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t", "\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\t")
    line = line.replace("\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t", "\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\t")
    line = line.replace("\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t", "\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\t")
    line = line.replace("\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t", "\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\t")
    line = line.replace("\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t", "\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\t")
    line = line.replace("\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t", "\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\t")
    line = line.replace("\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t", "\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\t")
    line = line.replace("\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t", "\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\t")
    line = line.replace("\t\t\t\t\t\t\t\t\t\t\t\t\t\t", "\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\t")
    line = line.replace("\t\t\t\t\t\t\t\t\t\t\t\t\t", "\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\t")
    line = line.replace("\t\t\t\t\t\t\t\t\t\t\t\t", "\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\t")
    line = line.replace("\t\t\t\t\t\t\t\t\t\t\t", "\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\t")
    line = line.replace("\t\t\t\t\t\t\t\t\t\t", "\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\t")
    line = line.replace("\t\t\t\t\t\t\t\t\t", "\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\t")
    line = line.replace("\t\t\t\t\t\t\t\t", "\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\t")
    line = line.replace("\t\t\t\t\t\t\t", "\tNULL\tNULL\tNULL\tNULL\tNULL\tNULL\t")
    line = line.replace("\t\t\t\t\t\t", "\tNULL\tNULL\tNULL\tNULL\tNULL\t")
    line = line.replace("\t\t\t\t\t", "\tNULL\tNULL\tNULL\tNULL\t")
    line = line.replace("\t\t\t\t", "\tNULL\tNULL\tNULL\t")
    line = line.replace("\t\t\t", "\tNULL\tNULL\t")
    line = line.replace("\t\t", "\tNULL\t")
    new_line = ""
    data = line.split("\t")
    for word in data:
        word = word.strip(" ")
        word = word.replace(" ", "")
        if word.isspace():
            word = "NULL"
        if "," in word:
            new_line += '"' + word + '"'
        else:
            if not word.isdigit() and not word == "NULL" and not isinstance(word, float) and not word == "":
                new_line += '"' + word + '"\t'
            else:
                new_line += word + "\t"
    new_line = new_line.strip('\t')
    new_line = new_line.replace("\t", ",")
    fout.write(new_line + "\n")
f.close()
fout.close()

The input file can be viewed at this URL:

https://drive.google.com/file/d/0B9PJivXVcFu8c3FLYmFpX0RaVnM/view?usp=sharing

Answer 1

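I would use the csv module to get each line as a list of fields and work with those. Cleaner code usually makes bugs easier to find. You can do the same thing without the csv module, but the module already speaks several different formats; for example, it automatically quotes fields that contain the delimiter, so you don't need the if "," in word: check. You can also look through the documentation to see whether a simple option handles any of your other checks for you: https://docs.python.org/2/library/csv.html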
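To illustrate the automatic quoting, here is a minimal sketch. It assumes Python 3 and writes to an in-memory buffer instead of a file; the helper name to_csv_line and the sample values are made up for the example and are not taken from the input file.

```python
import csv
import io


def to_csv_line(fields):
    """Serialize one row with csv.writer and return it as a string."""
    buf = io.StringIO()
    # lineterminator is set explicitly so the result ends in a plain "\n"
    writer = csv.writer(buf, delimiter=',', lineterminator='\n')
    writer.writerow(fields)
    return buf.getvalue()


# Hypothetical sample row: only the field containing a comma gets quoted.
line = to_csv_line(['12', 'Madrid,Spain', 'NULL'])
print(line, end='')  # 12,"Madrid,Spain",NULL
```

With the default quoting mode, the writer adds quotes only where they are needed, so no manual check for commas is required.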

Your code builds a new string for each line, so I build a new list for each row instead, as an equivalent way of writing it:

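import csv

# Python 2: the csv module expects files opened in binary mode
# (on Python 3, open them in text mode with newline='' instead)
with open('datos_terr.csv', 'rb') as incsv, open('salida.csv', 'wb') as outcsv: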
    # Read from the first, saying that tab is the field delimiter
    myreader = csv.reader(incsv, delimiter='\t')
    # , is the default, here for explanation
    mywriter = csv.writer(outcsv, delimiter=',')
    for row in myreader:
        # row is a list of the fields.
        newrow = list()
        for field in row:
            # No spaces allowed in fields
            field = field.strip()
            field = field.replace(' ', '')
            # single quotes to be removed, as per original code
            field = field.replace("'", '')
            if len(field) < 1:
                field = 'NULL'
            newrow.append(field)
        mywriter.writerow(newrow)
        # print ', '.join(newrow)
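As for why fields go missing in the original program, the likely cause is the branch that handles fields containing a comma: it appends the quoted field without the trailing tab that the other branches add, so that field fuses with the next one. The header line is also written unchanged, so it stays tab-separated. The sketch below is a simplified reproduction of the inner loop, not the full program; the function name join_like_original and the sample row are invented for the example.

```python
def join_like_original(fields):
    """Mimic the question's inner loop for one row that is already split."""
    new_line = ""
    for word in fields:
        word = word.strip(" ").replace(" ", "")
        if "," in word:
            # No "\t" is appended here, unlike the two branches below
            new_line += '"' + word + '"'
        elif not word.isdigit() and word not in ("NULL", ""):
            new_line += '"' + word + '"\t'
        else:
            new_line += word + "\t"
    return new_line.strip("\t").replace("\t", ",")


# Three input fields come out as two: the second and third are fused.
print(join_like_original(["1", "a,b", "2"]))  # 1,"a,b"2
```

Appending '"\t' in that first branch should restore the separator, although letting csv.writer do the quoting avoids the problem altogether.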

Why am I losing fields in my file?
