Question

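I have a set of csv files that I need to clean up and then load into a database. The files are tab-delimited and come in two formats. One format looks like this: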

Some text string

Field1\tField2\tField3\tField4

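Some text string always begins with the same sequence of characters, so I would like to use it to identify the files that need to be modified. From there I can drop the first two lines (the first line and the blank line that follows it).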

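I have been able to find the files that start with that string, but only by iterating over every line, which is not the best way to do what I am attempting.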

In the snippets here, csvFiles is a list of the csv files in a directory.

With the csv module:

for csvFile in csvFiles:
    with open(csvFile, newline='') as f:
        for line in f:
            if line.startswith("Some"):
                print("Found it")

With pandas:

for csvFile in csvFiles:
    standings = pandas.read_csv(csvFile, sep='\t', header=None, engine='python')
    for row in standings:
        if standings[row][0].startswith("Some"):
            print("Found it")

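I would like to simply select the first line and check it with an if statement, preferably in pandas, but I have had no success. pandas interprets the first line as the header and assigns row indexes to every line after it, so I cannot select the first line by index. I tried setting header=None so that every line would be indexed, but I still could not select the first line by index.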

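I am trying to figure out how to iterate over the files in the csvFiles list, find the ones that begin with Some text string, and drop the first two lines, along with some later rows, from only those files.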

My ideal solution would look something like this:

for csvFile in csvFiles:
    standings = pandas.read_csv(csvFile, sep='\t', header=None, engine='python')
    if standings[0][0].startswith("Some"):
        print("Found it")
        # do some stuff

Answer 1

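Since you are only matching one line of text, there is no benefit to using Pandas for that step (in fact, it would probably be slower and harder). But if you are careful, you only need to open each file once: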

import pandas as pd

for csvFile in csvFiles:
    with open(csvFile) as f:
        line = f.readline()
        if line.startswith("Some"):
            f.readline() # skip one more line (validate it if you like)
            df = pd.read_csv(f, sep='\t', header=None)
            # now you have the data you want

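The idea is to pass the open file handle to read_csv(), which continues reading from the point where you stopped after consuming the "metadata" you do not need.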
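To see the file-handle approach end to end, here is a minimal, self-contained sketch. The helper name load_standings, the "Some" marker default, and the sample file contents are made up for illustration; it also assumes the marker line is always followed by exactly one blank line.

```python
import os
import tempfile

import pandas as pd


def load_standings(path, marker="Some"):
    """Return a DataFrame if the file's first line starts with `marker`, else None.

    Hypothetical helper: the name and the default marker are illustrative only.
    """
    with open(path, newline="") as f:
        first_line = f.readline()
        if not first_line.startswith(marker):
            return None  # the other format; handle it separately
        f.readline()  # consume the blank line after the marker
        # read_csv continues from the handle's current position
        return pd.read_csv(f, sep="\t", header=None)


# Build two throwaway files, one per format, to exercise the helper.
with tempfile.TemporaryDirectory() as tmp:
    flagged = os.path.join(tmp, "flagged.csv")
    plain = os.path.join(tmp, "plain.csv")
    with open(flagged, "w") as out:
        out.write("Some text string\n\na\tb\tc\td\n1\t2\t3\t4\n")
    with open(plain, "w") as out:
        out.write("a\tb\tc\td\n1\t2\t3\t4\n")

    df = load_standings(flagged)    # DataFrame built from the two data lines
    skipped = load_standings(plain) # None, because the marker is absent
```

Returning None for unmarked files keeps the check and the parse in a single pass over each file, so the caller can decide how to load the other format.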

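You may also want to pass column names and/or types to read_csv(), so that your DataFrame looks the way you want without further manipulation. Specifying dtype up front can also speed up parsing of large files.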
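As a rough illustration of passing names and types, the sketch below uses the names and dtype parameters of read_csv(). The column names, the types, and the sample data are all hypothetical, and it assumes the data lines carry no header row of their own.

```python
import io

import pandas as pd

# Hypothetical schema; replace with the real column names and types.
columns = ["team", "wins", "losses", "pct"]
dtypes = {"team": str, "wins": int, "losses": int, "pct": float}

# In-memory stand-in for a file in the "marker line + blank line" format.
sample = io.StringIO(
    "Some text string\n"
    "\n"
    "Lions\t10\t2\t0.833\n"
    "Tigers\t7\t5\t0.583\n"
)

sample.readline()  # marker line
sample.readline()  # blank line

# names labels the columns; dtype fixes their types at parse time.
df = pd.read_csv(sample, sep="\t", header=None, names=columns, dtype=dtypes)
```

If the files do contain a header row after the blank line, dropping header=None and names lets read_csv() take the column labels from that row instead.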

Deleting rows in Python using the csv and/or pandas modules (data wrangling)
