Delete the trailing lines containing a "string" from CSV files using a regular expression

Time: 2017-02-26 01:40:51

Tags: python csv

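I'm new to Python. I have thousands of CSV files in which a block of text appears after the numeric data has been recorded, and I want to delete every line from the point where the text begins. For example: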

col 1    col 2    col 3
--------------------
10      20        30
--------------------
45      34        56
--------------------
Start   8837sec    9items
--------------------
Total   6342sec   755items

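The good thing is that in all the CSV files the text begins with "Start" in column 1. I want to delete every line after that, including the line that says "Start".

This is what I have done so far: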

import csv, os, re, sys


fileList = []

pattern = [r"\b(Start).*", r"\b(Total).*"]

for file in files:
    fullname = os.path.join(cwd, file)

    if not os.path.isdir(fullname) and not os.path.islink(fullname):
        fileList.append(fullname)


for file in fileList:
    try:
        ifile = open(file, "r")
    except IOError:
        sys.stderr.write("File %s not found! Please check the filename." %(file))
        sys.exit()
    else:
        with ifile:
            reader = csv.reader(ifile)
            writer = csv.writer(ifile)
            rowList = []     
            for row in reader:
               rowList.append((", ".join(row)))

        for pattern in word_pattern:
             if not (re.match(pattern, rowList)
                writer.writerow(elem)

After running this script, I get blank CSV files. Any idea what to change?

2 answers:

Answer 0: (score: 0)

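You don't need a CSV reader. You can simply find the offset and truncate the file. Open the file in binary mode, use a multiline regex to find the pattern in the text, and use the index of the match.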

import os
import re

# multiline, ascii only regex matches Start or Total at start of line
start_tag_finder = re.compile(rb'(?am)\nStart|\nTotal').search

for filename in files: # TODO: I'm not sure where "files" comes from...
    # NOTE: no need to join cwd, relative paths do that automatically
    if not os.path.isdir(filename) and not os.path.islink(filename):
        with open(filename, 'rb+') as f:
            # NOTE: you can cap file size if you'd like
            if os.stat(filename).st_size > 1000000:
                print(filename, "exceeds the 1 MB size limit")
                continue
            search = start_tag_finder(f.read())
            if search:
                f.truncate(search.start())
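
The same find-the-offset-and-truncate idea can be sketched as a small reusable function. This is only a sketch under assumptions: the name `truncate_at_start` and the pattern `START_LINE` are made up for illustration, and the pattern uses `^` with `re.MULTILINE` rather than a leading `\n`, so the newline that ends the last data row is kept and a "Start" on the very first line is also found.

```python
import re

# Bytes pattern: "Start" at the beginning of any line (hypothetical name).
START_LINE = re.compile(rb"^Start", re.MULTILINE)


def truncate_at_start(path):
    """Cut the file at the first line that begins with b"Start".

    Returns True if the file was truncated, False if no such line exists.
    """
    with open(path, "rb+") as f:
        match = START_LINE.search(f.read())
        if match is None:
            return False
        # Everything from the start of the matching line onward is dropped.
        f.truncate(match.start())
        return True
```

Calling `truncate_at_start(filename)` inside the loop would replace the search and truncate steps; a file with no "Start" line is left untouched.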

Answer 1: (score: 0)

Once you have put fileList together, I would try this:

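import csv

for file in fileList:
    keepRows = []
    with open(file, 'r', newline='') as oFile:
        for row in csv.reader(oFile):
            # Stop at the first row whose first column starts with "Start"
            if row and row[0].lstrip().startswith("Start"):
                break
            keepRows.append(row)
    # 'w+' truncates the file, so only the kept rows are written back
    with open(file, 'w+', newline='') as nFile:
        writer = csv.writer(nFile, delimiter=',')
        for row in keepRows:
            writer.writerow(row)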

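This opens your original file, collects the rows you want, closes it, and reopens it in w+ mode. That overwrites the file, keeping its name but clearing it by truncation, and then writes each row you want to keep on its own line in the cleared file.

Alternatively, you can create a new file for each CSV: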

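for file in fileList:
    keepRows = []
    # NOTE: every input is appended to the same 'new_file.csv';
    # use a per-input name if you want one output file per CSV
    with open(file, 'r') as oFile, open('new_file.csv', 'a') as nFile:
        for row in oFile:
            # Each row is a line of text; stop at the first one starting with "Start"
            if row.lstrip().startswith("Start"):
                break
            keepRows.append(row)
        for row in keepRows:
            nFile.write(row)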

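Opening with a places the cursor at the end of the file each time, because this is append mode. The .writerow method used earlier writes each item of a row into its own column and then moves on to the next line, whereas nFile.write(row) here writes each kept line exactly as it was read.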
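
If one cleaned output per input is what you are after, a minimal sketch could look like the following. The helper names `copy_until_start` and `cleaned_name`, and the `_clean` suffix, are assumptions made for illustration and are not part of the original answer. Lines are copied as text until the first one that begins with the marker.

```python
import os


def copy_until_start(src, dst, marker="Start"):
    """Copy lines from src to dst, stopping before the first line
    that begins with marker. Returns the number of lines copied."""
    kept = 0
    # newline="" leaves the original line endings untouched
    with open(src, "r", newline="") as in_file, \
            open(dst, "w", newline="") as out_file:
        for line in in_file:
            if line.lstrip().startswith(marker):
                break
            out_file.write(line)
            kept += 1
    return kept


def cleaned_name(path):
    """Hypothetical naming scheme: data.csv -> data_clean.csv."""
    root, ext = os.path.splitext(path)
    return root + "_clean" + ext
```

Inside the loop this would be called as `copy_until_start(file, cleaned_name(file))`, which leaves the original files unchanged so the result can be checked before anything is deleted.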

Edit: updated the syntax.