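I'm new to Python. I have thousands of CSV files in which a block of text appears after the numeric data, and I want to delete every row from the point where that text begins. For example: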
col 1 col 2 col 3
--------------------
10 20 30
--------------------
45 34 56
--------------------
Start 8837sec 9items
--------------------
Total 6342sec 755items
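Fortunately, in every CSV file the text begins with "Start" in column 1. I want to delete every row from that point on, including the row that contains "Start".
This is what I have done so far: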
import csv, os, re, sys
fileList = []
pattern = [r"\b(Start).*", r"\b(Total).*"]
for file in files:
    fullname = os.path.join(cwd, file)
    if not os.path.isdir(fullname) and not os.path.islink(fullname):
        fileList.append(fullname)
for file in fileList:
    try:
        ifile = open(file, "r")
    except IOError:
        sys.stderr.write("File %s not found! Please check the filename." %(file))
        sys.exit()
    else:
        with ifile:
            reader = csv.reader(ifile)
            writer = csv.writer(ifile)
            rowList = []
            for row in reader:
                rowList.append((", ".join(row)))
            for pattern in word_pattern:
                if not (re.match(pattern, rowList)
                    writer.writerow(elem)
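After I run this script, it leaves me with blank CSV files. Any idea what I should change?
Answer 0 (score: 0)
You don't need a CSV reader. You can simply find the offset and truncate the file. Open the file in binary mode, use a multiline regex to find the pattern in the contents, and truncate at the index of the match.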
import os
import re
# multiline regex: matches Start or Total at the beginning of a line
start_tag_finder = re.compile(rb'(?m)^(?:Start|Total)').search
for filename in files:  # TODO: I'm not sure where "files" comes from...
    # NOTE: no need to join cwd, relative paths do that automatically
    if not os.path.isdir(filename) and not os.path.islink(filename):
        # NOTE: you can cap file size if you'd like
        if os.stat(filename).st_size > 1000000:
            print(filename, "overflowed 1 MB size limit")
            continue
        with open(filename, 'rb+') as f:
            search = start_tag_finder(f.read())
            if search:
                # cut the file at the start of the matching line
                f.truncate(search.start())
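The find-the-offset-and-truncate idea can also be wrapped in a small function so it is easy to try on a single file first. This is only a sketch: the name truncate_at_marker and the default pattern are my own assumptions, and the pattern additionally tolerates leading spaces or tabs before the marker, which may or may not match your files.

```python
import re

# Assumed default pattern: a line that begins (after optional spaces/tabs)
# with the word Start or Total. Adjust it to the real layout of your files.
MARKER = re.compile(rb"(?m)^[ \t]*(?:Start|Total)\b")


def truncate_at_marker(path, marker=MARKER):
    """Cut the file at the first line matching `marker`.

    Returns True if the file was truncated, False if no marker was found.
    (Hypothetical helper, not part of any library.)
    """
    with open(path, "rb+") as f:
        match = marker.search(f.read())
        if match is None:
            return False
        # truncate() takes an absolute size, so the read position is irrelevant
        f.truncate(match.start())
        return True
```

Because the match starts at the beginning of a line, the newline that ends the last data row is kept. If your files really contain a separator line directly above the Start row, that line would stay in the file and the pattern would need to account for it.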
Answer 1 (score: 0)
Once you have put fileList together, I would try this:
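import csv

for file in fileList:
    keepRows = []
    with open(file, 'r', newline='') as oFile:
        for row in csv.reader(oFile):
            # stop at the first row whose first column starts with "Start"
            if row and row[0].strip().startswith("Start"):
                break
            keepRows.append(row)
    # reopening with w+ truncates the file before the kept rows are written
    with open(file, 'w+', newline='') as nFile:
        writer = csv.writer(nFile, delimiter=',')
        writer.writerows(keepRows)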
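This opens your original file, collects the rows you want to keep, closes it, and reopens it with w+. Reopening that way keeps the file name but truncates the contents, and each row you kept is then written back to the emptied file on its own line.
Alternatively, you can write the rows you keep to a new file: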
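One caveat with overwriting in place is that a crash between truncating and writing can leave a file empty. A possible variation, shown here only as a sketch, writes the kept rows to a temporary file in the same directory and then swaps it in with os.replace. The function name drop_rows_from_marker and its marker parameter are assumptions of mine, and the sketch assumes comma-separated files whose marker sits in the first column.

```python
import csv
import os
import tempfile


def drop_rows_from_marker(path, marker="Start"):
    """Rewrite `path`, keeping only the rows before the first marker row.

    Hypothetical helper: rows are copied to a temporary file that then
    replaces the original, so the original is never left half-written.
    """
    directory = os.path.dirname(os.path.abspath(path))
    with open(path, "r", newline="") as src, tempfile.NamedTemporaryFile(
        "w", newline="", dir=directory, delete=False, suffix=".tmp"
    ) as tmp:
        writer = csv.writer(tmp)
        for row in csv.reader(src):
            # stop at the first row whose first column starts with the marker
            if row and row[0].strip().startswith(marker):
                break
            writer.writerow(row)
    # both files are closed here; swap the temporary file into place
    os.replace(tmp.name, path)
```

With thousands of files it is probably worth running this on a copy of a few of them first. Note that if an error occurs mid-copy, the sketch leaves the .tmp file behind rather than cleaning it up.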
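for file in fileList:
    keepRows = []
    # NOTE: every input appends to the same new_file.csv; vary the name
    # per input if you want one output file per CSV
    with open(file, 'r') as oFile, open('new_file.csv', 'a') as nFile:
        for row in oFile:
            # stop at the first line that begins with "Start"
            if row.lstrip().startswith("Start"):
                break
            keepRows.append(row)
        for row in keepRows:
            nFile.write(row)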
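Opening with a places the cursor at the end of the file each time, because a means append. The .writerow method used earlier iterates over the object it is given, treating it as one group: it writes each item in the group to its own column, then moves to the next line and does the same on the next call. Here each row in keepRows is already a complete line of text, so it is written as-is and needs no such iteration.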
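To make the behaviour of the csv writer methods concrete, here is a small in-memory comparison. It is a sketch with made-up rows, and the render helper is hypothetical; only csv.writer, writerow and writerows come from the standard library.

```python
import csv
import io

rows = [["10", "20", "30"], ["45", "34", "56"]]


def render(write):
    """Run `write` against a csv.writer on an in-memory buffer; return the text."""
    buf = io.StringIO()
    write(csv.writer(buf, lineterminator="\n"))
    return buf.getvalue()


# writerows takes the whole list and emits one line per inner row
one_call = render(lambda w: w.writerows(rows))

# calling writerow once per row produces the same text
looped = render(lambda w: [w.writerow(r) for r in rows])

# wrapping the list in [] makes it a single field on a single line
nested = render(lambda w: w.writerow([rows]))
```

one_call and looped come out identical, with one line per row and one column per item, while nested collapses everything into a single quoted field on one line, which is usually not what you want.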
Edit: updated the syntax, including the r file mode.