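Hello, I am using pandas to read a series of files and concatenate them into a DataFrame. Each file starts with a variable-length block of junk lines that I want to ignore. pd.read_csv() has a skiprows parameter, and I have written a function that handles this case, but I have to open each file twice to make it work. Is there a better way?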
    import pandas as pd

    HEADER = '#Start'

    def header_index(file_name):
        # Return the index of the first line that starts with HEADER
        with open(file_name) as fp:
            for ind, line in enumerate(fp):
                if line.startswith(HEADER):
                    return ind

    for row in directories:
        path2file = '%s%s%s' % (path2data, row, suffix)
        myDF = pd.read_csv(path2file, skiprows=header_index(path2file), header=0, delimiter='\t')
Any help is greatly appreciated.
Answer 0 (score: 0)
This is now possible (I don't know whether it was possible before) as follows:
    # fp must be an open file object; path2file is the name used in the question
    with open(path2file) as fp:
        pos = 0
        oldpos = None
        while pos != oldpos:  # make sure we stop reading in case we reach EOF
            line = fp.readline()
            if line.startswith(HEADER):
                # set the read position to the start of the line
                # so pandas can read the header
                fp.seek(pos)
                break
            oldpos = pos
            pos = fp.tell()  # remember this position as the start of the next line
        pd.read_csv(fp, ...your options here...)
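The same tell()/seek() idea can be wrapped in a small helper. This is only a sketch: the name seek_to_marker and the in-memory sample data are illustrative assumptions, not part of the original answer, and the block uses only the standard library so it can be run without pandas.

```python
import io

HEADER = '#Start'

def seek_to_marker(fp, marker=HEADER):
    """Position fp at the start of the first line beginning with marker.

    Returns True if such a line was found. Otherwise fp is left at EOF
    and False is returned. (Hypothetical helper, for illustration only.)
    """
    while True:
        pos = fp.tell()        # start of the line we are about to read
        line = fp.readline()
        if not line:           # readline() returns '' at EOF
            return False
        if line.startswith(marker):
            fp.seek(pos)       # rewind so the marker line is read next
            return True

# Illustrative data: two junk lines, then a tab-separated table
sample = io.StringIO("junk 1\njunk 2\n#Start\tvalue\n1\t2\n")
found = seek_to_marker(sample)
remaining = sample.read()      # everything from the marker line onwards
```

With a real file, the handle would be passed on after the call, for example pd.read_csv(fp, delimiter='\t'), so each file is opened only once. Checking the return value lets the caller skip files that contain no marker line instead of handing pandas an exhausted handle.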
Answer 1 (score: 0)
Since read_csv() also accepts file-like objects, you can skip the leading junk lines before passing that object, instead of passing the file name.
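Example:
Replace
    df = pd.read_csv(filename, skiprows=no_junk_lines(filename), ...)
with:
    def forward_csv(f, prefix):
        pos = 0
        while True:
            line = f.readline()
            if not line or line.startswith(prefix):
                # rewind to the start of the matching line (or stay at EOF)
                f.seek(pos)
                return f
            # advance by the encoded length of the line just read
            pos += len(line.encode('utf-8'))

    df = pd.read_csv(forward_csv(open(filename), HEADER), ...)
Notes:
- readline() returns an empty string when EOF is reached.
- Tracking the position manually saves some tell() calls (lseek system calls).
- The last line of forward_csv() assumes that your input file is encoded in ASCII or UTF-8; if it is not, you must adjust this line.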
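One way to sidestep the encoding assumption is to scan the file in binary mode, where the length of each line is its exact byte offset. The following is a sketch under that assumption; forward_binary is a hypothetical name and the sample bytes are made up for illustration.

```python
import io

def forward_binary(f, prefix):
    """Advance a binary file object to the first line starting with prefix.

    f must be opened in binary mode and prefix must be bytes.
    (Hypothetical variant of forward_csv, for illustration only.)
    """
    pos = 0
    while True:
        line = f.readline()
        if not line or line.startswith(prefix):
            f.seek(pos)        # start of the matching line, or EOF
            return f
        pos += len(line)       # bytes read so far, independent of encoding

# Illustrative data containing a non-UTF-8 (Latin-1) junk line
raw = "caf\u00e9 junk\n".encode("latin-1") + b"#Start\tvalue\n1\t2\n"
handle = forward_binary(io.BytesIO(raw), b"#Start")
rest = handle.read()
```

As far as I know, recent pandas versions accept a binary file handle in read_csv(), in which case the file's actual encoding would be given through the encoding argument; that part is not verified here.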