Find and remove elements from a list while keeping their positions for later re-insertion

Time: 2013-10-22 00:21:54

Tags: python python-2.7

I am using the following in Python 2.7:

dfile = 'new_data.txt'   #  Depth file no. 1
d_row = [line.strip() for line in open(dfile)]

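I have loaded the data file into a list with the newline characters stripped. Now I want to index every element of d_row whose string is empty or does not start with a digit. After that, I need to:

  1. Remove all of the non-numeric instances described above, and
  2. Save the strings and their indices so they can be inserted into the updated file later.

Data sample: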

    Thu Mar 14 18:17:05 2013                                                       
    Fri Mar 15 01:40:25 2013
    
    FT
    
    DepthChange: 0.000000,2895.336,0.000
    1363285025.250000,9498.970
    1363285025.300000,9498.970
    1363285026.050000,9498.970
    1363287840.450042,9458.010
    1363287840.500042,9458.010
    1363287840.850042,9458.010
    1363287840.900042,9458.010
    DepthChange: 0.000000,2882.810,9457.200
    1363287840.950042,9458.010
    DepthChange: 0.000000,2882.810,0.000
    1363287841.000042,9457.170
    1363287841.050042,9457.170
    1363287841.100042,9457.170
    1363287841.150042,9457.170
    1363287841.200042,9457.170
    1363287841.250042,9457.170
    1363287841.300042,9457.170
    1363291902.750102,9149.937
    1363291902.800102,9149.822
    1363291902.850102,9149.822
    1363291902.900102,9149.822
    1363291902.950102,9149.822
    1363291903.000102,9149.822
    1363291903.050102,9149.708
    1363291903.100102,9149.708
    1363291903.150102,9149.708
    1363291903.200102,9149.708
    1363291903.250102,9149.708
    1363291903.300102,9149.592
    1363291903.350102,9149.592
    1363291903.400102,9149.592
    1363291903.450102,9149.592
    1363291903.500102,9149.592
    DepthChange: 0.000000,2788.770,2788.709
    1363291903.550102,9149.479
    1363291903.600102,9149.379
    

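I have been doing the removal step by hand, which is very time-consuming because the file contains more than half a million lines. So far I have not been able to rewrite a file that contains all of the original elements with a few modifications.

Any tips would be greatly appreciated.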

3 answers:

Answer 0: (score: 0)

dfile = 'new_data.txt'
with open(dfile) as infile:
  numericLines = set() # line numbers of lines that start with a digit
  emptyLines = set() # line numbers of lines that are empty
  charLines = [] # (line number, text) pairs for lines that start with a letter
  for lineno, line in enumerate(infile):
    if line[0].isalpha():  # must be called: a bare line[0].isalpha is always truthy
      charLines.append((lineno, line.strip()))
    elif line[0].isdigit():
      numericLines.add(lineno)
    elif not line.strip():
      emptyLines.add(lineno)

Answer 1: (score: 0)

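The simplest way is to do it in two passes: first collect the matching lines with their line numbers, then collect the non-matching lines with theirs.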

d_rows = [line.strip() for line in open(dfile)]
good_rows = [(i, row) for i, row in enumerate(d_rows) if is_good_row(row)]
bad_rows = [(i, row) for i, row in enumerate(d_rows) if not is_good_row(row)]

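This does mean making two passes over the list, but who cares? If the list is small enough to read entirely into memory, as you are already doing, the extra cost is probably negligible.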
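The snippets in this answer call is_good_row without defining it. A minimal sketch, assuming that a "good" row is simply one whose first character is a digit (the function body and the sample list are illustrative, not part of the original answer):

```python
def is_good_row(row):
    """Return True if the row begins with a digit (assumed meaning of 'good')."""
    # row[:1] is '' for an empty string, and ''.isdigit() is False,
    # so blank lines are classified as bad without raising IndexError.
    return row[:1].isdigit()


# A few lines taken from the data sample in the question, already stripped.
sample = [
    "Thu Mar 14 18:17:05 2013",
    "",
    "DepthChange: 0.000000,2895.336,0.000",
    "1363285025.250000,9498.970",
    "1363285025.300000,9498.970",
]
good_rows = [(i, row) for i, row in enumerate(sample) if is_good_row(row)]
bad_rows = [(i, row) for i, row in enumerate(sample) if not is_good_row(row)]
```

Slicing with row[:1] rather than indexing with row[0] matters here because the rows have been stripped, so blank lines arrive as empty strings.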

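Alternatively, if you need to avoid the cost of building two lists in two passes, you probably also need to avoid reading the whole file at once, so you have to be a bit more clever: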

d_rows = (line.strip() for line in open(dfile)) # notice genexp, not list comp
good_rows, bad_rows = [], []
for i, row in enumerate(d_rows):
    if is_good_row(row):
        good_rows.append((i, row))
    else:
        bad_rows.append((i, row))

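If you can push things even further and do not need explicit good_rows and bad_rows lists at all, you can keep everything in iterators and waste no memory or up-front reading time: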

d_rows = (line.strip() for line in open(dfile)) # notice genexp, not list comp
with open(outfile, 'w') as f:
    for i, row in enumerate(d_rows):
        if is_good_row(row):
            f.write(row + '\n')
        else:
            whatever_you_wanted_to_do_with(i, row)

Answer 2: (score: 0)

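Thanks to everyone who answered my question. Using parts of each reply, I was able to get the desired result. What finally worked is the following: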

import numpy as np

goodrow_ind, badrow_ind, badrows = [], [], []

# ifile and ofile are the input and output file names.
# The with block closes both files; calling ifile.close() on the name would fail.
with open(ifile) as fin, open(ofile, 'w') as f:
    for i, row in enumerate(fin):
        if row[0].isdigit():
            f.write(row)
            goodrow_ind.append(i)
        else:
            badrow_ind.append(i)
            badrows.append(row)

data = np.loadtxt(ofile, delimiter=',')

The result is that the "good" and "bad" rows are separated, each with its own indices.
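The question also asks for the removed lines to be put back later, which the code above does not show. A possible sketch, assuming the saved indices are positions in the original file, that the kept rows are still in their original order, and that their count has not changed (reinsert_rows and the sample values are hypothetical names and data for illustration):

```python
def reinsert_rows(good_rows, bad_indices, bad_rows):
    """Rebuild the original line order from the kept rows and the removed rows."""
    removed = dict(zip(bad_indices, bad_rows))  # original index -> removed line
    good_iter = iter(good_rows)
    total = len(good_rows) + len(removed)
    merged = []
    for i in range(total):
        if i in removed:
            # This position held a removed line in the original file.
            merged.append(removed[i])
        else:
            # Otherwise take the next kept (possibly updated) row.
            merged.append(next(good_iter))
    return merged


# Illustrative data in the shape produced by the loop above.
badrow_ind = [0, 1, 3]
badrows = ["Thu Mar 14 18:17:05 2013\n", "\n", "DepthChange: 0.000000,2895.336,0.000\n"]
updated_good = ["1363285025.250000,9498.970\n", "1363285025.300000,9498.970\n"]
merged = reinsert_rows(updated_good, badrow_ind, badrows)
```

The merged list can then be written out with f.writelines(merged). If rows are added to or dropped from the numeric data in between, the saved indices no longer line up and a different bookkeeping scheme would be needed.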