Question

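I have several large text files (30 million+ rows, > 1 GB) that are processed in ArcGIS after being split (see "Remove specific lines from a large text file in python" and "chunk a text database into N equal blocks and retain header" for background).

Even after splitting, the process takes 3+ days, so I want to remove all xy points that have an (Rx) value less than or equal to 0.

I have not been able to get Python to read txt datasets over 500 MB, so I have used Cygwin/sed commands to do the initial cleaning of the data and then Python to chunk the file. Ideally the process would be to add some code to the Python script (see below) that excludes all lines where Rx <= 0.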

Latitude    Longitude   Rx  Best_Unit
-16.37617    144.68805  -012.9  7
-16.37617    144.68834  -015.1  7
-16.37617    144.68861  -017.2  7
-16.37617    144.68890  -018.1  7
-16.37617    144.68919  -025.0  7
-16.37617    144.68945  -019.5  7
-16.37617    144.68974  -020.0  7
-16.37617    144.69003  -020.4  7
-16.37617    144.69623   015.3  7
-16.37617    144.69652   015.6  7
-16.37617    144.69679   015.8  7
-16.37617    144.69708   016.0  7
-16.37617    144.70076   005.0  7
-16.37617    144.70103   002.2  7
-16.37617    144.70131  -000.2  7
-16.37617    144.70160  -001.5  7
-16.37617    144.70187  -001.0  7
-16.37617    144.70216   000.7  7
-16.37617    144.70245   002.2  7
-16.37617    144.70273   008.4  7
-16.37617    144.70300   017.1  7
-16.37617    144.70329   017.2  7

I want all rows (lines) with Rx > 0 written to the new text files. I also want to drop the Best_Unit column.

from itertools import islice

import arcpy, os
#fc = arcpy.GetParameter(0)
#chunk_size = arcpy.GetParameter(1) # number of records in each dataset

fc='cb_vhn007_5.txt'
Name = fc[:fc.rfind('.')]
fl = Name+'.txt'

headers_count = 1
chunk_size = 500000

with open(fl) as fin:
  headers = list(islice(fin, headers_count))

  part = 1
  while True:
    line_iter = islice(fin, chunk_size)
    try:
      first_line = line_iter.next()
    except StopIteration:
      break
    with open(Name+'_%d.txt' % part, 'w') as fout:
      for line in headers:
        fout.write(line)
      fout.write(first_line)
      for line in line_iter:
         ## add something here to check if value after third tab
         ## is >0 and if so then write the row or skip.
        fout.write(line) 

    print "Created part %d" % part
    part += 1

New code. The first line of each part still includes negative Rx values.

from itertools import islice

import arcpy, os
#fc = arcpy.GetParameter(0)
#chunk_size = arcpy.GetParameter(1) # number of records in each dataset

fc='cb_vhn007_5.txt'
Name = fc[:fc.rfind('.')]
fl = Name+'.txt'

headers_count = 1
chunk_size = 500000

with open(fl) as fin:
  headers = list(islice(fin, headers_count))

  part = 1
  while True:
    line_iter = islice(fin, chunk_size)
    try:
      first_line = line_iter.next()
    except StopIteration:
      break
    with open(Name+'_%d.txt' % part, 'w') as fout:
      for line in headers:
        fout.write(line)
      fout.write(first_line)
      for line in line_iter:
        if line.split()[2][0:1] != '-':
          #print line.split()[2]
          fout.write(line)

    print "Created part %d" % part
    part += 1

Answer 1

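It is probably enough to check line[24] != '-' (this assumes the fixed-width layout of the sample, where the sign of Rx sits at character index 24).

i.e. replace:

fout.write(line)

with

if line[24] != '-':
  fout.write(line)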

Answer 2

You can use line.split() to split your line into a list containing the values of each of the 4 columns.

For example:

line='-16.37617\t144.70329\t017.2\t7'
line.split()
# ['-16.37617', '144.70329', '017.2', '7']

Then you can coerce line.split()[2] (remember that Python uses 0-based indexing) to a number and check whether it is > 0:

if float(line.split()[2]) > 0:
    fout.write(line)

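Or you can check that it has no minus sign (note that, unlike the numeric test, this keeps a value of exactly 000.0):

if line.split()[2].find('-') == -1:
    fout.write(line)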

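If your columns may be in a different order in each text file, you can do a split() on the header line, work out which column is Rx, and use that index instead of 2:

i = headers[0].split().index('Rx')
# now use line.split()[i]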
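
Putting the pieces of this answer together, here is a minimal sketch rather than a definitive implementation. The helper names find_rx_index and filter_row are made up for illustration, the code is written for Python 3 (the question's script is Python 2), and it assumes Best_Unit is always the last column so it can be dropped by discarding the final field.

```python
def find_rx_index(header_line, name="Rx"):
    """Return the 0-based position of the named column in the header line."""
    return header_line.split().index(name)


def filter_row(line, rx_index):
    """Return the row without its last column if Rx > 0, otherwise None."""
    fields = line.split()
    try:
        rx = float(fields[rx_index])
    except (IndexError, ValueError):
        return None  # malformed or non-numeric row: skip it
    if rx <= 0:
        return None
    # Assumption: Best_Unit is the last column, so drop the final field.
    return "\t".join(fields[:-1]) + "\n"


header = "Latitude\tLongitude\tRx\tBest_Unit\n"
i = find_rx_index(header)
print(repr(filter_row("-16.37617\t144.69623\t015.3\t7\n", i)))
print(repr(filter_row("-16.37617\t144.70131\t-000.2\t7\n", i)))
```

In the chunking loop, each line would be passed through filter_row and written only when the result is not None.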

Answer 3

I know it isn't Python, but it might be the right tool for the job:

cat cb_vhn007_5.txt | awk '($3 > 0) {print $0}' > parsedfile

Answer 4

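The line you are looking for is this:

if line.split()[2][0:1] != "-":
  fout.write(line)

This splits the input, looks at the third entry, looks at its first character, and skips the line if that character is -.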
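
One thing to watch in the question's loop: the first line of each chunk is written before any check runs, so a negative row can still slip through there. A possible way around this, sketched below for Python 3, is to filter the stream first and chunk the filtered result. The names positive_rx and chunks are hypothetical helpers, not part of the original script, and the sign check keeps a value of exactly 000.0.

```python
from itertools import islice


def positive_rx(lines, rx_index=2):
    """Yield only the lines whose Rx field does not start with '-'."""
    for line in lines:
        fields = line.split()
        if len(fields) > rx_index and not fields[rx_index].startswith("-"):
            yield line


def chunks(lines, chunk_size):
    """Yield successive lists of at most chunk_size lines."""
    lines = iter(lines)
    while True:
        chunk = list(islice(lines, chunk_size))
        if not chunk:
            return
        yield chunk


rows = [
    "-16.37617\t144.68805\t-012.9\t7\n",
    "-16.37617\t144.69623\t015.3\t7\n",
    "-16.37617\t144.70131\t-000.2\t7\n",
    "-16.37617\t144.70216\t000.7\t7\n",
    "-16.37617\t144.70245\t002.2\t7\n",
]
for part, chunk in enumerate(chunks(positive_rx(rows), 2), start=1):
    print(part, len(chunk))
```

Because the filter runs before the chunking, every output part holds up to chunk_size kept rows, rather than chunk_size input rows minus the rejected ones.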

Answer 5

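Here is a Python script that reads a file whose data is formatted as four space-separated fields per line, checks the third field, and outputs every line where the third field is a positive float.

Tested with Python 2.7.2.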

import re

in_fh = open ("gis.txt","r")
out_fh = open ("outfile.txt","w")

for row in in_fh:
    row = re.sub(' +',',',row) # convert to comma-separated format
    try:
        latitude, longitude, rx, best_unit = row.split(',')
    except ValueError: # row didn't have four fields
        print ("complain - not four fields")
        continue

    try:
        float_rx = float(rx)
    except ValueError: # rx could not be cast to float
        print ("complain - third field not float")
        continue

    if float_rx > 0:
        out_fh.write(latitude + "," + longitude + "," + rx + "\n")
    else:
        pass # discard the row

in_fh.close()
out_fh.close()

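It processes only one line at a time, so memory usage should stay constant regardless of the size of the input and output files.

Alternatively, have you considered using a database? sqlite3 is built in and would probably handle 1 GB of data. You could then get this result with SELECT * FROM data WHERE rx > 0.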
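
As a rough illustration of the database idea, the sketch below loads a few of the sample rows into an in-memory SQLite database and runs that query. The table name data and its column names are assumptions chosen to match the query above; for the real files you would connect to a file path instead of ":memory:" and feed the rows from the open file. Written for Python 3.

```python
import sqlite3

sample = """Latitude    Longitude   Rx  Best_Unit
-16.37617    144.69003  -020.4  7
-16.37617    144.69623   015.3  7
-16.37617    144.70131  -000.2  7
-16.37617    144.70216   000.7  7
"""

conn = sqlite3.connect(":memory:")  # use a file path for real data
conn.execute("CREATE TABLE data (latitude REAL, longitude REAL, rx REAL)")

lines = sample.splitlines()[1:]  # skip the header row
# Keep only the first three fields, which drops Best_Unit.
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?)",
    (tuple(float(v) for v in line.split()[:3]) for line in lines),
)

positive = conn.execute(
    "SELECT latitude, longitude, rx FROM data WHERE rx > 0"
).fetchall()
print(positive)
conn.close()
```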

Remove rows with a specific attribute less than or equal to 0

5 answers: