Question

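I have several large text files that all share the same structure. I want to delete the first 3 lines and then remove illegal characters from the 4th line. I don't want to read the entire dataset and then modify it, because each file is over 100 MB with more than 4 million records.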

Range   150.0dB -64.9dBm
Mobile unit 1   Base    -17.19968    145.40369  999.8
Fixed unit  2   Mobile  -17.20180    145.29514  533.0
Latitude    Longitude   Rx(dB)  Best unit
-17.06694    145.23158  -050.5  2
-17.06695    145.23297  -044.1  2

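So lines 1, 2 and 3 should be deleted, and on line 4 "Rx(dB)" should become just "Rx" and "Best Unit" should be changed to "Best_Unit". Then I can use my other scripts to geocode the data.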

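I can't use a command-line program like grep (as in this question), because the first 3 lines are not identical across files: the numbers (e.g. 150.0dB, -64*) change in each file. So you have to delete lines 1-3 as whole lines, and then grep or something similar can do the search-and-replace on line 4.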

Thanks guys,

=== Edit: a new Pythonic approach for handling larger files, from @heltonbiker. It raises an error.

import os, re
##infile = arcpy.GetParameter(0)
##chunk_size = arcpy.GetParameter(1) # number of records in each dataset

infile='trc_emerald.txt'
fc= open(infile)
Name = infile[:infile.rfind('.')]
outfile = Name+'_db.txt'

line4 = fc.readlines(100)[3]
line4 = re.sub('\([^\)].*?\)', '', line4)
line4 = re.sub('Best(\s.*?)', 'Best_', line4)
newfilestring = ''.join(line4 + [line for line in fc.readlines[4:]])
fc.close()
newfile = open(outfile, 'w')
newfile.write(newfilestring)
newfile.close()

del lines
del outfile
del Name
#return chunk_size, fl
#arcpy.SetParameterAsText(2, fl)
print "Completed"

Traceback (most recent call last):
  File "P:\2012\Job_044_DM_Radio_Propogation\Working\FinalPropogation\TRC_Emerald\working\clean_file_1c.py", line 13, in <module>
    newfilestring = ''.join(line4 + [line for line in fc.readlines[4:]])
TypeError: 'builtin_function_or_method' object is unsubscriptable

Answer 1

As wim said in the comments, sed is the right tool for this. The following command should do what you want:

sed -i -e '4 s/(dB)//' -e '4 s/Best Unit/Best_Unit/' -e '1,3 d' yourfile.whatever

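A brief explanation of the command:

-i edits the file in place, i.e. writes the output back to the input file

-e adds a command to execute

'4 s/(dB)//' on line 4, replaces '(dB)' with '' (nothing)

'4 s/Best Unit/Best_Unit/' same as above, but with different find and replace strings

'1,3 d' deletes the entire line, for lines 1 through 3 (inclusive)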

sed is a very powerful tool that can do much more, and it is well worth learning.
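For readers who would rather stay in Python, the same in-place edit can be sketched with the standard library's fileinput module. This is only an illustration of the sed steps above, not part of the original answer: the function name clean_in_place is hypothetical, and the replacement assumes the header reads "Best unit" as in the sample data (the question text writes "Best Unit", so both spellings are handled).

```python
import fileinput
import sys


def clean_in_place(paths):
    """Drop lines 1-3 and tidy the header on line 4 of each file, in place."""
    # With inplace=True, fileinput redirects sys.stdout into the file being read.
    with fileinput.input(files=paths, inplace=True) as stream:
        for line in stream:
            number = fileinput.filelineno()  # line number within the current file
            if number <= 3:
                continue  # same effect as sed's '1,3 d'
            if number == 4:
                line = line.replace("(dB)", "")
                line = line.replace("Best unit", "Best_unit")
                line = line.replace("Best Unit", "Best_Unit")
            sys.stdout.write(line)
```

Calling clean_in_place(["trc_emerald.txt"]) would rewrite that file line by line, so only one line is held in memory at a time.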

Answer 2

Just try it for each file. 100 MB per file is not that big, and as you can see, the code to simply try it doesn't take long to write.

with open('file.txt') as f:
  lines = f.readlines()
lines = lines[3:]  # drop the first three lines
lines[0] = lines[0].replace('Rx(dB)', 'Rx')  # spelling as in the sample header
lines[0] = lines[0].replace('Best Unit', 'Best_Unit')
with open('output.txt', 'w') as f:
  f.writelines(lines)  # lines already end in '\n', so no separator is added
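If memory ever does become a concern, a streaming variant of the same idea is sketched below. It is an assumption-laden illustration rather than the answer's own code: clean_file is a hypothetical helper, the "_db.txt" suffix is borrowed from the question's script, and the header spelling "Best unit" follows the sample data.

```python
import os


def clean_file(src):
    """Write a cleaned copy of src next to it and return the new path."""
    dst = os.path.splitext(src)[0] + "_db.txt"  # naming borrowed from the question
    with open(src) as fin, open(dst, "w") as fout:
        for number, line in enumerate(fin, start=1):
            if number <= 3:
                continue  # drop the three preamble lines
            if number == 4:
                line = line.replace("Rx(dB)", "Rx").replace("Best unit", "Best_unit")
            fout.write(line)
    return dst
```

Because every file has the same layout, the function can be called once per file name in a plain for loop.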

Answer 3

You can use file.readlines() with an additional argument to read only a few lines from the file. From the docs:

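f.readlines() returns a list containing all the lines of data in the file. If given an optional parameter sizehint, it reads that many bytes from the file and enough more to complete a line, and returns the lines from that. This is often used to allow efficient reading of a large file by lines, but without having to load the entire file in memory. Only complete lines will be returned.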
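A small sketch of what the size hint does, using an in-memory io.StringIO as a stand-in for a real file (the data here is made up for illustration). Note that the quoted text describes Python 2, where the hint may be rounded up to an internal buffer size; in Python 3 the parameter is called hint and the exact cut-off may differ slightly between implementations.

```python
import io

# Ten 10-character lines standing in for a large file (hypothetical data).
fake_file = io.StringIO("".join("line %04d\n" % i for i in range(10)))

# Read whole lines until roughly 25 characters have been consumed.
first = fake_file.readlines(25)

# A second call continues from where the first one stopped.
rest = fake_file.readlines()

print(len(first), "lines in the first batch,", len(rest), "lines afterwards")
```

The important point is that the file position advances past the lines already returned, so the remainder can be read or iterated separately.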

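Then, the most powerful way to manipulate generic strings is regular expressions. In Python, that means the re module and, for example, the re.sub() function.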
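As a quick illustration of re.sub(), here are the two substitutions applied to the header line from the question's sample data. The patterns are the ones used in this answer; the variable names are only for the example.

```python
import re

header = "Latitude    Longitude   Rx(dB)  Best unit\n"

# Remove a parenthesised group such as "(dB)".
header = re.sub(r'\([^\)].*?\)', '', header)

# Replace "Best" plus the whitespace character after it with "Best_".
header = re.sub(r'Best(\s.*?)', 'Best_', header)

print(header)
```

Raw strings (the r prefix) keep the backslashes in the patterns from being interpreted as Python string escapes.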

My suggestion, which should be adapted to your needs:

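import re

with open('somefile.txt') as f, open('someotherfile.txt', 'w') as newfile:
    # The size hint must cover at least the first four lines;
    # a hint of 100 may return fewer than four lines for the sample file.
    head = f.readlines(1000)
    line4 = re.sub(r'\([^\)].*?\)', '', head[3])  # 'Rx(dB)' -> 'Rx'
    line4 = re.sub(r'Best(\s.*?)', 'Best_', line4)  # 'Best unit' -> 'Best_unit'
    newfile.write(line4)
    newfile.writelines(head[4:])  # data lines that were read along with the header
    for line in f:  # stream the rest instead of loading it all at once
        newfile.write(line)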

Remove specific lines from a large text file in Python
