Delete lines containing tabs

Time: 2014-12-08 20:01:20

Tags: python bash awk sed text-files

How do I delete the lines that contain a tab?

My file looks like this:

0   absinth
Bohemian-style absinth
Bohemian-style or Czech-style absinth (also called anise-free absinthe, or just “absinth” without the “e”) is an ersatz version of the traditional spirit absinthe, though is more accurately described as a kind of wormwood bitters.
It is produced mainly in the Czech Republic, from which it gets its designations as “Bohemian” or “Czech,” although not all absinthe from the Czech Republic is Bohemian-style.

1   acidophilus milk
Sweet acidophilus milk is consumed by individuals who suffer from lactose intolerance or maldigestion, which occurs when enzymes (lactase) cannot break down lactose (milk sugar) in the intestine.
To aid digestion in those with lactose intolerance, milk with added bacterial cultures such as "Lactobacillus acidophilus" ("acidophilus milk") and bifidobacteria ("a/B milk") is available in some areas.
High Activity of Lactobacillus Acidophilus Milk

2   adobo
Adobo
Adobo (Spanish: marinade, sauce, or seasoning) is the immersion of raw food in a stock (or sauce) composed variously of paprika, oregano, salt, garlic, and vinegar to preserve and enhance its flavor.
In the Philippines, the name "adobo" was given by the Spanish colonists to an indigenous cooking method that also uses vinegar, which although superficially similar had developed independent of Spanish influence.

The desired output has the lines containing tabs removed, i.e.:

Bohemian-style absinth
Bohemian-style or Czech-style absinth (also called anise-free absinthe, or just “absinth” without the “e”) is an ersatz version of the traditional spirit absinthe, though is more accurately described as a kind of wormwood bitters.
It is produced mainly in the Czech Republic, from which it gets its designations as “Bohemian” or “Czech,” although not all absinthe from the Czech Republic is Bohemian-style.

Sweet acidophilus milk is consumed by individuals who suffer from lactose intolerance or maldigestion, which occurs when enzymes (lactase) cannot break down lactose (milk sugar) in the intestine.
To aid digestion in those with lactose intolerance, milk with added bacterial cultures such as "Lactobacillus acidophilus" ("acidophilus milk") and bifidobacteria ("a/B milk") is available in some areas.
High Activity of Lactobacillus Acidophilus Milk

Adobo
Adobo (Spanish: marinade, sauce, or seasoning) is the immersion of raw food in a stock (or sauce) composed variously of paprika, oregano, salt, garlic, and vinegar to preserve and enhance its flavor.
In the Philippines, the name "adobo" was given by the Spanish colonists to an indigenous cooking method that also uses vinegar, which although superficially similar had developed independent of Spanish influence.

I can do the following in Python to get that result:

with open('file.txt', 'r') as fin, open('file2.txt', 'w') as fout:
  for line in fin:
    if '\t' in line:
      continue
    else:
      fout.write(line)

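But I have millions of lines, and this is not very efficient. So I tried using cut to remove the second column and then removing the lines that contain a single character: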

$ cut -f1 WIKI_WN_food | awk 'length>1' | less

What is a more pythonic way to get the desired output?

Is there a more efficient way than the cut + awk pipeline shown above?

6 answers:

Answer 0 (score: 2)

Your code is fine. As an optimization, you could try testing only the beginning of the string:

if '\t' not in l[:5]: fout.write(l)

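where the length of the substring depends on the largest record number. It might make a difference for long strings that do not match, who knows...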
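A minimal sketch of the prefix idea above, written as a generator so that it can be tried on a few lines in memory. The function name `drop_tabbed` and the default `prefix_len=5` are assumptions made for illustration; the prefix has to be long enough to cover the longest record number plus the tab that follows it.

```python
def drop_tabbed(lines, prefix_len=5):
    """Yield only the lines with no tab in their first prefix_len characters.

    prefix_len is an illustrative assumption: it has to cover the longest
    record number plus the tab that follows it.
    """
    for line in lines:
        if '\t' not in line[:prefix_len]:
            yield line


# Made-up sample in the shape of the file from the question.
sample = [
    "0\tabsinth\n",
    "Bohemian-style absinth\n",
    "\n",
    "1\tacidophilus milk\n",
    "High Activity of Lactobacillus Acidophilus Milk\n",
]
kept = list(drop_tabbed(sample))
print(kept)
```

With real files this could be used as fout.writelines(drop_tabbed(fin)). A tab that appears later in a line is ignored by this check, so it only agrees with the full '\t' in line test when tabs occur nowhere except right after the record number.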

Also, you may want to test mawk, grep, or sed, as in
# Edit : the following won't work. it strips also blank lines
# mawk -F"\t" "NF==1"  original > stripped
grep -vF "$(printf '\t')" original > stripped   # a literal tab; grep -F treats "\t" as backslash + t
sed -e "/\t/d"       original > stripped

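to see whether it is faster than the Python solution.

Tests

On my system, using a file obtained by repeatedly copying your file (size 1,418,973,184 bytes), I get the following approximate times: grep 1.6s, sed 6.4s, python 4.6s. The Python run time does not depend on whether the whole string or only a substring is searched.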
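One way to check the last observation on a smaller scale is to time both variants in memory, leaving file I/O out of the picture. This is only a sketch: the synthetic lines and the function names `count_whole` and `count_prefix` are assumptions, and the timings will differ from machine to machine.

```python
import timeit

# Synthetic data (an assumption, not the original file): one header line with
# a tab after the record number, followed by three long tab-free lines.
lines = ["0\tabsinth\n"] + ["x" * 200 + "\n"] * 3
lines = lines * 10_000


def count_whole(lines):
    """Count lines kept when the whole line is searched for a tab."""
    return sum(1 for line in lines if '\t' not in line)


def count_prefix(lines, prefix_len=5):
    """Count lines kept when only the first prefix_len characters are searched."""
    return sum(1 for line in lines if '\t' not in line[:prefix_len])


# Take the best of a few repetitions to reduce noise from other processes.
t_whole = min(timeit.repeat(lambda: count_whole(lines), number=1, repeat=3))
t_prefix = min(timeit.repeat(lambda: count_prefix(lines), number=1, repeat=3))
print(f"whole line: {t_whole:.4f}s  prefix: {t_prefix:.4f}s")
```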

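Addendum

I tested Jidder's awk solution (given in a comment on the OP) using mawk, and my approximate time was 3.2s. So, for what it's worth... the winner is grep -vF

Test transcript

The running time varies by about 0.1s from one execution to the next; here I report a single run time for each command... where the results are close, no firm conclusion can be drawn.

On the other hand, the differences between the tools are much larger than the experimental error, so I think some conclusions can be drawn...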

% ls -l original 
-rw-r--r-- 1 boffi boffi 1418973184 Dec  8 21:33 original
% cat doit.py
from sys import stdout
with open('original', 'r') as fin:
  for line in fin:
    if '\t' in line: continue
    else: stdout.write(line)
% time wc -l original 
15731133 original

real    0m0.407s
user    0m0.184s
sys     0m0.220s
% time python doit.py | wc -l
12584034

real    0m5.334s
user    0m4.880s
sys     0m1.428s
% time grep -vF "       "  original | wc -l
12584035

real    0m1.527s
user    0m1.112s
sys     0m1.400s
% time grep -v "        "  original | wc -l
12584035

real    0m1.556s
user    0m1.120s
sys     0m1.436s
% time sed -e "/\t/d"  original | wc -l
12584034

real    0m6.481s
user    0m6.104s
sys     0m1.404s
% time mawk '!/\t/'  original | wc -l
12584035

real    0m3.059s
user    0m2.608s
sys     0m1.488s
% time gawk '!/\t/'  original | wc -l
12584035

real    0m9.148s
user    0m8.680s
sys     0m1.468s
% 

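My sample file has a truncated last line, hence the difference in line counts between python and sed on the one hand and all the other tools on the other.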
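The off-by-one can be reproduced in a few lines. wc -l counts newline characters, so a final line without a trailing newline is counted only if the tool adds the missing newline when printing it. The sketch below assumes that the Python script and sed write the last line exactly as it was read, while grep and awk terminate every output line with a newline, which is consistent with the counts above. The sample data is made up.

```python
# Made-up sample: the last line is truncated (no trailing newline).
data = "0\tabsinth\nBohemian-style absinth\nIt is produced mainly"

# keepends=True preserves each line's own terminator, if it has one.
kept = [line for line in data.splitlines(keepends=True) if '\t' not in line]

# Written back unchanged, as in doit.py: the last line still lacks a newline.
as_is = "".join(kept)

# Every record terminated on output, as assumed here for grep and awk.
terminated = "".join(line.rstrip("\n") + "\n" for line in kept)

# wc -l reports the number of newline characters.
print(as_is.count("\n"), terminated.count("\n"))   # prints: 1 2
```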

Answer 1 (score: 1)

You can do this with sed:
sed '/\t/d' 'my_file'

It looks for a "\t" and deletes the lines that contain it.

Answer 2 (score: 0)

grep -v "$(printf '\t')" file


Answer 3 (score: 0)

Try using grep with Perl-style regular expressions:

grep -vP "\t" file.in > file.out

Answer 4 (score: 0)

Try using filter to your advantage:

with open('file.txt', 'r') as fin, open('file2.txt', 'w') as fout:
    fout.write(''.join([line for line in filter(
             lambda l: r'\t' not in l, fin.readlines())]))
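One detail worth checking in the snippet above: in Python, r'\t' is a raw string made of two characters, a backslash and the letter t, while '\t' is a single tab character. A small check, using a made-up sample line and a hypothetical helper named `without_tabs`:

```python
line = "0\tabsinth\n"          # contains a real tab character

print(len('\t'), len(r'\t'))   # prints: 1 2
print('\t' in line)            # prints: True
print(r'\t' in line)           # prints: False


# A streaming variant of the filter that tests for a real tab; it takes any
# iterable of lines, so an open file object can be passed directly.
def without_tabs(lines):
    return filter(lambda l: '\t' not in l, lines)


print(list(without_tabs([line, "Adobo\n"])))   # prints: ['Adobo\n']
```

With files, fout.writelines(without_tabs(fin)) would also avoid reading the whole input into memory with readlines().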

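Test whether the condition r'\t' not in l works for your file. You may need to test for a run of spaces instead of \t, perhaps using a regular expression. I had to type a literal \t into my file.txt to make the code work. That is why I tried a substitution with a regular expression: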

import re

with open('file.txt', 'r') as fin, open('file2.txt', 'w') as fout:
    fout.write(re.sub(r'^\d+\s{2,}[^\n]+', '', fin.read(), count=0, flags=re.M))

Now I get just an empty line in place of each line you want to eliminate.

GOT IT: the regular expression needs the \n to work:

    fout.write(re.sub(r'^\d+\s{2,}[^\n]+\n', '', fin.read(), count=0, flags=re.M))
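The difference between the two patterns can be seen on a small in-memory sample. The sample text is an assumption and separates the number from the name with spaces, as displayed in the question. Note that \s{2,} needs at least two whitespace characters, so a line with a single tab after the number would not match this pattern.

```python
import re

text = (
    "0   absinth\n"
    "Bohemian-style absinth\n"
    "\n"
    "1   acidophilus milk\n"
    "Sweet acidophilus milk\n"
)

# Without the trailing \n the matched text is removed but its newline stays,
# leaving an empty line where each numbered line used to be.
no_newline = re.sub(r'^\d+\s{2,}[^\n]+', '', text, flags=re.M)

# Including \n in the pattern removes the whole line.
with_newline = re.sub(r'^\d+\s{2,}[^\n]+\n', '', text, flags=re.M)

print(repr(no_newline))
print(repr(with_newline))
```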

Answer 5 (score: -1)

You can try using tr

tr -d " \t" < tabbed-file.txt > sanitized-file.txt

man tr

tr - translate or delete characters
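Note that tr -d deletes individual characters, not lines. A rough Python analogue using str.translate (the sample text is made up) shows the effect of deleting every space and tab:

```python
# Translation table that deletes spaces and tabs, roughly what
# tr -d " \t" does to its input stream.
delete_blanks = str.maketrans('', '', ' \t')

line = "0\tabsinth\nBohemian-style absinth\n"
print(repr(line.translate(delete_blanks)))
# prints: '0absinth\nBohemian-styleabsinth\n'
```

The numbered line is still present and the spaces between words are gone, so this does not produce the output asked for in the question.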

-

You can also try

To remove all whitespace, including tabs, from the left up to the first word, issue:

echo "   This is a test" | sed -e 's/^[ \t]*//'
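For comparison, a Python sketch of the same leading-whitespace removal, using str.lstrip with an explicit set of characters. The sample string is an assumption. Like the sed command, it strips leading blanks but does not remove the lines that contain tabs.

```python
line = "   \tThis is a test"

# lstrip(' \t') removes any run of spaces and tabs from the left only.
print(line.lstrip(' \t'))   # prints: This is a test
```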