Question

I have a huge text file that follows this structure:

SET
TAG1
...
...
SET
...
SET
TAG2
...
...
SET
...
...

I want to extract the individual "substructure" for one specific TAG (say, TAG54), which would be

SET
TAG54
...
...
SET

For a given TAG_i, each substructure always contains:

First line: SET
Second line: TAG_i (TAG54 in this example)
Any number of lines
Last line: SET

I would like to know the best way to do this in bash or python, so that this substructure can be "extracted" for a given TAG.


Answer 1

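Here is a Python approach: pass an open file handle as the first argument and the tag number as the second. It returns a list of the relevant lines (including their newlines), or an empty list if the tag is not found in the file: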

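def lookfor(f, tagnum):
    tag = 'TAG%s\n' % tagnum
    # Skip ahead until the tag line is found.
    for line in f:
        if line == tag:
            break
    else:  # file finished, tag not found
        return []
    # The tag line is always preceded by a SET line, so re-create it.
    result = ['SET\n', tag]
    # Collect lines up to and including the closing SET.
    for line in f:
        result.append(line)
        if line == 'SET\n':
            break
    return result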

This should perform quite well. If you want other forms of arguments and/or results, it should of course not be hard to adapt accordingly.
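As one possible adaptation, here is a minimal sketch that takes the full tag string instead of a number and additionally checks that the line before the tag really is SET. The name extract_block and the sample data are made up for illustration; they are not from the question.

```python
import io


def extract_block(lines, tag):
    """Return the lines of the first SET / tag / ... / SET block, or []."""
    it = iter(lines)
    prev = None
    for line in it:
        # Only accept the tag when the previous line was exactly SET.
        if prev == "SET" and line.rstrip("\n") == tag:
            break
        prev = line.rstrip("\n")
    else:
        # Input exhausted without finding the tag.
        return []
    block = ["SET\n", line]
    # Keep reading from the same iterator up to and including the closing SET.
    for line in it:
        block.append(line)
        if line.rstrip("\n") == "SET":
            break
    return block


# Hypothetical sample input mirroring the structure in the question.
sample = "SET\nTAG1\na\nb\nSET\nc\nSET\nTAG54\nx\ny\nSET\nz\n"
result = extract_block(io.StringIO(sample), "TAG54")
print("".join(result), end="")
```

Because it only iterates over the input once and never holds more than the matching block, the same function should also work on a real file opened with open(), even a very large one.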

Answer 2

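If your system's grep supports -P for Perl regexps (GNU grep), also pass -z so that the whole file is treated as a single record (grep otherwise matches one line at a time, so \n can never match) and -o to print only the matched part. Note that -z makes grep hold the whole file in memory, which may matter for a huge file:

grep -Pzo 'SET\nTAG54\n(?:.*\n)*?SET\n' file.txt | tr -d '\0'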
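The same multi-line regex idea can be sketched in Python with the standard re module. This is only an illustration: the function name extract_with_regex is hypothetical, and it assumes the whole file fits in memory as one string.

```python
import re


def extract_with_regex(text, tag):
    """Return the first SET / tag / ... / SET block as a string, or ''."""
    pattern = re.compile(
        # ^ anchors at a line start because of re.MULTILINE; the lazy group
        # consumes whole lines until the first line that is exactly SET.
        r"^SET\n" + re.escape(tag) + r"\n(?:.*\n)*?SET\n",
        re.MULTILINE,
    )
    match = pattern.search(text)
    return match.group(0) if match else ""


# Hypothetical sample input mirroring the structure in the question.
sample = "SET\nTAG1\na\nb\nSET\nc\nSET\nTAG54\nx\ny\nSET\nz\n"
print(extract_with_regex(sample, "TAG54"), end="")
```

For a real file, the text would come from something like open(path).read(), which is the main cost of this approach on very large inputs.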

Answer 3

csplit -f tags input.txt '%^TAG54$%-1' '/^SET$/+1' '%.*%' '{*}'

Answer 4

GAWK:

BEGIN {
  state=0
}

state==0 && $0=="TAG54" {
  print "SET"
  state=1
}

state==1 {
  print
}

state==1 && $0=="SET" {
  exit
}

Answer 5

$ awk -vRS="SET" '/TAG54/{print RT$0RT}' file
SET
TAG54
...
...
SET

If you are doing this from a shell script, use awk's -v option to pass in the shell variable. For example

#!/bin/bash
read -r -p "what's your tag? " tag
awk -v RS="SET" -v tag="$tag" '$0~tag{print RT$0RT}' file
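For comparison, the record-separator idea can be sketched in Python by splitting the text on SET lines. This is a rough equivalent under stated assumptions rather than a translation of the awk one-liner: the name extract_by_records is hypothetical, the whole file is assumed to fit in memory, and the split string "SET\n" would also cut a line that merely ends in SET.

```python
def extract_by_records(text, tag):
    """Return the first block whose record starts with the tag line, or ''."""
    pieces = text.split("SET\n")
    # pieces[0] precedes the first SET and pieces[-1] follows the last one,
    # so neither is enclosed by SET on both sides; skip them.
    for record in pieces[1:-1]:
        # Require the tag to be the whole first line, so TAG54 does not
        # also match TAG540.
        if record.startswith(tag + "\n"):
            return "SET\n" + record + "SET\n"
    return ""


# Hypothetical sample input mirroring the structure in the question.
sample = "SET\nTAG1\na\nb\nSET\nc\nSET\nTAG54\nx\ny\nSET\nz\n"
print(extract_by_records(sample, "TAG54"), end="")
```

Unlike the pattern match in the awk version, the startswith check here compares the full tag line, which is a deliberate difference in this sketch.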

Extracting a substructure from a text file using bash or python
