Question

I have a file like this:

This \word{is} some text.
This is some \word{more text}.
\word{This} is \word{yet} some more \word{text}.

I need to create a list of all the text that appears between \word{ and the matching closing brace }, for example:

is
more text
This
yet
text

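The opening and closing braces always appear on the same line; they never span multiple lines.
There are other curly braces in the document, but there are no braces inside \word{}.

How can I print a list of all the text that appears inside \word{}?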

Answer 1

It looks like you are working with a TeX file... so why not use TeX itself? Then you can be sure there are no problems or side effects, e.g.,

\word {there's a space between \verb=\word= and the curly bracket}

This still works! It also works with multiline content:

\word{this is
    a multiline stuff \emph{and you can even add more groupings in it,}
    it'll still work fine!}

In your (La)TeX preamble, just add:

\newwrite\file
\immediate\openout\file=output.txt

\def\word#1{\immediate\write\file{#1}}

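If you are using LaTeX rather than plain TeX, use \newcommand.

You can also put \immediate\write\file{#1} inside the definition of the \word macro. If you don't have access to the \word macro (for example, it is in a class or style file), you can: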

\word

Hope this helps!

Answer 2

grep with PCRE support will do the job:

grep -Po '(?<=\\word{)[^}]*(?=})' file

Live demo: http://ideone.com/uzEzBF

Answer 3

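A pure bash solution, without calling any external utilities:

while IFS= read -r x; do
  # Keep matching until no \word{...} is left on the line
  while [[ $x =~ \\word\{([^}]+)\} ]]; do
    printf '%s\n' "${BASH_REMATCH[1]}"
    # Drop everything up to and including the match just printed
    x=${x#*"${BASH_REMATCH[0]}"}
  done
done <infile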

Input file:

$ cat infile
This \word{is} some text.
{This \word{is}}some text.
This is some \word{more text}.
\word{This} is \word{yet} some more \word{text}.

Output:

is
is
more text
This
yet
text

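The trick is the -r option of the bash builtin read: with it, \ is not treated as an escape character when the line is read. The loop then repeats for as long as the pattern \word{...} is found in the string; each pass prints the inner match and strips the matched part from the input string.

For small files (1-2 MB) I would use this version, because it uses very few resources. For large files, though, I recommend anubhava's perl-regex grep solution, because it reads the file more efficiently!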
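To see what -r changes, here is a small sketch that reads the same line with and without the option. The sample line comes from the question; the variable names plain and raw are only illustrative.

```shell
# Illustrative sample line; the variable names are hypothetical.
line='This \word{is} some text.'

# Without -r, read treats the backslash as an escape and drops it.
IFS= read plain <<EOF
$line
EOF

# With -r, the backslash survives, so \word{ can still be matched later.
IFS= read -r raw <<EOF
$line
EOF

printf 'without -r: %s\n' "$plain"
printf 'with -r:    %s\n' "$raw"
```

Without -r the line comes back as "This word{is} some text.", so a pattern that looks for a literal \word{ would never match.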

Answer 4

Since not all versions of grep have PCRE, here is one that uses only extended regular expressions.

grep -Eo '\\word\{[^}]+\}' file_name | sed -e 's/\\word{//' -e 's/}//'
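When a line holds several \word{...} groups, the choice between .+ and [^}]+ matters. The following sketch, using one line from the question and assuming a grep that supports -o, compares the two. The variable names are made up for the illustration.

```shell
# Sample line with three groups (taken from the question).
line='\word{This} is \word{yet} some more \word{text}.'

# Greedy .+ runs to the last closing brace and yields one long match.
greedy=$(printf '%s\n' "$line" | grep -Eo '\\word\{.+\}')

# [^}]+ stops at the first closing brace, giving one match per group.
per_group=$(printf '%s\n' "$line" |
  grep -Eo '\\word\{[^}]+\}' |
  sed -e 's/\\word{//' -e 's/}//')

printf '%s\n' "$greedy"
printf '%s\n' "$per_group"
```

The greedy form returns the whole span from the first \word{ to the last }, while the negated class returns This, yet and text on separate lines.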

Answer 5

$ cat testfile
This \word{is} some text.
This is some \word{more text}.
\word{This} is \word{yet} some more \word{text}.

$ awk '$0 ~ /\\word{[^}]*}/ { nelts = split($0, arr, /\\word{/); for (i=1; i <= nelts; i++) if (arr[i] ~ /^[^}]*}/) print substr(arr[i], 1, index(arr[i], "}") - 1); }' testfile
is
more text
This
yet
text

If there happens to be a \word{\word{STRING}}, STRING is printed. In other words, it works recursively. Sorry if that is not what you wanted.
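If the one-liner is hard to read, a loop over awk's match() is one possible alternative. This is only a sketch, not the code above: the helper name extract_words is hypothetical, and unlike the split() version it takes the text up to the first closing brace rather than handling nested \word{ groups.

```shell
# extract_words is a hypothetical helper name, not part of the answer above.
extract_words() {
  awk '{
    line = $0
    # match() sets RSTART and RLENGTH for the leftmost \word{...} on the line.
    while (match(line, /\\word[{][^}]*[}]/)) {
      # Skip the 6 characters of "\word{" and drop the closing brace.
      print substr(line, RSTART + 6, RLENGTH - 7)
      # Continue scanning after the group just printed.
      line = substr(line, RSTART + RLENGTH)
    }
  }' "$@"
}
```

Called as extract_words testfile, it should print one captured string per line; with no arguments it reads standard input.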

Answer 6

A mix of grep and sed:

egrep -o '\\word\{[^\{\}]+\}' | sed 's/\\word{//;s/}//'

For fun, I also made a pure bash version:

while read -r l
do
    n=${#l}
    ll="${l#*\\word{}"
    while [ $n -ne ${#ll} ]
    do
        echo "${ll%%\}*}"
        n=${#ll}
        ll="${ll#*\\word{}"
    done
done

Not very clean, but it works for your example.
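The bash loop relies on two parameter expansions. A minimal sketch of a single pass, with illustrative variable names and a line from the question, may make them easier to follow:

```shell
# Illustrative line from the question; variable names are hypothetical.
l='\word{This} is \word{yet} some more \word{text}.'

# The # operator removes the shortest prefix matching *\word{ .
rest="${l#*\\word{}"
# The %% operator removes the longest suffix that starts at a closing brace.
first="${rest%%\}*}"

printf '%s\n' "$rest"
printf '%s\n' "$first"
```

After one pass, rest holds everything after the first \word{ and first holds This. The loop repeats the prefix removal until the string stops getting shorter.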

Answer 7

Using GNU sed:

sed -nr ':b;s/(\\word\{[^}]+\})/\1\n/;s/.*\\word\{([^}]+)\}\n/\1\n/;T;P;D;tb' file

$ cat file
This \word{is} some text.
This is some \word{more text}.
\word{This} is \word{yet} some more \word{text}.
{\word{This} is \word{yet} {some} more \word{text}.}

$ sed -nr ':b;s/(\\word\{[^}]+\})/\1\n/;s/.*\\word\{([^}]+)\}\n/\1\n/;T;P;D;tb' file
is
more text
This
yet
text
This
yet
text

Answer 8

awk was invented for text processing:

$ awk 'sub(/.*\\word{/,"")' RS='}' file
is
more text
This
yet
text
is

$ cat file
This \word{is} some text.
This is some \word{more text}.
\word{This} is \word{yet} some more \word{text}.
{ This \word{is} some text }

Answer 9

perl can help too:

perl -nlE 'say "$_" for (m/\\word\{(.*?)\}/g);'  < tex.txt

Input:

This{ \word{is}} some text.
This is some \word{more text}.
This is {some \word{aaa text}} This is {some \word{bbb text} This is some \word{ccc text}} This is some {\word{ddd text}}
{\word{This} is \word{yet} some more \word{text}.}

Prints:

is
more text
aaa text
bbb text
ccc text
ddd text
This
yet
text

Answer 10

Using sed:

sed 's/.*\\word{\([^}]*\)}.*/\1/g' input.txt

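The expression above removes everything except the content inside the braces. Because .* is greedy, it keeps only the last \word{...} on each line, and lines without a match are printed unchanged. If you need to match across multiple lines in the future, awk may be easier: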

awk -F "\\word{" 'BEGIN { RS = "}" } { print $2 }' input.txt

This sets \word{ as the field separator and } as the record separator, which means $2 refers to the content inside the braces.
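Here is a small sketch of the same record and field idea wrapped in a function. The name words_via_rs is hypothetical, the field separator is written as an explicit regex, and the NF > 1 guard is an addition that skips records containing no \word{ at all.

```shell
# words_via_rs is a hypothetical helper name used only for this sketch.
words_via_rs() {
  awk 'BEGIN {
         RS = "}"            # every closing brace ends a record
         FS = "\\\\word[{]"  # regex for a literal \word{
       }
       # NF > 1 is an added guard: skip records with no \word{ in them.
       NF > 1 { print $2 }' "$@"
}

printf '%s\n' 'This \word{is} some text.' 'This is some \word{more text}.' | words_via_rs
```

Because records end at } rather than at newlines, a group whose text wraps onto the next line would still arrive in a single record.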
