Question

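I have a gawk script that has accumulated a bunch of HTML in a variable, and it now needs to pass that HTML to lynx via a system command.

(Feel free to tell me AWK is a terrible solution for this. My first attempt, a `while read LINE;` loop, was abysmally slow, so this is take 2.)

I tried this in awk: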

    cmd = sprintf( "bash -c \'lynx -dump -force_html -stdin <<< \"%s\"\'", html )
    system ( cmd )

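Bad idea. Simple test cases work, but with real-world HTML, special-character and string-termination problems abound, and escaping within escaping quickly becomes mind-numbingly complex.

lynx copes fine with anything I throw at it on stdin. I just can't find a way to feed its stdin from awk without piping through the command line, which seems like a clunky solution.

Edit (adding details about my end goal), in case awk isn't a good approach:

What I want is to parse HTML out of a large text file that contains delimiters between the HTML blocks. I need to pass each HTML block to lynx for formatting and dump the result into a new large text file.

Sample input (a dump from another system):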

    **********URL: http://some/url
    <html>
    <head><title>Any 'ol HTML document</title</head>
    <body>
    <p>With pretty much any character you can imagine at some point</p>
    <p>I'm using lynx to strip off the HTML and give me a nice format</p>
    </body>
    </html>
    **********URL: http://another/url
    <html><head><title>My input file provides a few 100,000 such html documents</title></head>
    <body/></html>

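Each HTML document should be fed through lynx -dump. Lynx can read HTML from a file (so a named pipe or a regular file is an option) or from stdin (with the -stdin option).

My output would be: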

    **********URL: http://some/url
      Any 'ol HTML document

      With pretty much any character you can imagine at some point
      I'm using lynx to strip off the HTML and give me a nice format
    **********URL: http://another/url
      My input file provides a few 100,000 such html documents

Answer 1

Try |& in gawk. It lets you send gawk's output to another command's stdin as a coprocess.

Answer 2

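To add to n0741337's answer, here is an example using gawk coprocesses that I put together after reading his answer. It takes "aline" from stdin, passes it to a cat coprocess, captures the output of the cat coprocess, and prints it: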

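    printf "aline" | gawk '
      BEGIN { cmd = "cat" }
      {
        print $0 |& cmd                      # send the record to the coprocess stdin
        close(cmd, "to")                     # close the write end so cat sees EOF
        while ((cmd |& getline line) > 0) {  # read back everything cat wrote
          print "got", line
        }
        close(cmd)                           # shut down the coprocess
      }'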

result: got aline

The gawk manual discusses this feature in more depth: http://www.gnu.org/software/gawk/manual/html_node/Two_002dway-I_002fO.html#Two_002dway-I_002fO

Piping unfiltered text to the stdin of an awk system command
