Question

I apologize if this has already been answered, but I'm too new to scripting to tell whether it has.

I'd like to pass the HTML source of a webpage to a script so that it can modify the page or strip its HTML tags. An example of what I've tried:

cat webpage.htm | ./dosomething

The code for dosomething is as follows:

#!/bin/bash

export LC_ALL='C'

echo "testing"
echo $1 #this is the part where I'd like to be able to access the html that I've passed into the script
echo "still testing"
sed 's/<[^>]*>//g' < $1 #trying to strip the html tags of the webpage that I've passed in

When cat didn't work, I tried:

./dosomething < webpage.htm

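My script doesn't work with this either. The script needs to read the HTML from standard input and modify it before writing the modified HTML to standard output. I can't pass the webpage as an actual argument: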

./dosomething webpage.htm

Answer 1

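If you want to strip the HTML tags from a webpage, the command-line browsers have already solved this problem. Take a look at the lynx -dump option: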

lynx -dump http://www.subir.com/lynx.html

elinks has a similar option; I'm not so sure about w3m.
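The example above passes a URL, while the question needs the HTML to arrive on standard input. As a rough sketch, assuming lynx is installed and that its -stdin and -force_html options behave as its manual describes, the same dump can be produced from piped input. The helper name html_to_text and the sample input are hypothetical, not from the original posts.

```shell
#!/bin/bash
# Sketch only: html_to_text is a hypothetical helper name.
# Assumes lynx is installed; the demo is skipped if it is not.

html_to_text() {
    # -stdin      read the document from standard input instead of a URL
    # -force_html treat the input as HTML even though it has no file name
    # -dump       print the rendered text to standard output and exit
    # -nolist     omit the numbered list of links at the end
    lynx -stdin -force_html -dump -nolist
}

if command -v lynx >/dev/null 2>&1; then
    # Sample input standing in for "cat webpage.htm | ./dosomething".
    printf '<p>Hello <b>world</b></p>\n' | html_to_text
else
    echo "lynx not found; skipping demo" >&2
fi
```

Inside dosomething, calling html_to_text with no redirection would let it consume whatever was piped into the script.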

Answer 2

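Since the source is already supplied to the script on standard input, the commands inside the script inherit that input, so you must not redirect input there. Remove the < $1.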
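A minimal sketch of what the corrected dosomething might look like, assuming the tag-stripping sed expression from the question is kept unchanged. The function name strip_tags and the sample input are illustrative and not part of the original script.

```shell
#!/bin/bash
# Sketch only: strip_tags is a hypothetical helper name.

export LC_ALL='C'

strip_tags() {
    # No "< $1" here: sed inherits the standard input of its caller,
    # so it reads whatever was piped or redirected into the script.
    sed 's/<[^>]*>//g'
}

# Sample input standing in for "cat webpage.htm | ./dosomething".
printf '<p>Hello <b>world</b></p>\n' | strip_tags
```

In the real script the last line would simply be strip_tags, which lets both cat webpage.htm | ./dosomething and ./dosomething < webpage.htm work. Because sed processes one line at a time, a tag that is split across two lines is not matched by this expression.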

Now, good luck with your brave undertaking of processing HTML in bash.

Passing HTML source to a bash script and manipulating it
