
Question

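I am trying to grab the contents of a table from a web page. I need only the contents, not tags such as <tr></tr>. I do not even need the "tr" or "td" text itself. For example: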

<td> I want only this </td>
<tr> and also this </tr>
<TABLE> only texts/numbers in between tags and not the tags. </TABLE>

I would also like to write the output, keyed by the first column, to a new CSV file like this:

COLUMN1,INFO1,INFO2,INFO3
column2,INFO1,INFO2,INFO3

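I tried deleting the <tr> and <td> patterns, but when I grab the table there are other tags as well, such as <color>, <span>, and so on. So what I want is to delete all tags; in short, everything between < and >.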

Answer 1

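sed 's/<[^>]\+>//g' will strip out all tags, but you may want to replace them with a space so that text from adjacent tags does not run together: <td>one</td><td>two</td> becomes onetwo. So you could use sed 's/<[^>]\+>/ /g' instead, which outputs one two (strictly speaking " one  two ", with the extra spaces left where the tags were).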
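To make the difference concrete, here is a minimal sketch of both variants. The function names strip_tags and tags_to_spaces are made up for illustration, and the pattern is written with -E and + rather than \+ on the assumption that it should run under both GNU and BSD sed. The second function also squeezes and trims the leftover spaces, which the one-liner above does not do.

```shell
# Hypothetical helper: delete every tag outright, so adjacent cells run together.
strip_tags() {
  sed -E 's/<[^>]+>//g'
}

# Hypothetical helper: replace each tag with a space, squeeze runs of spaces,
# then trim one leading and one trailing space from each line.
tags_to_spaces() {
  sed -E 's/<[^>]+>/ /g' | tr -s ' ' | sed -E 's/^ //; s/ $//'
}

printf '%s\n' '<td>one</td><td>two</td>' | strip_tags      # prints: onetwo
printf '%s\n' '<td>one</td><td>two</td>' | tags_to_spaces  # prints: one two
```

Both functions read standard input, so they can sit at the end of any pipeline that produces HTML.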

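That said, unless you only need the raw text (and it sounds like you are trying to transform the data after stripping the tags), a scripting language such as Perl is probably a better fit for this kind of work.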
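As a rough illustration of that kind of transformation, the sketch below turns simple table markup into CSV rows. It uses awk rather than Perl, purely to stay in the shell, and table_to_csv is a hypothetical name. It assumes well-formed markup, no nested tables, and no commas or quotes inside cells; anything messier needs a real parser.

```shell
# Hypothetical helper: convert simple <tr>/<td> markup on stdin into CSV lines.
# Assumes no nested tables and no commas or quotes inside cells.
table_to_csv() {
  tr -d '\n' | awk '{
    gsub(/>[ \t]+</, "><")                        # drop whitespace that sits only between tags
    gsub(/<\/[Tt][Rr]>/, "\n")                    # the end of a row becomes a line break
    gsub(/<\/[Tt][DdHh]><[Tt][DdHh][^>]*>/, ",")  # the boundary between two cells becomes a comma
    gsub(/<[^>]+>/, "")                           # remove every remaining tag
    print
  }'
}

printf '%s\n' \
  '<table>' \
  '<tr><td>COLUMN1</td><td>INFO1</td><td>INFO2</td></tr>' \
  '<tr><td>column2</td><td>INFO1</td><td>INFO2</td></tr>' \
  '</table>' | table_to_csv
```

With a saved page (say, a local file called page.html, a placeholder name), the same function could be used as table_to_csv < page.html > table.csv.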

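As mu is too short mentioned, scraping HTML with regular expressions can be a bit dicey; using something that actually parses the HTML for you would be the best approach. PHP's DOM API is quite good for this kind of thing.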

Answer 2


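The regex behaves slightly differently in the Mac terminal (probably because the \+ form is a GNU extension, so this version uses * instead). I was able to do this on my Mac with the following example: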

$ curl google.com | sed 's/<[^>]*>//g'
% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   219  100   219    0     0    385      0 --:--:-- --:--:-- --:--:--   385

301 Moved
301 Moved
The document has moved
here.

$ bash --version
GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin14)
Copyright (C) 2007 Free Software Foundation, Inc.

Edit:

To clarify, the original text looks like this:

$ curl google.com
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>

You can also use the -s option to get rid of the annoying curl progress output:

$ curl -s google.com | sed 's/<[^>]*>//g' 

301 Moved
301 Moved
The document has moved
here.

$

Remove HTML tags with sed or similar
