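I am trying to extract all the station IDs from a txt file. I converted this file from xml to .txt.
I tried grep and sed.
This is a sample of the text I want to extract from: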
<tr bgcolor="#F2F9FF"><td headers="Station Name"><a href="display.php?stid=KMLJ">Milledgeville, Baldwin County Airport</a> (KMLJ)</td>
<td headers="rss"><div align="center"><a href="KMLJ.rss"><img src="/images/rss.jpg" alt="RSS Format" width="36" height="14" border="0"></a></div></td>
<td headers="xml"><div align="center"><a href="KMLJ.xml"><img src="/images/xml.gif" alt="XML Format" width="36" height="14" border="0"></a></div></td>
</tr>
<tr><td headers="Station Name"><a href="display.php?stid=K2J5">Millen Airport</a> (K2J5)</td>
<td headers="rss"><div align="center"><a href="K2J5.rss"><img src="/images/rss.jpg" alt="RSS Format" width="36" height="14" border="0"></a></div></td>
<td headers="xml"><div align="center"><a href="K2J5.xml"><img src="/images/xml.gif" alt="XML Format" width="36" height="14" border="0"></a></div></td>
</tr>
<tr bgcolor="#F2F9FF"><td headers="Station Name"><a href="display.php?stid=KD73">Monroe-Walton County Airport</a> (KD73)</td>
<td headers="rss"><div align="center"><a href="KD73.rss"><img src="/images/rss.jpg" alt="RSS Format" width="36" height="14" border="0"></a></div></td>
<td headers="xml"><div align="center"><a href="KD73.xml"><img src="/images/xml.gif" alt="XML Format" width="36" height="14" border="0"></a></div></td>
</tr>
I want to export only the required strings, i.e. (KMLJ) (K2J5) (KD73), to a csv or text file.
Answer 0 (score: 1)
You can use
grep -o '([Kk][^()]*)' stations.txt
Or, to get the values without the parentheses:
grep -Po '\(\K[Kk][^()]+' stations.txt # GNU grep required
# Or, just pipe a sed to remove the initial (:
grep -o '([Kk][^()]*' stations.txt | sed 's/^(//'
Or, if there is only one value per line, use sed alone:
sed -n 's/.*(\([kK][^()]*\).*/\1/p' stations.txt
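The question asks for the IDs to end up in a csv or text file, which the commands above do not show. A minimal sketch follows, assuming the input is called stations.txt as above; the output names station_ids.txt and station_ids.csv, the temporary working directory, and the shortened sample rows are hypothetical and only there for illustration.

```shell
#!/bin/sh
# Work in a throwaway directory so the sketch does not touch real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample input, shortened from the rows in the question.
cat > stations.txt <<'EOF'
<tr><td headers="Station Name"><a href="display.php?stid=KMLJ">Milledgeville, Baldwin County Airport</a> (KMLJ)</td>
<td headers="rss"><div align="center"><a href="KMLJ.rss">rss</a></div></td>
<tr><td headers="Station Name"><a href="display.php?stid=K2J5">Millen Airport</a> (K2J5)</td>
EOF

# Match "(K..." up to (not including) the closing parenthesis,
# strip the leading "(", and write one ID per line to a text file.
grep -o '([Kk][^()]*' stations.txt | sed 's/^(//' > station_ids.txt

# Join the lines with commas to get a single CSV row.
paste -s -d, station_ids.txt > station_ids.csv
```

With this sample, station_ids.txt holds one ID per line and station_ids.csv holds the same IDs on a single comma-separated line.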
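The -o option makes grep output only the matched text.
([Kk][^()]*) is a POSIX BRE pattern that matches:
( - a literal ( character
[Kk] - k or K
[^()]* - a negated bracket expression matching any character other than ( and ), zero or more times
) - a literal ) character.
Answer 1 (score: 0)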
Do you mean something like this?
sed -n '/Station Name/ {s/.*stid=\([^"]*\)">\([^<]*\)<.*/\1 \2/;p}' file.txt
Explanation
sed -n # use sed with no default output
'/Station Name/ # use only lines with Station Name
{ # start block
s # substitute
/ # separator
.*stid=\([^"]*\) # extract Station ID and save it in arg1 (\1)
"> # ignore this pattern
\([^<]*\) # extract Station Name and save it in arg2 (\2)
<.* # ignore rest of line
/\1 \2/;p # print arg1 and arg2
} # end of block
' file.txt # read from this file
Output for the test data
KMLJ Milledgeville, Baldwin County Airport
K2J5 Millen Airport
KD73 Monroe-Walton County Airport
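Since the question mentions csv, the same substitution can be adapted to emit a comma-separated ID and name. This is only a sketch: the output name stations.csv, the temporary working directory, and the shortened sample rows are hypothetical. The name is wrapped in double quotes because it can contain a comma, and the sketch assumes names never contain a double quote themselves.

```shell
#!/bin/sh
# Work in a throwaway directory so the sketch does not touch real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample input, shortened from the rows in the question.
cat > file.txt <<'EOF'
<tr><td headers="Station Name"><a href="display.php?stid=KMLJ">Milledgeville, Baldwin County Airport</a> (KMLJ)</td>
<td headers="rss"><div align="center"><a href="KMLJ.rss">rss</a></div></td>
<tr><td headers="Station Name"><a href="display.php?stid=K2J5">Millen Airport</a> (K2J5)</td>
EOF

# On lines containing "Station Name", capture the ID after stid= (\1)
# and the link text (\2), then print them as: ID,"Name".
# The p flag on s prints a line only when the substitution succeeded.
sed -n '/Station Name/ s/.*stid=\([^"]*\)">\([^<]*\)<.*/\1,"\2"/p' file.txt > stations.csv

cat stations.csv
```

For the two sample rows this writes KMLJ,"Milledgeville, Baldwin County Airport" and K2J5,"Millen Airport", one per line.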