How do I strip HTML tags from an RSS feed and save the result as CSV with a shell script?

Date: 2015-01-29 13:42:08

Tags: html bash csv grep rss

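Here is my problem: I am trying to parse an XML feed and extract two fields (title and link), and that part works fine. How can I remove all the HTML tags and save the result in CSV format, for example: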

title,link
title,link
title,link

#!/bin/sh
url="http://www.buzzfeed.com/usnews.xml"
curl --silent "$url" | grep -E '(title>|link>)' >> output

1 answer:

Answer 0 (score: 2)

Parse XML with an XML parser. I assume you want the title and link of each feed item, not of the feed itself.

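curl --silent "$url" |
xmlstarlet sel -t -m '/rss/channel/item' -v 'title' -n -v 'link' -n |
awk '{
    title=$0                   # first line of each pair is the title
    gsub(/"/, "&&", title)     # double any embedded quotes (CSV escaping)
    getline                    # read the next line (the link) into $0
    printf "\"%s\",\"%s\"\n", title, $0
}'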

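The xmlstarlet command parses the feed and, for each /rss/channel/item, outputs the title value and the link value on separate lines. awk then picks up the stream and massages it into CSV.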
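
To see the awk step in isolation, here is a minimal sketch that feeds it a couple of made-up title/link pairs instead of live xmlstarlet output. The function name pairs_to_csv and the sample data are assumptions for illustration only. Unlike the script above, this variant also escapes quotes in the link and tolerates a missing final link line.

```shell
#!/bin/sh
# pairs_to_csv (hypothetical helper): reads alternating title/link lines
# on stdin and writes one quoted CSV row per pair.
pairs_to_csv() {
    awk '{
        title = $0
        gsub(/"/, "&&", title)               # CSV escaping: double embedded quotes
        if ((getline link) <= 0) link = ""   # no link line left: emit an empty field
        gsub(/"/, "&&", link)
        printf "\"%s\",\"%s\"\n", title, link
    }'
}

# Made-up sample standing in for the xmlstarlet output.
printf '%s\n' \
    'He said "hi"' 'http://example.com/a' \
    'Plain title'  'http://example.com/b' |
pairs_to_csv
# Prints:
# "He said ""hi""","http://example.com/a"
# "Plain title","http://example.com/b"
```

Quoting every field means that titles containing commas do not need special handling.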

Just for fun, here is a sed version of that awk (the \| alternation is a GNU sed extension):

sed -n 's/"/&&/g;s/^\|$/"/g;h;n;s/"/&&/g;s/^\|$/"/g;x;G;s/\n/,/;p'

sed -n '         #  do not automatically print
                 #  current line is the title
    s/"/&&/g     #  double up any double quotes (CSV quote escaping)
    s/^\|$/"/g   #  add leading and trailing double quotes
    h            #  store current pattern space (title) into hold space
    n            #  read the next line (the link) from input
    s/"/&&/g     #  double up any double quotes (CSV quote escaping)
    s/^\|$/"/g   #  add leading and trailing double quotes
    x            #  exchange pattern space (link) and hold space (title)
    G            #  append a newline to title and then append link
    s/\n/,/      #  replace the newline with a comma
    p            #  and print it
'
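
If GNU sed is not available, the same idea can probably be written without the \| alternation by anchoring the leading and trailing quotes in two separate substitutions. The following is a sketch under that assumption; the function name pairs_to_csv_sed and the sample input are hypothetical.

```shell
#!/bin/sh
# pairs_to_csv_sed (hypothetical helper): same title/link pairing as the
# sed script above, but each command is a separate -e expression and the
# quotes are added with s/^/"/ and s/$/"/ instead of s/^\|$/"/g.
pairs_to_csv_sed() {
    sed -n \
        -e 's/"/&&/g' -e 's/^/"/' -e 's/$/"/' \
        -e 'h' -e 'n' \
        -e 's/"/&&/g' -e 's/^/"/' -e 's/$/"/' \
        -e 'x' -e 'G' -e 's/\n/,/' -e 'p'
}

# Made-up sample input: one title line followed by one link line.
printf '%s\n' 'He said "hi"' 'http://example.com/a' | pairs_to_csv_sed
# Prints:
# "He said ""hi""","http://example.com/a"
```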