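Here is my problem: I am trying to parse an XML feed and extract two fields (title and link); this part works fine. How can I remove all the HTML tags and save the result in CSV format, for example:
title,link
title,link
title,link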
#!/bin/sh
url="http://www.buzzfeed.com/usnews.xml"
curl --silent "$url" | grep -E '(title>|link>)' >> output
Answer 0 (score: 2)
Parse XML with an XML parser. I am assuming you want the titles and links of the feed items, not those of the feed itself.
curl --silent "$url" |
xmlstarlet sel -t -m '/rss/channel/item' -v 'title' -n -v 'link' -n |
awk '{
title=$0
gsub(/"/, "&&", title)
getline
printf "\"%s\",\"%s\"\n", title, $0
}'
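The xmlstarlet command parses the feed and, for each /rss/channel/item, outputs the title value and the link value on separate lines. awk then picks up that stream and massages it into CSV: it reads the title, doubles any embedded double quotes, reads the next line as the link, and prints both fields wrapped in double quotes.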
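To try the awk step without fetching the feed, it can be wrapped in a small function and fed made-up lines. This is only a sketch: the function name pairs_to_csv and the sample titles and links are hypothetical, and unlike the pipeline above it also escapes quotes in the link and tolerates a missing final link line.

```shell
#!/bin/sh
# pairs_to_csv: hypothetical helper name, not part of the original answer.
# Reads alternating title/link lines on stdin, writes one quoted CSV row per pair.
pairs_to_csv() {
    awk '{
        title = $0
        gsub(/"/, "\"\"", title)            # CSV escaping: double each embedded quote
        if ((getline link) <= 0) link = ""  # no link line left: emit an empty field
        gsub(/"/, "\"\"", link)
        printf "\"%s\",\"%s\"\n", title, link
    }'
}

# Made-up sample input standing in for the xmlstarlet output.
printf '%s\n' \
    'He said "hi"' 'http://example.com/a' \
    'Plain title'  'http://example.com/b' | pairs_to_csv
```

With the real feed, the same function would simply take the place of the awk stage after xmlstarlet.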
Just for fun, here is a sed version of that awk script:
sed -n 's/"/&&/g;s/^\|$/"/g;h;n;s/"/&&/g;s/^\|$/"/g;x;G;s/\n/,/;p'
or, spelled out with comments:
sed -n ' # do not automatically print
# current line is the title
s/"/&&/g # double up any double quotes (CSV quote escaping)
s/^\|$/"/g # add leading and trailing double quotes
h # store current pattern space (title) into hold space
n # read the next line (the link) from input
s/"/&&/g # double up any double quotes (CSV quote escaping)
s/^\|$/"/g # add leading and trailing double quotes
x # exchange pattern space (link) and hold space (title)
G # append a newline to title and then append link
s/\n/,/ # replace the newline with a comma
p # and print it
'
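One caveat: the \| alternation used in s/^\|$/"/g is a GNU extension to basic regular expressions, so that form may not work with other sed implementations. A possible variant, assuming nothing beyond POSIX sed, wraps the whole line with s/.*/"&"/ instead. The function name pairs_to_csv_sed and the sample input are made up for illustration.

```shell
#!/bin/sh
# pairs_to_csv_sed: hypothetical helper name; same title/link pairing as above,
# but avoiding the GNU-only \| alternation.
pairs_to_csv_sed() {
    # For each line of a pair: double the quotes, then wrap the line in quotes.
    # h/n/x/G then join title and link exactly as in the commented script.
    sed -n 's/"/""/g; s/.*/"&"/; h; n; s/"/""/g; s/.*/"&"/; x; G; s/\n/,/; p'
}

# Made-up sample input standing in for the xmlstarlet output.
printf '%s\n' 'He said "hi"' 'http://example.com/a' | pairs_to_csv_sed
```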