Extract td / tr from HTML in bash?

Asked: 2012-12-28 23:42:32

Tags: html bash web-scraping html-table

I have the page http://www.cpubenchmark.net/cpu_list.php, and I want to extract specific CPUs along with their name, rank, and benchmark score.

Example ("Intel Core i5"):

Intel Core i5-3450 @ 3.10GHz - Score: 3333 - Rank: 1
Intel Core i5-3450S @ 2.80GHz - Score: 2222 - Rank: 2
Intel Core i5-2380P @ 3.10GHz - Score: 1111 - Rank: 3
...

How can I do this in bash? I tried starting with something like this (without the CPU filter, since I don't know how that would work):

#!/bin/sh
curl http://www.cpubenchmark.net/cpu_list.php | grep '^<TR><TD>' \
| sed \
    -e 's:<TR>::g'  \
    -e 's:</TR>::g' \
    -e 's:</TD>::g' \
    -e 's:<TD>: :g' \
| cut -c2- >> /home/test.txt

The output looks like this:

<A HREF="cpu_lookup.php?cpu=686+Gen&amp;id=1495">686 Gen</A> 288 1559 NA NA
<A HREF="cpu_lookup.php?cpu=AMD+A10-4600M+APU&amp;id=10">AMD A10-4600M APU</A> 3175 388 NA NA
<A HREF="cpu_lookup.php?cpu=AMD+A10-4655M+APU&amp;id=11">AMD A10-4655M APU</A> 3017 406 NA NA

2 Answers:

Answer 0: (score: 4)

If you are willing to install an additional program, you can use my Xidel.

All CPUs:

xidel http://www.cpubenchmark.net/cpu_list.php -e '//table[@id="cputable"]//tr/concat(td[1], " - Score: ", td[2], " - Rank: ", td[3])'

Only the CPUs starting with Intel...:

xidel http://www.cpubenchmark.net/cpu_list.php -e '//table[@id="cputable"]//tr[starts-with(td[1], "Intel Core i5")]/concat(td[1], " - Score: ", td[2], " - Rank: ", td[3])'

It can even sort them (I had never used that feature before):

xidel http://www.cpubenchmark.net/cpu_list.php -e 'for $row in //table[@id="cputable"]//tr[starts-with(td[1], "Intel Core i5")] order by $row/td[3] return $row/concat(td[1], " - Score: ", td[2], " - Rank: ", td[3])' --extract-kind=xquery
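For comparison, the same ordering can also be done after the fact with standard tools. This is only a minimal sketch, assuming the lines are already in the "Name - Score: N - Rank: N" format from the question; `sort_by_rank` is a hypothetical helper name, not part of Xidel.

```shell
# Hypothetical helper: sort "Name - Score: N - Rank: N" lines by rank.
# Assumes every input line ends with " - Rank: <number>".
sort_by_rank() {
    # Prefix each line with its rank and a tab, sort numerically on that
    # prefix, then cut the prefix off again.
    awk -F' - Rank: ' '{ print $NF "\t" $0 }' | sort -k1,1n | cut -f2-
}
```

Any command that prints lines in that format can be piped into `sort_by_rank`; because the comparison is numeric, rank 10 sorts after rank 2.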

Answer 1: (score: 0)

A bash solution tailored strictly to the current page format:

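#!/bin/bash
# Reads the page's HTML from standard input and prints one line per CPU.

# Move the text of the first remaining <TD> cell from $line into $cell.
nextcell() {
    cell=${line%%</TD>*}
    # remove closing link tag if any
    cell=${cell%</?>}
    # drop everything up to the last '>' (the opening tags)
    cell=${cell##*>}
    # advance $line past the cell just read
    line=${line#*</TD>}
}

# IFS= and -r keep whitespace and backslashes in each line intact
while IFS= read -r line
do
    # only data rows contain a link to cpu_lookup.php
    if [[ ! "$line" =~ cpu_lookup.php ]]
    then
        continue
    fi
    nextcell
    printf '%s' "$cell"
    nextcell
    printf ' - Score: %s' "$cell"
    nextcell
    printf ' - Rank: %s\n' "$cell"
done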
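
The script above prints every CPU, while the question asks for one family only. Below is a minimal sketch of the same cell-by-cell parsing with a name-prefix filter added. It assumes each data row sits on one line in the form `<TR><TD><A HREF="cpu_lookup.php?...">name</A></TD><TD>score</TD><TD>rank</TD>...`, as the question's output suggests; `next_cell` and `filter_cpus` are hypothetical helper names.

```shell
# Hypothetical helper: move the text of the first remaining <TD> cell
# from $line into $cell (same idea as the parsing in the answer).
next_cell() {
    cell=${line%%"</TD>"*}   # keep everything before the first </TD>
    cell=${cell%"</"?">"}    # drop a trailing </A>, if any
    cell=${cell##*">"}       # drop the opening tags
    line=${line#*"</TD>"}    # advance past the cell just read
}

# Hypothetical helper: read HTML on stdin and print only the rows whose
# CPU name starts with the prefix given as $1.
filter_cpus() {
    prefix=$1
    while IFS= read -r line; do
        # Assumption: only data rows contain a link to cpu_lookup.php
        case $line in
            *cpu_lookup.php*) ;;
            *) continue ;;
        esac
        next_cell; name=$cell
        next_cell; score=$cell
        next_cell; rank=$cell
        # Print the row only when the name begins with the prefix
        case $name in
            "$prefix"*)
                printf '%s - Score: %s - Rank: %s\n' "$name" "$score" "$rank"
                ;;
        esac
    done
}
```

With these functions defined, something like `curl -s http://www.cpubenchmark.net/cpu_list.php | filter_cpus "Intel Core i5"` should list only that family, as long as the page still uses the row layout assumed here.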