从页面中提取数据

时间:2013-09-05 12:51:24

标签: php html file-get-contents html-content-extraction

您好我想创建一个页面html和php,它能够获取此链接所包含的表格中的数据:http://www.comuni-italiani.it/province.html

我希望有任何提示,我会使用file_get_content但我不知道如何获取所有各种数据

2 个答案:

答案 0 :(得分:1)

您能否更清楚地向我们解释您希望从此页面中完成的内容?

无论如何,为了做到这一点,您可以使用file_get_contents来获取页面,然后根据您要从页面中获取的内容(我想您要从表中的页面中获取每个<td>元素),您可以使用PHP regular expressions (preg_match, preg_match_all)来获取所需的所有数据。

您的案例示例:

$page = file_get_contents("http://www.comuni-italiani.it/province.html");

$output = array();
preg_match_all('/<td.*.<\/td>/',$page,$output);

print_r($output);

这将输出:

Array ( [0] => Array ( [0] =>    [1] => [2] => Agrigento [3] => Alessandria [4] => Ancona [5] => Aosta [6] => Arezzo [7] => Ascoli Piceno [8] => Asti [9] => Avellino [10] => Bari [11] => Barletta-Andria-Trani [12] => Belluno [13] => Benevento [14] => Bergamo [15] => Biella [16] => Bologna [17] => Bolzano [18] => Brescia [19] => Brindisi [20] => Cagliari [21] => Caltanissetta [22] => Campobasso [23] => Carbonia-Iglesias [24] => Caserta [25] => Catania [26] => Catanzaro [27] => Chieti [28] => Como [29] => Cosenza [30] => Cremona [31] => Crotone [32] => Cuneo [33] => Enna [34] => Fermo [35] => Ferrara [36] => Firenze [37] => Foggia [38] => Forlì-Cesena [39] => Frosinone [40] => Genova [41] => Gorizia [42] => Grosseto [43] => Imperia [44] => Isernia [45] => La Spezia [46] => L'Aquila [47] => Latina [48] => Lecce [49] => Lecco [50] => Livorno [51] => Lodi [52] => Lucca [53] => Macerata [54] => Mantova [55] => Massa-Carrara [56] => Matera [57] => Messina [58] => Milano [59] => Modena [60] => Monza e della Brianza [61] => Napoli [62] => Novara [63] => Nuoro [64] => Olbia-Tempio [65] => Oristano [66] => Padova [67] => Palermo [68] => Parma [69] => Pavia [70] => Perugia [71] => Pesaro e Urbino [72] => Pescara [73] => Piacenza [74] => Pisa [75] => Pistoia [76] => Pordenone [77] => Potenza [78] => Prato [79] => Ragusa [80] => Ravenna [81] => Reggio Calabria [82] => Reggio Emilia [83] => Rieti [84] => Rimini [85] => Roma [86] => Rovigo [87] => Salerno [88] => Medio Campidano [89] => Sassari [90] => Savona [91] => Siena [92] => Siracusa [93] => Sondrio [94] => Taranto [95] => Teramo [96] => Terni [97] => Torino [98] => Ogliastra [99] => Trapani [100] => Trento [101] => Treviso [102] => Trieste [103] => Udine [104] => Varese [105] => Venezia [106] => Verbano-Cusio-Ossola [107] => Vercelli [108] => Verona [109] => Vibo Valentia [110] => Vicenza [111] => Viterbo [112] => CercaNel Sito e sul WebPagine UtiliElenco Province per PopolazionePrincipali Città ItalianeLista Alfabetica RegioniAmministrazioni LocaliScuole in Italia [113] =>   ) )

当然可以过滤。

在您的情况下,例如,通过添加一个小的foreach循环......:

$page = file_get_contents("http://www.comuni-italiani.it/province.html");

    $output = array();
    preg_match_all('/<td.*.<\/td>/',$page,$output);

    $provinces = array();

    foreach ($output as $id => $list) {
        for ($i = 2; $i <= 111; $i++) {
            array_push($provinces,$list[$i]);
        }
    }

    print_r($provinces);

会给你这个:

Array ( [0] => Agrigento [1] => Alessandria [2] => Ancona [3] => Aosta [4] => Arezzo [5] => Ascoli Piceno [6] => Asti [7] => Avellino [8] => Bari [9] => Barletta-Andria-Trani [10] => Belluno [11] => Benevento [12] => Bergamo [13] => Biella [14] => Bologna [15] => Bolzano [16] => Brescia [17] => Brindisi [18] => Cagliari [19] => Caltanissetta [20] => Campobasso [21] => Carbonia-Iglesias [22] => Caserta [23] => Catania [24] => Catanzaro [25] => Chieti [26] => Como [27] => Cosenza [28] => Cremona [29] => Crotone [30] => Cuneo [31] => Enna [32] => Fermo [33] => Ferrara [34] => Firenze [35] => Foggia [36] => Forlì-Cesena [37] => Frosinone [38] => Genova [39] => Gorizia [40] => Grosseto [41] => Imperia [42] => Isernia [43] => La Spezia [44] => L'Aquila [45] => Latina [46] => Lecce [47] => Lecco [48] => Livorno [49] => Lodi [50] => Lucca [51] => Macerata [52] => Mantova [53] => Massa-Carrara [54] => Matera [55] => Messina [56] => Milano [57] => Modena [58] => Monza e della Brianza [59] => Napoli [60] => Novara [61] => Nuoro [62] => Olbia-Tempio [63] => Oristano [64] => Padova [65] => Palermo [66] => Parma [67] => Pavia [68] => Perugia [69] => Pesaro e Urbino [70] => Pescara [71] => Piacenza [72] => Pisa [73] => Pistoia [74] => Pordenone [75] => Potenza [76] => Prato [77] => Ragusa [78] => Ravenna [79] => Reggio Calabria [80] => Reggio Emilia [81] => Rieti [82] => Rimini [83] => Roma [84] => Rovigo [85] => Salerno [86] => Medio Campidano [87] => Sassari [88] => Savona [89] => Siena [90] => Siracusa [91] => Sondrio [92] => Taranto [93] => Teramo [94] => Terni [95] => Torino [96] => Ogliastra [97] => Trapani [98] => Trento [99] => Treviso [100] => Trieste [101] => Udine [102] => Varese [103] => Venezia [104] => Verbano-Cusio-Ossola [105] => Vercelli [106] => Verona [107] => Vibo Valentia [108] => Vicenza [109] => Viterbo )

(对不起巨大的数组)。

然而,保持数组内的链接是这样的,如果你只想获取值而不是与之关联的锚,只需随意使用另一个正则表达式

希望这有帮助。

(以此为例,请记住,如果页面发生变化,这个foreach技巧可能不再起作用,我发布它只是为了让您了解如何解决该案例。)

答案 1 :(得分:0)

尝试详细了解DOMDocument参考:http://php.net/manual/en/class.domdocument.php

此外,这些问题可能对您有所帮助:

Getting DOM elements by classname

PHP Parse HTML code