Unable to completely clean scraped content in PHP

Time: 2017-02-11 19:08:27

Tags: php html regex utf-8 web-scraping

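I am trying to scrape the price of an IKEA product based on its article number. The user is given a form, all they have to do is copy/paste the article number, and the result should be the price of the item.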

The problem I am running into is that I am also scraping the euro (€) sign, and try as I might, I cannot seem to remove it completely.

In Chrome it appears to be gone, but it still shows up when I test in Firefox or Microsoft Edge.
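A hedged aside on the browser difference: a browser only shows how it decoded the bytes, so a symbol that looks gone in Chrome may still be sitting in the string. The sketch below uses a made-up sample value (`$scraped` is an assumption, not the real IKEA markup) and dumps its bytes with `bin2hex`, which does not depend on any browser.

```php
<?php
// Hypothetical sample of what the scraped price may contain:
// a euro sign (E2 82 AC), a non-breaking space U+00A0 (C2 A0), then the digits.
$scraped = "\xE2\x82\xAC\xC2\xA0119,90";

// Dump the raw bytes instead of trusting the rendered page.
$hex = bin2hex($scraped);
echo $hex, PHP_EOL; // e282acc2a03131392c3930

// Removing only the euro sign leaves the two-byte non-breaking space behind.
$withoutEuro = str_replace("\xE2\x82\xAC", '', $scraped);
$hexWithoutEuro = bin2hex($withoutEuro);
echo $hexWithoutEuro, PHP_EOL; // c2a03131392c3930
```

If the result page is served without a declared charset, a browser that falls back to Windows-1252 renders the `C2` byte as `Â`, which may be the stray character seen in Firefox. Note that a literal `Â` typed into a UTF-8 source file is the byte pair `C3 82`, so a replacement built from it cannot match a lone `C2`.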

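The reason I need this is that I want to add a function that returns the total price of the item when the quantity is greater than 1, which I cannot do while my scraped string still includes the euro sign.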

<html >
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    <!-- Bootstrap -->
    <link href="css/bootstrap.min.css" rel="stylesheet">
    <link href="css/styles.css" rel="stylesheet">
</head>
<body>
<div class="container">
<form method="get" action="calcprocess.php">
    <label for="name">Enter Product Number</label>
    <input type="text" name="product_number" required>
    <input type="submit">
</form>

Above is the code I use to enter the article number.

<?php
if (isset($_GET['product_number'])) {
//url
    $url = "http://www.ikea.com/it/it/catalog/products/";
//    Product Number
    $product = $_GET['product_number'];
//    removes the fullstops in the product number
    $product = preg_replace('/[^a-z0-9]+/i', '', $product);
//    Search the html of the page for the Name of the product
    $keyword_name = "name";
//    Search the html of the page for the Price of the product
    $keyword_price = "price1";
//    Return the URL of the desired product
    $item = $url . $product;
    echo $item;
    /**
     * Downloads a web page from $url, selects the element by $id
     * and returns its XML string representation.
     */
    function getElementByIdAsString($url, $id, $pretty = true)
    {
        $doc = new DOMDocument();

        // loadHTMLFile() returns false on failure; the DOMDocument object itself is always truthy
        if (!@$doc->loadHTMLFile($url)) {
            throw new Exception("Failed to load $url");
        }

        // Obtain the element
        $element = $doc->getElementById($id);

        if (!$element) {
            throw new Exception("An element with id $id was not found");
        }

        if ($pretty) {
            $doc->formatOutput = true;
        }

        // Return the string representation of the element
        return $doc->saveXML($element);
    }
//    Obtain the price
    $item_cost = getElementByIdAsString($item,$keyword_price) ;
//    Strips the Euro sign
    $symbols = array('$', '€', '£', 'â¬Â', '&euro', "\xE2\x82\xAc", "&nbsp;", "Â");
    $item_cost = str_replace($symbols,'',$item_cost);
//    $item_cost = preg_replace('/[^a-z0-9]+/i', '', $item_cost);
    echo $item_cost;
//    Problem HERE :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: Euro symbol not being removed, Cannot multiply item cost by quantity
    $totalcost = $item_cost + $item_cost;
    echo $totalcost;

    ?>

    <table class="calctable">
        <tr id="calchead">
            <th>Item Code</th>
            <th>Product Name</th>
            <th>Price</th>
        </tr>
        <tr>
            <td><?php echo $product;?></td>
            <td><?php echo getElementByIdAsString($item,$keyword_name);?></td>
            <td><?php echo "&euro".$item_cost;?></td>
        </tr>
    </table>
    </div>
    </body>
    </html>
    <?php


}?>

I have tried str_replace and preg_replace, but according to Firefox neither of them removes the euro symbol completely. The result I get is:

http://www.ikea.com/it/it/catalog/products/70103349 Â 119,90 0
Item Code   Product Name    Price
70103349    
MALM
    €  119,90 
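One thing the output above hints at: `saveXML($element)` returns the serialized element, tags included, so the string handed to `+` probably starts with markup rather than digits, which would explain the `0`. A minimal sketch of reading only the text instead, assuming the HTML has already been downloaded into a string; `getElementTextById` and the `$sample` markup are hypothetical and only stand in for the real page.

```php
<?php
/**
 * Hypothetical helper: returns the text inside the element with the
 * given id, without the surrounding tags.
 */
function getElementTextById(string $html, string $id): string
{
    $doc = new DOMDocument();

    // Collect parser warnings quietly instead of using the @ operator.
    $previous = libxml_use_internal_errors(true);
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        throw new RuntimeException('Failed to parse the HTML');
    }

    $element = $doc->getElementById($id);
    if ($element === null) {
        throw new RuntimeException("An element with id $id was not found");
    }

    // textContent holds only the text nodes, encoded as UTF-8.
    return trim($element->textContent);
}

// Made-up markup for illustration; the real page may differ.
$sample = '<html><body><span id="price1">&euro;&nbsp;119,90</span></body></html>';
$text = getElementTextById($sample, 'price1');
echo bin2hex($text), PHP_EOL; // e282acc2a03131392c3930
```

The returned text still contains the euro sign and the non-breaking space, but no tags, so it is a cleaner starting point for numeric cleanup.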

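To be clear, my question is how to completely strip the euro sign when scraping a website, so that I can do math on the value. Any help is appreciated.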
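A minimal sketch of one way to get a number out of such a string, not a definitive fix. Instead of listing every symbol to remove, it keeps only digits and separators. `parsePrice` and `$quantity` are hypothetical names, and the code assumes Italian formatting (`.` for thousands, `,` for decimals), as in `119,90`.

```php
<?php
/**
 * Hypothetical helper: turns a scraped price such as
 * '<span id="price1">€ 119,90</span>' into a float.
 * Assumes "." groups thousands and "," marks decimals.
 */
function parsePrice(string $raw): float
{
    // Drop any markup around the value.
    $text = strip_tags($raw);

    // Turn entities such as &euro; or &nbsp; into real UTF-8 characters.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Whitelist: keep digits and separators only. This discards the euro
    // sign, the non-breaking space (bytes C2 A0) and ordinary whitespace.
    $digits = preg_replace('/[^0-9.,]/', '', $text);

    if (!preg_match('/[0-9]/', $digits)) {
        throw new InvalidArgumentException('No price found in the input');
    }

    // "1.299,90" becomes "1299.90", which PHP can cast to a float.
    $normalized = str_replace(',', '.', str_replace('.', '', $digits));

    return (float) $normalized;
}

// Usage with a made-up quantity.
$quantity = 2;
$total = parsePrice("<span id=\"price1\">\xE2\x82\xAC\xC2\xA0119,90</span>") * $quantity;
echo number_format($total, 2, ',', '.'), PHP_EOL; // 239,80
```

A whitelist like this does not depend on knowing which invisible characters the page contains, which is why it may be more robust than a growing `str_replace` list. For real money arithmetic, working in integer cents would avoid float rounding.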

0 answers:

No answers