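Notepad++: delete tags that contain specific text

Time: 2016-05-30 14:37:23

Tags: php regex xml notepad++

I have a large XML file of products, over 20 MB in size, and I am trying to delete every product that is out of stock.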

<product>
  <name>bla1</name>
  <price>50$</price>
  <stock>yes</stock>
  <description>bla</description>
</product>

<product>
  <name>bla2</name>
  <price>60$</price>
  <stock>no</stock>
  <description>bla</description>
</product>

...

Can these be removed with a regular expression in Notepad++, or should I use SimpleXML (PHP) or something similar?

My basic PHP code:

$url = 'input/products.xml';
    $xml = new SimpleXMLElement(file_get_contents($url));

    foreach ($xml->product->children() as $product) {

        //finding out of stock products and deleting them

    }
    $xml->asXml('output/products.xml');

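3 answers:

Answer 0 (score: 2)

Foreword

Pattern matching with regular expressions is not ideal for this; if you have access to PHP, I would recommend using a proper HTML/XML parsing tool. That said, here is a solution that can be used in Notepad++.

Description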

<product\s*(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\s?\/?>(?:(?!</product).)*<stock\s*(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\s?\/?>no</stock>(?:(?!</product).)*<\/product>

Replace with: nothing (leave the field empty)


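This regular expression will do the following:

  • find an entire product section
  • require that it has a stock child tag
  • require that the stock child tag has the value no
  • avoid the difficult edge cases that come with pattern matching in HTML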
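
For readers who would rather run the same pattern from PHP than from an editor, here is a minimal sketch. It assumes the pattern is wrapped in ~ delimiters with the s modifier (so that . can match newlines), and it uses the sample data from the question; the variable names are only illustrative.

```php
<?php
// The answer's pattern inside ~...~ delimiters. The trailing "s" modifier makes
// "." match newlines, which the pattern needs to span a multi-line <product>.
$pattern = <<<'REGEX'
~<product\s*(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\s?\/?>(?:(?!</product).)*<stock\s*(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\s?\/?>no</stock>(?:(?!</product).)*<\/product>~s
REGEX;

// Sample input taken from the question.
$xml = <<<'XML'
<product>
  <name>bla1</name>
  <price>50$</price>
  <stock>yes</stock>
  <description>bla</description>
</product>

<product>
  <name>bla2</name>
  <price>60$</price>
  <stock>no</stock>
  <description>bla</description>
</product>
XML;

// preg_replace() returns null on failure (for example when a PCRE limit is hit).
$result = preg_replace($pattern, '', $xml);
if ($result === null) {
    throw new RuntimeException('preg_replace failed, error code ' . preg_last_error());
}

echo $result;
```

On a 20 MB file a pattern like this may run into PCRE backtracking limits, which is one more reason to prefer a real parser when PHP is available.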

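From Notepad++

Note that you should use Notepad++ 6.1 or later; older versions had problems with regular expressions that have since been fixed.

  1. Press Ctrl+H to open Find and Replace

  2. Select the Regular expression option, and check ". matches newline" so that . can span the lines of a product

  3. Put the regular expression in the "Find what" field

  4. Leave the "Replace with" field empty

  5. Click Replace All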

Example

    Live demo

    https://regex101.com/r/cW9nC5/1

    Sample text

    <product>
      <name>bla1</name>
      <price>50$</price>
      <stock>yes</stock>
      <description>bla</description>
    </product>
    
    <product>
      <name>bla2</name>
      <price>60$</price>
      <stock>no</stock>
      <description>bla</description>
    </product>
    

    After replacement

    <product>
      <name>bla1</name>
      <price>50$</price>
      <stock>yes</stock>
      <description>bla</description>
    </product>
    

    Explanation

    NODE                     EXPLANATION
    ----------------------------------------------------------------------
      <product                 '<product'
    ----------------------------------------------------------------------
      \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                               more times (matching the most amount
                               possible))
    ----------------------------------------------------------------------
      (?:                      group, but do not capture (0 or more times
                               (matching the least amount possible)):
    ----------------------------------------------------------------------
        [^>=]                    any character except: '>', '='
    ----------------------------------------------------------------------
       |                        OR
    ----------------------------------------------------------------------
        ='                       '=\''
    ----------------------------------------------------------------------
        [^']*                    any character except: ''' (0 or more
                                 times (matching the most amount
                                 possible))
    ----------------------------------------------------------------------
        '                        '\''
    ----------------------------------------------------------------------
       |                        OR
    ----------------------------------------------------------------------
        ="                       '="'
    ----------------------------------------------------------------------
        [^"]*                    any character except: '"' (0 or more
                                 times (matching the most amount
                                 possible))
    ----------------------------------------------------------------------
        "                        '"'
    ----------------------------------------------------------------------
       |                        OR
    ----------------------------------------------------------------------
        =                        '='
    ----------------------------------------------------------------------
        [^'"]                    any character except: ''', '"'
    ----------------------------------------------------------------------
        [^\s>]*                  any character except: whitespace (\n,
                                 \r, \t, \f, and " "), '>' (0 or more
                                 times (matching the most amount
                                 possible))
    ----------------------------------------------------------------------
      )*?                      end of grouping
    ----------------------------------------------------------------------
      \s?                      whitespace (\n, \r, \t, \f, and " ")
                               (optional (matching the most amount
                               possible))
    ----------------------------------------------------------------------
      \/?                      '/' (optional (matching the most amount
                               possible))
    ----------------------------------------------------------------------
      >                        '>'
    ----------------------------------------------------------------------
      (?:                      group, but do not capture (0 or more times
                               (matching the most amount possible)):
    ----------------------------------------------------------------------
        (?!                      look ahead to see if there is not:
    ----------------------------------------------------------------------
          </product                '</product'
    ----------------------------------------------------------------------
        )                        end of look-ahead
    ----------------------------------------------------------------------
        .                        any character (including \n, given
                                 ". matches newline" is enabled)
    ----------------------------------------------------------------------
      )*                       end of grouping
    ----------------------------------------------------------------------
      <stock                   '<stock'
    ----------------------------------------------------------------------
      \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                               more times (matching the most amount
                               possible))
    ----------------------------------------------------------------------
      (?:                      group, but do not capture (0 or more times
                               (matching the least amount possible)):
    ----------------------------------------------------------------------
        [^>=]                    any character except: '>', '='
    ----------------------------------------------------------------------
       |                        OR
    ----------------------------------------------------------------------
        ='                       '=\''
    ----------------------------------------------------------------------
        [^']*                    any character except: ''' (0 or more
                                 times (matching the most amount
                                 possible))
    ----------------------------------------------------------------------
        '                        '\''
    ----------------------------------------------------------------------
       |                        OR
    ----------------------------------------------------------------------
        ="                       '="'
    ----------------------------------------------------------------------
        [^"]*                    any character except: '"' (0 or more
                                 times (matching the most amount
                                 possible))
    ----------------------------------------------------------------------
        "                        '"'
    ----------------------------------------------------------------------
       |                        OR
    ----------------------------------------------------------------------
        =                        '='
    ----------------------------------------------------------------------
        [^'"]                    any character except: ''', '"'
    ----------------------------------------------------------------------
        [^\s>]*                  any character except: whitespace (\n,
                                 \r, \t, \f, and " "), '>' (0 or more
                                 times (matching the most amount
                                 possible))
    ----------------------------------------------------------------------
      )*?                      end of grouping
    ----------------------------------------------------------------------
      \s?                      whitespace (\n, \r, \t, \f, and " ")
                               (optional (matching the most amount
                               possible))
    ----------------------------------------------------------------------
      \/?                      '/' (optional (matching the most amount
                               possible))
    ----------------------------------------------------------------------
      >no</stock>              '>no</stock>'
    ----------------------------------------------------------------------
      (?:                      group, but do not capture (0 or more times
                               (matching the most amount possible)):
    ----------------------------------------------------------------------
        (?!                      look ahead to see if there is not:
    ----------------------------------------------------------------------
          </product                '</product'
    ----------------------------------------------------------------------
        )                        end of look-ahead
    ----------------------------------------------------------------------
        .                        any character (including \n, given
                                 ". matches newline" is enabled)
    ----------------------------------------------------------------------
      )*                       end of grouping
    ----------------------------------------------------------------------
      <                        '<'
    ----------------------------------------------------------------------
      \/                       '/'
    ----------------------------------------------------------------------
      product>                 'product>'
    ----------------------------------------------------------------------
    

Answer 1 (score: 1)

I guess Notepad++ would be easier, i.e.:

Find: <product>\s+<name>.*?<\/name>\s+<price>.*?<\/price>\s+<stock>no<\/stock>\s+<description>.*?<\/description>\s+<\/product>
Replace: nothing (leave it empty)

DEMO

https://regex101.com/r/fH0mM7/1

Note

Be sure to check Regular expression at the bottom of the Replace dialog
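
One thing worth knowing about this shorter pattern is that it spells out the child tags in a fixed order (name, price, stock, description). The sketch below, which assumes the pattern is run through PHP's preg_replace with ~ delimiters and uses made-up sample products, shows that a product is removed only when its children appear in exactly that order.

```php
<?php
// Answer 1's pattern (with the closing </description> tag written out in full).
// It expects name, price, stock, description in exactly this order.
$pattern = '~<product>\s+<name>.*?<\/name>\s+<price>.*?<\/price>\s+<stock>no<\/stock>\s+<description>.*?<\/description>\s+<\/product>~';

// Children in the order the pattern expects.
$inOrder = <<<'XML'
<product>
  <name>bla2</name>
  <price>60$</price>
  <stock>no</stock>
  <description>bla</description>
</product>
XML;

// Same product, but <stock> comes before <price> (hypothetical variation).
$reordered = <<<'XML'
<product>
  <name>bla2</name>
  <stock>no</stock>
  <price>60$</price>
  <description>bla</description>
</product>
XML;

$removed = preg_replace($pattern, '', $inOrder) === '';           // true: product deleted
$kept    = preg_replace($pattern, '', $reordered) === $reordered; // true: product untouched

var_dump($removed, $kept);
```

If the real file is not this regular, the more general pattern from Answer 0 or an XML parser is the safer choice.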

Answer 2 (score: 1)

You can do this with PHP, using the following code:
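<?php
    $url = 'input/products.xml';
    $xml = new SimpleXMLElement(file_get_contents($url));

    // Iterate backwards so that removing a node does not shift the
    // indexes of the products that still have to be checked.
    for ($i = count($xml->product) - 1; $i >= 0; --$i) {
        $product = $xml->product[$i];
        if ((string) $product->stock === 'no') {
            unset($xml->product[$i]);
        }
    }
    $xml->asXml('output/products.xml');
?>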
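
To try the same idea without touching any files, here is a small self-contained sketch that works on an XML string. The helper name removeOutOfStock and the <products> root element are assumptions made for illustration; the question does not show the file's actual root element.

```php
<?php
// Hypothetical helper: removes every <product> whose <stock> is "no".
function removeOutOfStock(string $xmlString): string
{
    $xml = new SimpleXMLElement($xmlString);

    // Walk backwards so that unset() does not shift the indexes still to visit.
    for ($i = count($xml->product) - 1; $i >= 0; --$i) {
        if (trim((string) $xml->product[$i]->stock) === 'no') {
            unset($xml->product[$i]);
        }
    }

    return $xml->asXML();
}

// Assumed document shape: a single <products> root wrapping the <product> nodes.
$input = '<products>'
    . '<product><name>bla1</name><stock>yes</stock></product>'
    . '<product><name>bla2</name><stock>no</stock></product>'
    . '</products>';

$output = removeOutOfStock($input);
echo $output;
```

Note that SimpleXML loads the whole document into memory, so for files much larger than the 20 MB mentioned in the question a streaming parser may be a better fit.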