Regex help: finding a block of text using search terms spread across multiple lines

Date: 2016-07-15 10:26:53

Tags: c# regex text

Say I have the following blocks of text:

ONE asd blah| 1| 123| 222| -0.03| -62333| -2253| -121.26| -1120.12| XCT
TWO Three
Nine Twelve
Twenty
DDD

ONE ads blah| 42| 555| 5423| -345| -5422| -399815| -345| -345| XCT
TWO Three
Six Seven
Twenty
DDD

Now I want to find the block of text that contains all of the following:

ONE, TWO, Three, Nine, Twelve, Twenty

This should match the first block but not the second.

And then, similarly:

ONE, TWO, Three, Six, Seven, Twenty

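should match the second block but not the first.

How can I do this?

As a start, I tried the following to capture everything from one ONE up to, but not including, the next ONE: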

ONE((.|\n)*)(?=^ONE)

but even that doesn't work!

4 answers:

Answer 0: (score: 1)

Since you say the terms must occur in order, that's easy:

ONE(?:(?!ONE).)*?TWO(?:(?!ONE).)*?Three(?:(?!ONE).)*?Nine(?:(?!ONE).)*?Twelve(?:(?!ONE).)*?Twenty(?:(?!ONE).)*

This matches the first block but not the second. Test it live on regex101.com.

Explanation

(?:(?!ONE).)*?

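matches any number of characters, as long as none of them is the start of the text ONE. This ensures that the match never crosses over into a different block.

Make sure to compile the regex with RegexOptions.Singleline so that the dot also matches newlines.

Answer 1: (score: 0)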
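For what it's worth, here is a minimal C# sketch of how that pattern could be built and applied. The class name OrderedBlockMatcher and the method FindBlocks are made up for illustration, and the sketch assumes that the first term (ONE here) is the marker that starts every block.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class OrderedBlockMatcher
{
    // Returns every block that contains all terms in the given order.
    // Assumption: terms[0] is the marker that opens each block (e.g. "ONE").
    public static List<string> FindBlocks(string text, params string[] terms)
    {
        if (terms.Length == 0)
            throw new ArgumentException("At least one term is required.", "terms");

        // Any run of characters that never starts the block marker again.
        string gap = "(?:(?!" + Regex.Escape(terms[0]) + ").)*";

        // term1 gap? term2 gap? ... termN gap
        // The gaps between terms are lazy; the trailing gap is greedy so the
        // match runs to the end of the block.
        string pattern = string.Join(gap + "?", terms.Select(Regex.Escape)) + gap;

        // Singleline makes the dot match newlines as well.
        return Regex.Matches(text, pattern, RegexOptions.Singleline)
                    .Cast<Match>()
                    .Select(m => m.Value.TrimEnd())
                    .ToList();
    }
}
```

Calling OrderedBlockMatcher.FindBlocks(text, "ONE", "TWO", "Three", "Nine", "Twelve", "Twenty") on the sample text should return only the first block. Note that the marker is matched as a plain substring, so a word such as DONE inside a block would cut the match short.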

(?=.*?\bONE\b)(?=.*?\bTWO\b)(?=.*?\bThree\b)(?=.*?\bNine\b)(?=.*?\bTwelve\b)(?=.*?\bTwenty\b).*\n\n

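This matches your first block. Apply it in single-line mode (the s modifier in most regex literal notations, or a flag when constructing the regex object; RegexOptions.Singleline in .NET).

It is a list of conditions that must all be satisfied (in any order) before .*\n\n finally matches the block. Each condition is a positive lookahead that searches for a single word.

See: https://regex101.com/r/sC4tR1/1

This isn't "perfect" because there is no block boundary detection: each lookahead scans the rest of the string, so a word that only appears in a later block satisfies the condition too. If the block boundaries are regular in the string, the expression can be expanded to incorporate them.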
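One possible way to add that boundary detection, sketched in C#. This is only an illustration under the assumption that every block starts with ONE at the beginning of a line; UnorderedBlockMatcher and FindBlocks are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class UnorderedBlockMatcher
{
    // Returns every block that contains all words, in any order.
    // Assumption: a block starts with "ONE" at the beginning of a line.
    public static List<string> FindBlocks(string text, params string[] words)
    {
        // One character that is not directly followed by the start of a new block.
        const string step = @"(?:.(?!^ONE\b))";

        // One lookahead per word; each scan stays inside the current block.
        string conditions = string.Concat(
            words.Select(w => "(?=" + step + @"*?\b" + Regex.Escape(w) + @"\b)"));

        // Anchor at a block start, check all conditions, then consume the block.
        string pattern = @"^(?=ONE\b)" + conditions + step + "*";

        // Multiline: ^ matches at every line start. Singleline: dot matches newlines.
        var options = RegexOptions.Multiline | RegexOptions.Singleline;
        return Regex.Matches(text, pattern, options)
                    .Cast<Match>()
                    .Select(m => m.Value.TrimEnd())
                    .ToList();
    }
}
```

With the sample text, UnorderedBlockMatcher.FindBlocks(text, "Twenty", "Nine", "ONE") should return only the first block, regardless of the order in which the words are listed.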

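Another strategy is to split the string into separate blocks first, and then run the expression on each block rather than on the whole string.

Answer 2: (score: 0)

I have been parsing text like this for 40 years. Usually regex can't be used. Try the following code: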
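A rough C# sketch of that pre-splitting idea, assuming the blocks are separated by at least one blank line as in the question's sample. BlockSplitter and FindBlocks are names invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class BlockSplitter
{
    // Returns the blocks that contain every word as a whole word, in any order.
    // Assumption: blocks are separated by one or more blank lines.
    public static List<string> FindBlocks(string text, params string[] words)
    {
        // A newline, optional whitespace (including further blank lines), a newline.
        string[] blocks = Regex.Split(text.Trim(), @"\r?\n\s*\r?\n");

        return blocks
            .Where(block => words.All(
                w => Regex.IsMatch(block, @"\b" + Regex.Escape(w) + @"\b")))
            .ToList();
    }
}
```

Because each block is tested on its own, no lookahead can wander into a neighbouring block, and the per-word check stays a simple whole-word search.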

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;


namespace ConsoleApplication3
{
    class Program
    {
        const string FILENAME = @"c:\temp\test.txt";
        static void Main(string[] args)
        {
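            // Dispose the reader when done so the file handle is released.
            using (StreamReader reader = new StreamReader(FILENAME))
            {
                string inputLine;
                Block block = null;
                while ((inputLine = reader.ReadLine()) != null)
                {
                    inputLine = inputLine.Trim();
                    if (inputLine.Length == 0)
                        continue; // skip the blank lines between blocks

                    // A line starting with "ONE" opens a new block.
                    if (inputLine.StartsWith("ONE"))
                    {
                        block = new Block();
                        Block.blocks.Add(block);
                    }

                    // Ignore any lines that appear before the first "ONE".
                    if (block != null)
                        block.lines.Add(inputLine);
                }
            }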
        }
    }
    public class Block
    {
        public static List<Block> blocks = new List<Block>(); 
        public List<string> lines { get; set; }
        public Block()
        {
            lines = new List<string>();
        }
    }
}
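The code above only groups the lines into blocks. A minimal sketch of the search step that could follow is shown here; it is self-contained, so it regroups the lines itself rather than reusing the Block class. BlockSearch, GroupIntoBlocks and ContainsAll are hypothetical names, and the sketch assumes words are separated by spaces, tabs or the | character.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of any library.
public static class BlockSearch
{
    // Groups non-empty lines into blocks; a line starting with "ONE" opens a new block.
    public static List<List<string>> GroupIntoBlocks(IEnumerable<string> lines)
    {
        var blocks = new List<List<string>>();
        foreach (string raw in lines)
        {
            string line = raw.Trim();
            if (line.Length == 0)
                continue;
            if (line.StartsWith("ONE", StringComparison.Ordinal))
                blocks.Add(new List<string>());
            if (blocks.Count == 0)
                continue; // ignore lines before the first "ONE"
            blocks[blocks.Count - 1].Add(line);
        }
        return blocks;
    }

    // True when every search word appears as a whole token somewhere in the block.
    public static bool ContainsAll(List<string> block, IEnumerable<string> words)
    {
        var separators = new[] { ' ', '\t', '|' };
        var tokens = new HashSet<string>(
            block.SelectMany(l => l.Split(separators, StringSplitOptions.RemoveEmptyEntries)));
        return words.All(tokens.Contains);
    }
}
```

Filtering is then a matter of something like blocks.Where(b => BlockSearch.ContainsAll(b, words)), with no regex involved.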

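Answer 3: (score: 0)

Are you trying to extract specific words from a specific text structure (a regex test/match), or are you trying to see whether a given text contains specific words (since you seem to know which words you are looking for)?

If it's the latter, how about Aho-Corasick?

I have used this in the past. It is a very, very fast algorithm for searching a text for a specific set of strings.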

// Copyright (c) 2013 Pēteris Ņikiforovs
// 
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
// 
// The above copyright notice and this permission notice shall be included in
// all copies or substantial portions of the Software.
// 
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
// THE SOFTWARE.

using System.Collections;
using System.Collections.Generic;

    /// <summary>
    /// Trie that will find and return strings found in a text.
    /// </summary>
    public class Trie : Trie<string>
    {
        public Trie(){}

        public Trie(IEnumerable<string> source)
        {
            Add(source);
            Build();
        }
        /// <summary>
        /// Adds a string.
        /// </summary>
        /// <param name="s">The string to add.</param>
        public void Add(string s)
        {
            Add(s, s);
        }

        /// <summary>
        /// Adds multiple strings.
        /// </summary>
        /// <param name="strings">The strings to add.</param>
        public void Add(IEnumerable<string> strings)
        {
            foreach (string s in strings)
            {
                Add(s);
            }
        }
    }

    /// <summary>
    /// Trie that will find strings in a text and return values of type <typeparamref name="TValue"/>
    /// for each string found.
    /// </summary>
    /// <typeparam name="TValue">Value type.</typeparam>
    public class Trie<TValue> : Trie<char, TValue>
    {
    }

    /// <summary>
    /// Trie that will find strings or phrases and return values of type <typeparamref name="TValue"/>
    /// for each string or phrase found.
    /// </summary>
    /// <remarks>
    /// <typeparamref name="T"/> will typically be a char for finding strings
    /// or a string for finding phrases or whole words.
    /// </remarks>
    /// <typeparam name="T">The type of a letter in a word.</typeparam>
    /// <typeparam name="TValue">The type of the value that will be returned when the word is found.</typeparam>
    public class Trie<T, TValue>
    {
        /// <summary>
        /// Root of the trie. It has no value and no parent.
        /// </summary>
        private readonly Node<T, TValue> root = new Node<T, TValue>();

        /// <summary>
        /// Adds a word to the tree.
        /// </summary>
        /// <remarks>
        /// A word consists of letters. A node is built for each letter.
        /// If the letter type is char, then the word will be a string, since it consists of letters.
        /// But a letter could also be a string which means that a node will be added
        /// for each word and so the word is actually a phrase.
        /// </remarks>
        /// <param name="word">The word that will be searched.</param>
        /// <param name="value">The value that will be returned when the word is found.</param>
        public void Add(IEnumerable<T> word, TValue value)
        {
            // start at the root
            var node = root;

            // build a branch for the word, one letter at a time
            // if a letter node doesn't exist, add it
            foreach (T c in word)
            {
                var child = node[c];

                if (child == null)
                    child = node[c] = new Node<T, TValue>(c, node);

                node = child;
            }

            // mark the end of the branch
            // by adding a value that will be returned when this word is found in a text
            node.Values.Add(value);
        }


        /// <summary>
        /// Constructs fail or fall links.
        /// </summary>
        public void Build()
        {
            // construction is done using breadth-first-search
            var queue = new Queue<Node<T, TValue>>();
            queue.Enqueue(root);

            while (queue.Count > 0)
            {
                var node = queue.Dequeue();

                // visit children
                foreach (var child in node)
                    queue.Enqueue(child);

                // fail link of root is root
                if (node == root)
                {
                    root.Fail = root;
                    continue;
                }

                var fail = node.Parent.Fail;

                while (fail[node.Word] == null && fail != root)
                    fail = fail.Fail;

                node.Fail = fail[node.Word] ?? root;
                if (node.Fail == node)
                    node.Fail = root;
            }
        }

        /// <summary>
        /// Finds all added words in a text.
        /// </summary>
        /// <param name="text">The text to search in.</param>
        /// <returns>The values that were added for the found words.</returns>
        public IEnumerable<TValue> Find(IEnumerable<T> text)
        {
            var node = root;

            foreach (T c in text)
            {
                while (node[c] == null && node != root)
                    node = node.Fail;

                node = node[c] ?? root;

                for (var t = node; t != root; t = t.Fail)
                {
                    foreach (TValue value in t.Values)
                        yield return value;
                }
            }
        }

        /// <summary>
        /// Node in a trie.
        /// </summary>
        /// <typeparam name="TNode">The same as the parent type.</typeparam>
        /// <typeparam name="TNodeValue">The same as the parent value type.</typeparam>
        private class Node<TNode, TNodeValue> : IEnumerable<Node<TNode, TNodeValue>>
        {
            private readonly TNode word;
            private readonly Node<TNode, TNodeValue> parent;
            private readonly Dictionary<TNode, Node<TNode, TNodeValue>> children = new Dictionary<TNode, Node<TNode, TNodeValue>>();
            private readonly List<TNodeValue> values = new List<TNodeValue>();

            /// <summary>
            /// Constructor for the root node.
            /// </summary>
            public Node()
            {
            }

            /// <summary>
            /// Constructor for a node with a word
            /// </summary>
            /// <param name="word"></param>
            /// <param name="parent"></param>
            public Node(TNode word, Node<TNode, TNodeValue> parent)
            {
                this.word = word;
                this.parent = parent;
            }

            /// <summary>
            /// Word (or letter) for this node.
            /// </summary>
            public TNode Word
            {
                get { return word; }
            }

            /// <summary>
            /// Parent node.
            /// </summary>
            public Node<TNode, TNodeValue> Parent
            {
                get { return parent; }
            }

            /// <summary>
            /// Fail or fall node.
            /// </summary>
            public Node<TNode, TNodeValue> Fail
            {
                get;
                set;
            }

            /// <summary>
            /// Children for this node.
            /// </summary>
            /// <param name="c">Child word.</param>
            /// <returns>Child node.</returns>
            public Node<TNode, TNodeValue> this[TNode c]
            {
                get { return children.ContainsKey(c) ? children[c] : null; }
                set { children[c] = value; }
            }

            /// <summary>
            /// Values for words that end at this node.
            /// </summary>
            public List<TNodeValue> Values
            {
                get { return values; }
            }

            /// <inherit/>
            public IEnumerator<Node<TNode, TNodeValue>> GetEnumerator()
            {
                return children.Values.GetEnumerator();
            }

            /// <inherit/>
            IEnumerator IEnumerable.GetEnumerator()
            {
                return GetEnumerator();
            }

            /// <inherit/>
            public override string ToString()
            {
                return Word.ToString();
            }
        }
    }