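Correct regex to match the strings with the largest numeric value

Date: 2019-08-11 09:32:46

Tags: regex

I am trying to find a way, using a regular expression, to match from each group of URLs below only the one whose trailing number is the largest.

Input strings: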

    https://subdomain.domain.com/e8cf09b4763e03d208dfd21121baacd4/domain_p6amv8xJVr1qto1_500.txt
    https://subdomain.domain.com/e8cf09b4763e03d208dfd21121baacd4/domain_p6amv8xJVr1qto1_400.txt
    https://subdomain.domain.com/e8cf09b4763e03d208dfd21121baacd4/domain_p6amv8xJVr1qto1_250.txt
    https://subdomain.domain.com/e8cf09b4763e03d208dfd21121baacd4/domain_p6amv8xJVr1qto1_10.txt

    https://subdomain.domain.com/163c7b0508062729dsdk1f1e264210/domain_p6amv8xJVr1wvilqto2_640.txt
    https://subdomain.domain.com/163c7b0508062729dsdk1f1e264210/domain_p6amv8xJVr1wvilqto2_1280.txt
    https://subdomain.domain.com/163c7b0508062729dsdk1f1e264210/domain_p6amv8xJVr1wvilqto2_540.txt

    https://subdomain.domain.com/adfd386be957c3247/domain_p6amv8xJVr1wvilqto3_250.txt
    https://subdomain.domain.com/adfd386be957c3247/domain_p6amv8xJVr1wvilqto3_100.txt

    https://subdomain.domain.com/25e5ccd5e95ca2888a39b939f199b822/domain_p6amv8xJVr1ilqto4_640.txt
    https://subdomain.domain.com/25e5ccd5e95ca2888a39b939f199b822/domain_p6amv8xJVr1ilqto4_540.txt
    https://subdomain.domain.com/25e5ccd5e95ca2888a39b939f199b822/domain_p6amv8xJVr1ilqto4_980.csv

Expected output:

    https://subdomain.domain.com/e8cf09b4763e03d208dfd21121baacd4/domain_p6amv8xJVr1qto1_500.txt
    https://subdomain.domain.com/163c7b0508062729dsdk1f1e264210/domain_p6amv8xJVr1wvilqto2_1280.txt
    https://subdomain.domain.com/adfd386be957c3247/domain_p6amv8xJVr1wvilqto3_250.txt
    https://subdomain.domain.com/25e5ccd5e95ca2888a39b939f199b822/domain_p6amv8xJVr1ilqto4_980.csv

I am trying the expression below, but it captures every URL. How can I restrict the results to the ones I want?

    "https://subdomain.domain.com/([^,:"]+?([_\d]*?)).(txt|csv)"

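3 Answers:

Answer 0: (score: 1)

If your data really is grouped the way it appears in your question, this is easy to do with a regex. The regex matches one block of consecutive lines at a time, and the code then picks the largest number from that block's captures.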

@"(?m)(?:^[^\S\r\n]*(https?://\S+?_)(\d+)\.(txt|csv)[^\S\r\n]*$\r?\n)+(?=\s*\r\n|$)"

Explanation

 (?m)
 (?:                           # Cluster group for block
      ^                             # BOL
      [^\S\r\n]*                    # Optional horizontal whitespace
      ( https?:// \S+? _ )          # (1), Location
      ( \d+ )                       # (2), Number
      \. 
      ( txt | csv )                 # (3), Extension
      [^\S\r\n]*                    # Optional horizontal whitespace
      $ \r? \n                      # EOL plus linebreak
 )+                            # End cluster, 1 to many times
 (?= \s* \r \n | $ )           # Lookahead to determine where the end of block is

C# code example

using System;
using System.Text.RegularExpressions;

var str =
"    https://subdomain.domain.com/e8cf09b4763e03d208dfd21121baacd4/domain_p6amv8xJVr1qto1_500.txt\n" + 
"    https://subdomain.domain.com/e8cf09b4763e03d208dfd21121baacd4/domain_p6amv8xJVr1qto1_400.txt\n" +
"    https://subdomain.domain.com/e8cf09b4763e03d208dfd21121baacd4/domain_p6amv8xJVr1qto1_250.txt\n" +
"    https://subdomain.domain.com/e8cf09b4763e03d208dfd21121baacd4/domain_p6amv8xJVr1qto1_10.txt\n" +
"\n" +
"    https://subdomain.domain.com/163c7b0508062729dsdk1f1e264210/domain_p6amv8xJVr1wvilqto2_640.txt\n" +
"    https://subdomain.domain.com/163c7b0508062729dsdk1f1e264210/domain_p6amv8xJVr1wvilqto2_1280.txt\n" +
"    https://subdomain.domain.com/163c7b0508062729dsdk1f1e264210/domain_p6amv8xJVr1wvilqto2_540.txt\n" +
"\n" +
"    https://subdomain.domain.com/adfd386be957c3247/domain_p6amv8xJVr1wvilqto3_250.txt\n" +
"    https://subdomain.domain.com/adfd386be957c3247/domain_p6amv8xJVr1wvilqto3_100.txt\n" +
"\n" +
"    https://subdomain.domain.com/25e5ccd5e95ca2888a39b939f199b822/domain_p6amv8xJVr1ilqto4_640.txt\n" +
"    https://subdomain.domain.com/25e5ccd5e95ca2888a39b939f199b822/domain_p6amv8xJVr1ilqto4_540.txt\n" +
"    https://subdomain.domain.com/25e5ccd5e95ca2888a39b939f199b822/domain_p6amv8xJVr1ilqto4_980.csv\n" +
"\n";

// This regex matches a block each time
var RxBlock = new Regex(@"(?m)(?:^[^\S\r\n]*(https?://\S+?_)(\d+)\.(txt|csv)[^\S\r\n]*$\r?\n)+(?=\s*\r\n|$)");

Match M = RxBlock.Match(str);
while (M.Success)
{
    CaptureCollection ccFileLoc = M.Groups[1].Captures;  // location
    CaptureCollection ccFileNum = M.Groups[2].Captures;  // number
    CaptureCollection ccFileExt = M.Groups[3].Captures;  // extension

    String Loc = ccFileLoc[0].Value;
    String Ext = ccFileExt[0].Value;
    int Largest = 0;
    bool bValid = true;

    if (Int32.TryParse(ccFileNum[0].Value, out Largest))
    {
        int cur_num = 0;
        int cnt = ccFileLoc.Count;

        for (int i = 0; bValid && i < cnt; i++)
        {
            if (!Int32.TryParse(ccFileNum[i].Value, out cur_num) || ccFileLoc[i].Value != Loc)
                bValid = false;
            else
            if (cur_num > Largest)
            {
                Largest = cur_num;
                Ext = ccFileExt[i].Value;
            }
        }
    }
    else
        bValid = false;

    if ( bValid )
        Console.WriteLine("{0}{1}.{2}", Loc, Largest, Ext);

    M = M.NextMatch();
}

Output

https://subdomain.domain.com/e8cf09b4763e03d208dfd21121baacd4/domain_p6amv8xJVr1qto1_500.txt
https://subdomain.domain.com/163c7b0508062729dsdk1f1e264210/domain_p6amv8xJVr1wvilqto2_1280.txt
https://subdomain.domain.com/adfd386be957c3247/domain_p6amv8xJVr1wvilqto3_250.txt
https://subdomain.domain.com/25e5ccd5e95ca2888a39b939f199b822/domain_p6amv8xJVr1ilqto4_980.csv

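Even if your data is not sorted, you can still use a regex this way. The lines just have to be sorted first, and the regex then needs a slight modification. If you want to do it that way, let me know and I can probably show you how.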
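
For input that is not grouped into blocks, one alternative that needs neither sorting nor the block regex is to match each line on its own and group the matches by the captured location. The sketch below is only an assumption about how that could look; it is not the modification the answer alludes to, and the names LargestPerPrefix and Pick are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class LargestPerPrefix   // hypothetical helper, not part of the answer
{
    // Same three captures as the block regex, applied to one line at a time:
    // (1) location up to the last '_', (2) number, (3) extension.
    private static readonly Regex RxLine = new Regex(
        @"^[^\S\r\n]*(https?://\S+?_)([0-9]+)\.(txt|csv)[^\S\r\n]*\r?$",
        RegexOptions.Multiline);

    public static List<string> Pick(string text)
    {
        return RxLine.Matches(text)
            .Cast<Match>()
            // Lines with the same location belong together, wherever they appear.
            .GroupBy(m => m.Groups[1].Value)
            // Keep the line with the largest number in each group.
            .Select(g => g.OrderByDescending(m => long.Parse(m.Groups[2].Value)).First())
            // Rebuild the URL without the surrounding whitespace.
            .Select(m => m.Groups[1].Value + m.Groups[2].Value + "." + m.Groups[3].Value)
            .ToList();
    }
}
```

Calling LargestPerPrefix.Pick(str) on the sample string above should yield one URL per location, in the order in which each location first appears. Lines that do not look like a URL ending in _<number>.txt or _<number>.csv are ignored.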

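Answer 1: (score: 0)

You could use a negated character class [^,:"]+ to match any character that is not a comma, colon, or double quote. I don't think you need the ? to make it non-greedy.

Then match 1+ digits followed by an underscore and any of the numbers listed in an alternation (?:500|1280|980)

For the example data, instead of matching an underscore or digit 0+ times non-greedily with [_\d]*?, you can match 1+ digits followed by an underscore with \d+_

Note the escaped dot \. so that it matches a literal dot.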

https://subdomain\.domain\.com/[^,:"]+\d+_(?:500|1280|980)\.(?:txt|csv)
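
A minimal sketch of how this pattern might be applied in C#; the class and method names (HardCodedMaxima, Find) are made up for illustration. Because the maxima are hard-coded, the pattern only works when they are known in advance: on the sample input it finds the _500, _1280 and _980 lines but not the _250 line of the third group, and adding 250 to the alternation would also match the _250 line of the first group.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class HardCodedMaxima   // hypothetical name, not part of the answer
{
    // The answer's pattern; inside a C# verbatim string a double quote is written as "".
    private static readonly Regex Rx = new Regex(
        @"https://subdomain\.domain\.com/[^,:""]+\d+_(?:500|1280|980)\.(?:txt|csv)");

    // Returns every URL in the text whose trailing number is one of the listed values.
    public static List<string> Find(string text)
    {
        return Rx.Matches(text).Cast<Match>().Select(m => m.Value).ToList();
    }
}
```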


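Answer 2: (score: 0)

As far as I understand, it is nearly impossible to achieve this with regex alone, so I implemented it in C# with LINQ, without regex. Thanks to Burdui; I came up with this while trying your suggestion.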

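    // Requires: using System; using System.Collections.Generic; using System.Linq;
    // Assumes every URL contains '_' followed by a number and an extension.
    public List<string> FindUnique(List<string> Urls)
    {
        // Number between the last '_' and the extension, e.g. 500 in "..._500.txt"
        Func<string, int> number = u =>
            Int32.Parse(u.Substring(u.LastIndexOf('_') + 1).Split('.')[0]);

        return Urls.Distinct()
            // Group by everything up to and including the last '_'
            .GroupBy(u => u.Substring(0, u.LastIndexOf('_') + 1))
            // Keep the URL with the largest number in each group. Comparing the
            // parsed numbers avoids the false hits that a Contains() check on the
            // number's text can produce when the same digits occur elsewhere in the URL.
            .Select(g => g.OrderByDescending(number).First())
            .ToList();
    }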