I'm trying to find a way to use a regular expression to pick out specific strings from a list like the one below.
Input strings:
https://subdomain.domain.com/e8cf09b4763e03d208dfd21121baacd4/domain_p6amv8xJVr1qto1_500.txt
https://subdomain.domain.com/e8cf09b4763e03d208dfd21121baacd4/domain_p6amv8xJVr1qto1_400.txt
https://subdomain.domain.com/e8cf09b4763e03d208dfd21121baacd4/domain_p6amv8xJVr1qto1_250.txt
https://subdomain.domain.com/e8cf09b4763e03d208dfd21121baacd4/domain_p6amv8xJVr1qto1_10.txt
https://subdomain.domain.com/163c7b0508062729dsdk1f1e264210/domain_p6amv8xJVr1wvilqto2_640.txt
https://subdomain.domain.com/163c7b0508062729dsdk1f1e264210/domain_p6amv8xJVr1wvilqto2_1280.txt
https://subdomain.domain.com/163c7b0508062729dsdk1f1e264210/domain_p6amv8xJVr1wvilqto2_540.txt
https://subdomain.domain.com/adfd386be957c3247/domain_p6amv8xJVr1wvilqto3_250.txt
https://subdomain.domain.com/adfd386be957c3247/domain_p6amv8xJVr1wvilqto3_100.txt
https://subdomain.domain.com/25e5ccd5e95ca2888a39b939f199b822/domain_p6amv8xJVr1ilqto4_640.txt
https://subdomain.domain.com/25e5ccd5e95ca2888a39b939f199b822/domain_p6amv8xJVr1ilqto4_540.txt
https://subdomain.domain.com/25e5ccd5e95ca2888a39b939f199b822/domain_p6amv8xJVr1ilqto4_980.csv
Expected output:
https://subdomain.domain.com/e8cf09b4763e03d208dfd21121baacd4/domain_p6amv8xJVr1qto1_500.txt
https://subdomain.domain.com/163c7b0508062729dsdk1f1e264210/domain_p6amv8xJVr1wvilqto2_1280.txt
https://subdomain.domain.com/adfd386be957c3247/domain_p6amv8xJVr1wvilqto3_250.txt
https://subdomain.domain.com/25e5ccd5e95ca2888a39b939f199b822/domain_p6amv8xJVr1ilqto4_980.csv
I'm trying the expression below, but it returns all of the URLs. How can I limit the result to the ones I want?
"https://subdomain.domain.com/([^,:"]+?([_\d]*?)).(txt|csv)"
Answer 0: (score: 1)
If your input really is grouped the way it appears in your question, this is easy to do with a regex.
@"(?m)(?:^[^\S\r\n]*(https?://\S+?_)(\d+)\.(txt|csv)[^\S\r\n]*$\r?\n)+(?=\s*\r\n|$)"
Explanation
(?m)
(?: # Cluster group for block
^ # BOL
[^\S\r\n]* # Optional horizontal whitespace
( https?:// \S+? _ ) # (1), Location
( \d+ ) # (2), Number
\.
( txt | csv ) # (3), Extension
[^\S\r\n]* # Optional horizontal whitespace
$ \r? \n # EOL plus linebreak
)+ # End cluster, 1 to many times
(?= \s* \r \n | $ ) # Lookahead to determine where the end of block is
C# code example
using System;
using System.Text.RegularExpressions;

var str =
" https://subdomain.domain.com/e8cf09b4763e03d208dfd21121baacd4/domain_p6amv8xJVr1qto1_500.txt\n" +
" https://subdomain.domain.com/e8cf09b4763e03d208dfd21121baacd4/domain_p6amv8xJVr1qto1_400.txt\n" +
" https://subdomain.domain.com/e8cf09b4763e03d208dfd21121baacd4/domain_p6amv8xJVr1qto1_250.txt\n" +
" https://subdomain.domain.com/e8cf09b4763e03d208dfd21121baacd4/domain_p6amv8xJVr1qto1_10.txt\n" +
"\n" +
" https://subdomain.domain.com/163c7b0508062729dsdk1f1e264210/domain_p6amv8xJVr1wvilqto2_640.txt\n" +
" https://subdomain.domain.com/163c7b0508062729dsdk1f1e264210/domain_p6amv8xJVr1wvilqto2_1280.txt\n" +
" https://subdomain.domain.com/163c7b0508062729dsdk1f1e264210/domain_p6amv8xJVr1wvilqto2_540.txt\n" +
"\n" +
" https://subdomain.domain.com/adfd386be957c3247/domain_p6amv8xJVr1wvilqto3_250.txt\n" +
" https://subdomain.domain.com/adfd386be957c3247/domain_p6amv8xJVr1wvilqto3_100.txt\n" +
"\n" +
" https://subdomain.domain.com/25e5ccd5e95ca2888a39b939f199b822/domain_p6amv8xJVr1ilqto4_640.txt\n" +
" https://subdomain.domain.com/25e5ccd5e95ca2888a39b939f199b822/domain_p6amv8xJVr1ilqto4_540.txt\n" +
" https://subdomain.domain.com/25e5ccd5e95ca2888a39b939f199b822/domain_p6amv8xJVr1ilqto4_980.csv\n" +
"\n";
// This regex matches a block each time
var RxBlock = new Regex(@"(?m)(?:^[^\S\r\n]*(https?://\S+?_)(\d+)\.(txt|csv)[^\S\r\n]*$\r?\n)+(?=\s*\r\n|$)");
Match M = RxBlock.Match(str);
while (M.Success)
{
    CaptureCollection ccFileLoc = M.Groups[1].Captures; // location
    CaptureCollection ccFileNum = M.Groups[2].Captures; // number
    CaptureCollection ccFileExt = M.Groups[3].Captures; // extension
    String Loc = ccFileLoc[0].Value;
    String Ext = ccFileExt[0].Value;
    int Largest = 0;
    bool bValid = true;
    if (Int32.TryParse(ccFileNum[0].Value, out Largest))
    {
        int cur_num = 0;
        int cnt = ccFileLoc.Count;
        for (int i = 0; bValid && i < cnt; i++)
        {
            // Every line in the block must share the same location and have a parsable number
            if (!Int32.TryParse(ccFileNum[i].Value, out cur_num) || ccFileLoc[i].Value != Loc)
                bValid = false;
            else
                if (cur_num > Largest)
                {
                    Largest = cur_num;
                    Ext = ccFileExt[i].Value;
                }
        }
    }
    else
        bValid = false;
    if ( bValid )
        Console.WriteLine("{0}{1}.{2} ", Loc, Largest, Ext);
    M = M.NextMatch();
}
Output
https://subdomain.domain.com/e8cf09b4763e03d208dfd21121baacd4/domain_p6amv8xJVr1qto1_500.txt
https://subdomain.domain.com/163c7b0508062729dsdk1f1e264210/domain_p6amv8xJVr1wvilqto2_1280.txt
https://subdomain.domain.com/adfd386be957c3247/domain_p6amv8xJVr1wvilqto3_250.txt
https://subdomain.domain.com/25e5ccd5e95ca2888a39b939f199b822/domain_p6amv8xJVr1ilqto4_980.csv
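Even if your data isn't grouped like this, you can still use the regex this way.
The lines just have to be sorted first.
After that, the regex needs a slight modification. If you want to go
that way, let me know and I can show you how.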
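For input that is not grouped into blocks, one alternative is to match each line on its own and keep the largest number seen per location. This is not the modified block regex mentioned above, only a sketch of a different approach; the names LargestPerLocation and Pick are hypothetical, and the per-line pattern is an assumed adaptation that reuses the three capture groups of the block regex.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LargestPerLocation
{
    // Per-line variant of the block pattern: (1) location, (2) number, (3) extension.
    private static readonly Regex RxLine =
        new Regex(@"(?m)^[^\S\r\n]*(https?://\S+?_)(\d+)\.(txt|csv)[^\S\r\n]*\r?$");

    public static List<string> Pick(string text)
    {
        var order = new List<string>();                              // locations in first-seen order
        var best = new Dictionary<string, (int Num, string Url)>();  // largest entry per location

        foreach (Match m in RxLine.Matches(text))
        {
            string loc = m.Groups[1].Value;
            if (!int.TryParse(m.Groups[2].Value, out int num))
                continue;                                            // skip numbers that do not fit in an int

            string url = loc + m.Groups[2].Value + "." + m.Groups[3].Value;
            if (!best.TryGetValue(loc, out var cur))
            {
                order.Add(loc);
                best[loc] = (num, url);
            }
            else if (num > cur.Num)
            {
                best[loc] = (num, url);
            }
        }

        var result = new List<string>();
        foreach (string loc in order)
            result.Add(best[loc].Url);
        return result;
    }
}
```

Because each line is handled independently, the order of the lines and the blank lines between groups do not matter; the trade-off is that the selection logic lives in the dictionary rather than in the regex.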
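Answer 1: (score: 0)
You could use a negated character class [^,:"]+ to match anything that is not a comma, colon, or double quote. I don't think you need the lazy quantifier ? there.
Then match 1+ digits followed by an underscore and any of the numbers listed in an alternation, (?:500|1280|980).
For your example data, instead of lazily matching an underscore or a digit 0+ times with [_\d]*?, you could match 1+ digits followed by an underscore with \d+_
Note that the dot should be escaped as \. so that it matches a literal dot.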
https://subdomain\.domain\.com/[^,:"]+\d+_(?:500|1280|980)\.(?:txt|csv)
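To see what this pattern selects, it can be run against a few of the sample URLs. The sketch below is only an illustration; Answer1Demo and Matching are hypothetical names. Since the sizes are hard-coded in the alternation, only URLs ending in _500, _1280, or _980 match, so the expected ..._qto3_250.txt entry is not selected, and adding 250 to the alternation would also select ..._qto1_250.txt from the first group.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class Answer1Demo
{
    // The pattern from the answer; the double quote is doubled inside the verbatim string.
    private static readonly Regex Rx = new Regex(
        @"https://subdomain\.domain\.com/[^,:""]+\d+_(?:500|1280|980)\.(?:txt|csv)");

    // Returns the URLs that the pattern matches, in input order.
    public static List<string> Matching(IEnumerable<string> urls)
    {
        return urls.Where(u => Rx.IsMatch(u)).ToList();
    }
}
```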
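Answer 2: (score: 0)
As far as I understand, it is nearly impossible to achieve this with a regex alone, so I implemented it in C# with LINQ and no regex. Thanks to Burdui; I came up with this while trying your suggestion.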
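// Requires: using System; using System.Collections.Generic; using System.Linq;
public List<string> FindUnique(List<string> Urls)
{
    // Group the distinct URLs by everything before the last underscore,
    // then keep the URL with the largest trailing number from each group.
    // Sorting on the parsed number avoids matching the maximum as a substring
    // elsewhere in the URL (for example "10" inside "100").
    return Urls.Distinct()
        .GroupBy(x => x.Substring(0, x.LastIndexOf('_')))
        .Select(g => g.OrderByDescending(u =>
            Int32.Parse(u.Substring(u.LastIndexOf('_') + 1).Split('.')[0])).First())
        .ToList();
}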