Question

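I originally asked this question: Regular Expression in gVim to Remove Duplicate Domains from a List

However, I realized that I'm more likely to find a working solution if I "broaden my scope" in terms of the solutions I'm willing to accept.

So I'll rephrase my question, and maybe I'll get a better solution. Here goes:

I have a large list of URLs in a .txt file (I'm running Windows Vista 32-bit), and I need to remove duplicate DOMAINS (along with the entire corresponding URL of each duplicate) while leaving behind the first occurrence of each domain. There are roughly 6,000,000 URLs in this particular file, in the following format (the URLs obviously don't contain spaces; I only had to add them when posting because I don't have enough posts here to publish that many "live" URLs):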

http://www.exampleurl.com/something.php
http://exampleurl.com/somethingelse.htm  
http://exampleurl2.com/another-url  
http://www.exampleurl2.com/a-url.htm  
http://exampleurl2.com/yet-another-url.html  
http://exampleurl.com/  
http://www.exampleurl3.com/here_is_a_url  
http://www.exampleurl5.com/something

Whatever the solution is, the output file for the input above should be:

http://www.exampleurl.com/something.php  
http://exampleurl2.com/another-url  
http://www.exampleurl3.com/here_is_a_url  
http://www.exampleurl5.com/something

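Notice that there are now no duplicate domains, and that the first occurrence of each domain is the one left behind.

If anyone can help me out, whether with regular expressions or with some program I'm not aware of, that would be great.

I will say this, though: I have no experience with anything other than the Windows OS, so a solution involving something other than a Windows program would take a little "baby stepping", so to speak (if anyone is kind enough to do so).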

Answer 1

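A regular expression in Python, very raw, and it does not work with subdomains. The basic concept is to use a dictionary whose keys are the domain names and whose values are the URLs. A URL is stored only the first time its domain is seen, so the first occurrence is the one kept (assigning on every line would overwrite the value and keep the last occurrence instead).

import re

# Capture the label after an optional "www." prefix, e.g. "exampleurl" in
# http://www.exampleurl.com/...; very crude, and subdomains are not handled.
pattern = re.compile(r'https?://(?:www\.)?([\w-]+)\.\w+')
urlsDict = {}

with open("urlsin.txt", "r") as urlsFile:
    for linein in urlsFile:
        match = pattern.search(linein)
        if match is None:
            continue  # skip lines that do not look like URLs
        domain = match.group(1)
        # Keep only the first URL seen for each domain.
        urlsDict.setdefault(domain, linein)

with open("outurls.txt", "w") as outFile:
    # Dicts keep insertion order in Python 3.7+, so input order is preserved.
    outFile.write("".join(urlsDict.values()))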

You could extend this to filter out subdomains, but I think the basic idea is there. For 6 million URLs it may take a while in Python...
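One possible way to extend the idea to subdomains is sketched below. It is only a rough illustration: `domain_key` and `first_per_domain` are made-up helper names, and reducing a host to its last two labels is an assumption that breaks for suffixes such as "co.uk" (those would need a public-suffix list).

```python
import re

# Host part of an http(s) URL: everything up to the first "/", ":" or whitespace.
HOST_RE = re.compile(r'^https?://([^/:\s]+)', re.IGNORECASE)

def domain_key(url):
    """Naively reduce a URL to the last two labels of its host.

    "http://blog.exampleurl.com/x" -> "exampleurl.com".
    Returns None when the line does not look like an http(s) URL.
    """
    match = HOST_RE.match(url.strip())
    if match is None:
        return None
    labels = match.group(1).lower().split('.')
    return '.'.join(labels[-2:])

def first_per_domain(lines):
    """Return the first line seen for each domain key, in input order."""
    seen = {}
    for line in lines:
        key = domain_key(line)
        if key is not None and key not in seen:
            seen[key] = line
    return list(seen.values())
```

With this key, "www.exampleurl.com", "exampleurl.com" and "blog.exampleurl.com" all collapse to the same entry, which may or may not be what you want for hosts that share a domain but belong to different sites.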

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. --Jamie Zawinski, in comp.emacs.xemacs

Answer 2

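For this particular case I would not use a regex. URLs are a well-defined format, and the BCL already contains an easy-to-use parser for that format: the Uri type. It can be used to parse each URL and get the domain information you are looking for.

Here is a quick example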

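// Requires: using System; using System.Collections.Generic; using System.IO;
public List<string> GetUrlWithUniqueDomain(string file) {
  // Declared outside the using block so it is still in scope at the return.
  var list = new List<string>();
  var found = new HashSet<string>();
  using ( var reader = new StreamReader(file) ) {
    string line;
    while ((line = reader.ReadLine()) != null) {
      Uri uri;
      // HashSet.Add returns false when the host has already been seen.
      // Note: uri.Host keeps any "www." prefix, so www.exampleurl.com and
      // exampleurl.com count as different hosts here.
      if ( Uri.TryCreate(line, UriKind.Absolute, out uri) && found.Add(uri.Host)) {
        list.Add(line);
      }
    }
  }
  return list;
}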
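For comparison, the same parser-based idea can be sketched in Python with the standard library's `urllib.parse.urlsplit`. This is an illustration rather than a translation of the code above: the function names and file paths are hypothetical, and stripping a leading "www." is an added assumption so that the sample input from the question produces the sample output.

```python
from urllib.parse import urlsplit

def host_key(url):
    """Return the lower-cased host without a leading "www.", or None."""
    try:
        host = urlsplit(url.strip()).hostname
    except ValueError:
        return None  # malformed URL, e.g. an unbalanced IPv6 bracket
    if not host:
        return None
    return host[4:] if host.startswith("www.") else host

def unique_by_host(lines):
    """Yield each line whose host has not been seen before."""
    seen = set()
    for line in lines:
        key = host_key(line)
        if key is not None and key not in seen:
            seen.add(key)
            yield line

def dedupe_file(src_path, dst_path):
    # Both paths are placeholders. Lines are streamed one at a time;
    # only the set of hosts is held in memory.
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(unique_by_host(src))
```

Because `unique_by_host` is a generator, a file with millions of lines never has to be loaded in full; memory use grows with the number of distinct hosts, not with the number of URLs.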

Answer 3

I would use a combination of Perl and regexps. My first version was

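   use warnings ;
   use strict ;
   my %seen ;
   while (<>) {
       if ( m{ // ( .*? ) / }x ) {
           my $dom = $1 ;
           # print the line only the first time this domain is seen
           print unless $seen{$dom}++ ;
       } else {
           # report problems on STDERR so they stay out of the output file
           warn "Unrecognised line: $_" ;
       }
   }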

But this treats www.exampleurl.com and exampleurl.com as different. My second version has

if ( m{ // (?:www\.)? ( .*? ) / }x )

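which ignores a leading "www.". You could probably refine the regexp a bit, but that is left to the reader.

Finally, you could comment the regexp a bit (the /x qualifier allows this). It depends on who is going to read it - it could be regarded as too verbose.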

           if ( m{
               //          # match double slash
               (?:www\.)?  # ignore www
               (           # start capture
                  .*?      # anything but not greedy
                )          # end capture
                /          # match /
               }x ) {

I use m{} rather than // to avoid /\/\/
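For readers more at home in Python, a rough equivalent of the commented pattern can be written with the `re.VERBOSE` flag, which plays the same role as /x. This is a sketch only; `DOMAIN_RE` and `domain_of` are names invented for the illustration.

```python
import re

# re.VERBOSE ignores whitespace in the pattern and allows "#" comments.
DOMAIN_RE = re.compile(r"""
    //            # match double slash
    (?:www\.)?    # ignore a leading www.
    ( .*? )       # capture anything, non-greedy
    /             # up to the next slash
""", re.VERBOSE)

def domain_of(url):
    """Return the captured domain, or None if the pattern does not match."""
    match = DOMAIN_RE.search(url)
    return match.group(1) if match else None
```

Like the Perl pattern, this requires a slash after the host, so a bare "http://exampleurl.com" with no trailing slash is not recognised.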

Answer 4

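Get hold of a unix box if you don't have one, or get cygwin.
Use tr to convert '.' to TAB for convenience.
Use sort(1) to sort the lines by the domain name part. This might be made easier by writing an awk program to normalize the www part.

Ça va, you now have the duplicates together. Perhaps use uniq(1) to find the duplicates.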
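The sort-then-uniq idea above can also be sketched in Python, which may be easier to try on Windows. This is an illustration under assumptions, not the tr/sort/uniq pipeline itself: the helper names are hypothetical, hosts are normalized by dropping a leading "www.", the input URLs are assumed to be well formed, and the original position is carried along so the first occurrence wins and input order can be restored.

```python
from itertools import groupby
from urllib.parse import urlsplit

def norm_host(url):
    """Lower-cased host without a leading "www." ("" if there is no host)."""
    host = urlsplit(url.strip()).hostname or ""
    return host[4:] if host.startswith("www.") else host

def dedupe_by_sorting(urls):
    # Decorate each URL with its normalized host and original position,
    # skipping lines that have no host at all.
    decorated = []
    for index, url in enumerate(urls):
        host = norm_host(url)
        if host:
            decorated.append((host, index, url))
    # Sorting brings equal hosts together, earliest occurrence first.
    decorated.sort()
    # Like uniq on the host column: keep the first row of each group.
    firsts = [next(group) for _, group in groupby(decorated, key=lambda row: row[0])]
    # Restore the original input order.
    firsts.sort(key=lambda row: row[1])
    return [url for _, _, url in firsts]
```

Unlike a hash-based approach, this holds every line in memory at once, so for 6,000,000 URLs an external sort such as sort(1) may still be the more practical tool.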

(Extra credit: why can't a regular expression alone do this? Students of computer science should think about the pumping lemmas.)

How to remove duplicate domains from a large list of URLs? RegEx or otherwise
