Question

I need an algorithm to find matching pairs of objects in a list. Here is an example case:

class Human 
{
   int ID;
   string monthOfBirth;
   string country;
   string [] hobbies = {};
}

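There are a great many humans, and the problem is to find the matching pairs of humans. This has to be done efficiently, because the list is large.

Matching criteria:

1. Month of birth and country must match exactly.
2. The two humans should have more than x% of their hobbies in common.

Because of criterion (2), we cannot do an exact equality comparison.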
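The two criteria can be written as a single predicate. The sketch below is only one possible reading, not code from the question: it assumes that "more than x% of hobbies in common" means the shared hobbies exceed x% of each person's own hobby count, and the names MatchCriteria, isMatch and thresholdPercent are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class MatchCriteria {
    /** Minimal stand-in for the Human class in the question. */
    static final class Human {
        final int id;
        final String monthOfBirth;
        final String country;
        final Set<String> hobbies;

        Human(int id, String monthOfBirth, String country, String... hobbies) {
            this.id = id;
            this.monthOfBirth = monthOfBirth;
            this.country = country;
            this.hobbies = new HashSet<>(Arrays.asList(hobbies));
        }
    }

    /** Assumption: shared hobbies must exceed thresholdPercent of BOTH hobby sets. */
    static boolean isMatch(Human a, Human b, int thresholdPercent) {
        // Criterion 1: exact match on month of birth and country.
        if (!a.monthOfBirth.equals(b.monthOfBirth) || !a.country.equals(b.country)) {
            return false;
        }
        // Criterion 2: count the hobbies the two have in common.
        Set<String> shared = new HashSet<>(a.hobbies);
        shared.retainAll(b.hobbies);
        int common = shared.size();
        // Cross-multiplied form of common / size > threshold / 100,
        // which also sidesteps division by zero for empty hobby sets.
        return common * 100 > thresholdPercent * a.hobbies.size()
            && common * 100 > thresholdPercent * b.hobbies.size();
    }
}
```

With this predicate the brute-force approach is simply a call to isMatch for every pair; the other approaches only reduce how many pairs reach it.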

The approaches I can think of are:

1. Brute force - compare every object with every other object. Complexity O(n^2)
2. Hash table

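For the hash table approach, I am thinking of the following:

1. Create a map of <String, List<Human>> (or a MultiMap)
2. Concatenate each human's month of birth and country into a single String
3. Use this concatenated String to hash into the map (two humans with the same month of birth and country must give the same hash code)
4. If there is already an element, compare the hobbies for an x% match
5. If they match, this is a duplicate
6. If the hobbies do not match more than x%, add this human (linked-list approach)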

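Is there a better way to do this?

Does it make sense to concatenate month and country? The list will be large, so I am assuming that 'better' means in terms of the amount of storage, not execution speed.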

Answer 1

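First, you need to sort the humans into buckets by monthOfBirth + country. This should be fairly cheap: just iterate over them, popping each one into the appropriate bucket.

Note that appending strings is the "hacky" way of approaching this. The "proper" way is to create a key object with a proper hashCode method: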

 public class MonthCountryKey {
     String monthOfBirth;
     String country;
     // <snip> constructor, setters 
     @Override public int hashCode() {
         return Arrays.hashCode(new Object[] {
            monthOfBirth, 
            country,
         });
     }
     @Override public boolean equals(Object o) {
         ...
     }
 }

See: What is a best practice of writing hash function in java?
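For reference, here is one way the elided parts of such a key class could be filled in. This is a sketch under assumptions, not the answer's own code: the class is named MonthCountryKeySketch to keep it distinct, the fields are made immutable, and java.util.Objects is used for hashing and null-safe comparison.

```java
import java.util.Objects;

final class MonthCountryKeySketch {
    private final String monthOfBirth;
    private final String country;

    MonthCountryKeySketch(String monthOfBirth, String country) {
        this.monthOfBirth = monthOfBirth;
        this.country = country;
    }

    @Override
    public int hashCode() {
        // Combines both fields into one hash code.
        return Objects.hash(monthOfBirth, country);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof MonthCountryKeySketch)) return false;
        MonthCountryKeySketch other = (MonthCountryKeySketch) o;
        // Keys are equal only when both fields are equal.
        return Objects.equals(monthOfBirth, other.monthOfBirth)
            && Objects.equals(country, other.country);
    }
}
```

Keeping the fields final matters for a map key: if a key changed after insertion, its hash code would change and the entry could no longer be found.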

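// needs java.util.ArrayList, HashMap, List and Map
Map<MonthCountryKey, List<Human>> buckets = new HashMap<MonthCountryKey, List<Human>>();

// humanSource.get() is assumed to return null once the source is exhausted
Human human;
while ((human = humanSource.get()) != null) {
    MonthCountryKey key = new MonthCountryKey(human.getMonthOfBirth(), human.getCountry());
    List<Human> list = buckets.get(key);
    if (list == null) {
       list = new ArrayList<Human>();
       buckets.put(key, list);
    }
    list.add(human);
}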
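On Java 8 or later the get/null-check/put sequence can be collapsed with Map.computeIfAbsent. The sketch below is self-contained and therefore makes two simplifying assumptions that are not in the answer: each human is a plain {id, monthOfBirth, country} String array, and a two-element List serves as the composite key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class BucketSketch {
    /**
     * Groups rows by (monthOfBirth, country). Each row is a hypothetical
     * {id, monthOfBirth, country} array standing in for a Human.
     */
    static Map<List<String>, List<String[]>> bucket(List<String[]> humans) {
        Map<List<String>, List<String[]>> buckets = new HashMap<>();
        for (String[] h : humans) {
            // A List has value-based equals and hashCode, so it can act
            // as a composite key without a dedicated key class.
            List<String> key = Arrays.asList(h[1], h[2]);
            // Creates the bucket on first use, then appends to it.
            buckets.computeIfAbsent(key, k -> new ArrayList<>()).add(h);
        }
        return buckets;
    }
}
```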

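Note that there are other kinds of Set. For example, new TreeSet(monthCountryHumanComparator) or, using Apache BeanUtils, new TreeSet(new BeanComparator("monthOfBirth.country"))!

If there are really a lot of humans, it could be worth storing the buckets in a database - SQL or otherwise, as you see fit. You just need to be able to get at them reasonably quickly by bucket and list index number.

Then you can apply a hobby-matching algorithm to each bucket in turn, greatly reducing the scale of the brute-force search.

I can't see a way to avoid comparing every human in a bucket with every other human in the same bucket, but you can do some work to make each comparison cheap.

Consider encoding the hobbies as an integer, one bit per hobby. A long gives you up to 64 hobbies. If you need more, you'll need more integers or a BigInteger (benchmark both approaches). You can build up the dictionary of bit positions to hobbies as you work through the humans and encounter new hobbies. Comparing two sets of hobbies is then a cheap binary '&' followed by a Long.bitCount().

To illustrate, the first human has hobbies [ "cooking", "cinema" ]

So the right-hand bit is "cooking", the next bit to the left is "cinema", and this human's encoded hobbies are binary {60 zeroes}0011 == 3

The next human likes [ "cooking", "fishing" ]

So fishing gets added to the dictionary, and this human's encoded hobbies are {60 zeroes}0101 == 5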

 public long encodeHobbies(List<String> hobbies, BitPositionDictionary dict) {
      long encoded = 0;
      for(String hobby : hobbies) {
          int pos = dict.getPosition(hobby); // if not found, allocates new
          // OR the hobby's bit in; 1L keeps the shift 64 bits wide
          encoded |= (1L << pos);
      }
      return encoded;
 }

... with ...

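 import java.util.HashMap;
 import java.util.Map;

 public class BitPositionDictionary {
     private Map<String,Integer> positions = new HashMap<String,Integer>();
     private int nextPosition;
     // Returns the bit position for a hobby, allocating the next free one if unseen
     public int getPosition(String s) {
         Integer i = positions.get(s);
         if(i == null) {
             i = nextPosition;
             positions.put(s,i); // key is the hobby, value is its bit position
             nextPosition++;
         }
         return i;
     }
 }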

Binary & them to get {60 zeroes}0001; Long.bitCount(1) == 1. These two humans have one hobby in common.
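The worked example can be reproduced end to end. The following is a minimal sketch that folds the dictionary and the encoder into one hypothetical class, HobbyBitsExample, and adds a guard for running out of bits that the answer does not include.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class HobbyBitsExample {
    private final Map<String, Integer> positions = new HashMap<>();

    /** Encodes a hobby list as a bit mask, giving unseen hobbies the next free bit. */
    long encode(List<String> hobbies) {
        long encoded = 0L;
        for (String hobby : hobbies) {
            Integer pos = positions.get(hobby);
            if (pos == null) {
                pos = positions.size();
                if (pos >= Long.SIZE) {
                    throw new IllegalStateException("more than 64 distinct hobbies");
                }
                positions.put(hobby, pos);
            }
            encoded |= 1L << pos;
        }
        return encoded;
    }

    public static void main(String[] args) {
        HobbyBitsExample dict = new HobbyBitsExample();
        long first = dict.encode(Arrays.asList("cooking", "cinema"));
        long second = dict.encode(Arrays.asList("cooking", "fishing"));
        // prints "3 5 1": the two encodings and their number of shared hobbies
        System.out.println(first + " " + second + " " + Long.bitCount(first & second));
    }
}
```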

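To process your third human, [ "fishing", "clubbing", "chess" ], your costs are:

1. Adding to the hobby->bit position dictionary and encoding as an integer
2. Comparing against all the binary-encoded hobbies created so far

You'll want to store your binary-encoded hobbies somewhere really cheap to access. I'd be tempted to just use an array of long, with a corresponding human index: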

  long[] hobbies = new long[numHumans];
  int size = 0;
  for(int i = 0; i<numHumans; i++) {
      long hobby = encodeHobbies(humans.get(i).getHobbies(),
                                 bitPositionDictionary);
      // compare against every encoding stored so far
      for(int j = 0; j<size; j++) {
          if(enoughBitsInCommon(hobbies[j], hobby)) {
              // just record somewhere cheap for later processing
              handleMatch(i,j); 
          }
      }
      hobbies[size++] = hobby;
  }

With ...

  // Clearly this could be extended to encodings of more than one long
  static boolean enoughBitsInCommon(long x, long y) {
      int numHobbiesX = Long.bitCount(x);
      if (numHobbiesX == 0) {
          return false; // nothing to share; also avoids dividing by zero
      }
      int hobbiesInCommon = Long.bitCount(x & y);
      // used 128 in the hope that compiler will optimise!
      // MATCH_THRESHOLD is therefore on a 0-128 scale, not a percentage
      return ((hobbiesInCommon * 128) / numHobbiesX ) > MATCH_THRESHOLD;
  }
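The check above divides only by x's hobby count, so its result can depend on argument order. If the requirement is read as "more than x% of both humans' hobbies", a symmetric variant is possible. The sketch below assumes that reading; the name enoughInCommonBothWays and the percentage-based thresholdPercent parameter are illustrative rather than part of the answer.

```java
final class HobbyThreshold {
    /**
     * Returns true when the shared hobbies exceed thresholdPercent of both
     * x's and y's hobby counts. Cross-multiplying avoids integer division
     * and the divide-by-zero case of an empty hobby set.
     */
    static boolean enoughInCommonBothWays(long x, long y, int thresholdPercent) {
        int common = Long.bitCount(x & y);
        return common * 100 > thresholdPercent * Long.bitCount(x)
            && common * 100 > thresholdPercent * Long.bitCount(y);
    }
}
```

For example, with encodings 1 and 7 the single shared hobby is 100% of the first set but only a third of the second, so a 40% threshold rejects the pair whichever way round the arguments are passed.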

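This way, if there are few enough hobby types to fit in a long, you can hold about 134 million sets of hobbies (8 bytes each) in a 1GB array :)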

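It should be blazingly fast; I reckon RAM access time is the bottleneck here. But it is still a brute-force search, and it remains O(n²) within each bucket.

If you're talking about really huge data sets, I suspect this approach is amenable to distributed processing with MapReduce or something similar.

Additional note: you could use a BitSet instead of long(s), and get more expressiveness, perhaps at the cost of some performance. Again, benchmark.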

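  long x,y;
  ...
  int numMatches = Long.bitCount(x & y);

  ... becomes

  BitSet x,y;
  ...
  // BitSet.and() returns void and modifies its receiver, so work on a copy
  BitSet common = (BitSet) x.clone();
  common.and(y);
  int numMatches = common.cardinality();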

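The number of positions at which two strings differ is called the Hamming distance, and there is an answered question on cstheory about searching for pairs that are close in Hamming distance: https://cstheory.stackexchange.com/questions/18516/find-all-pairs-of-values-that-are-close-under-hamming-distance - from what I understand of the accepted answer, it is an approach that will find a "very high proportion" of the matches rather than all of them, so I guess finding every match does require a brute-force search.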
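For hobby sets encoded as a long, the Hamming distance is a single XOR followed by a bit count. The snippet below is a small sketch with a hypothetical class name, included to show how the distance relates to the shared-hobby count used earlier: distance = bitCount(x) + bitCount(y) - 2 * bitCount(x & y).

```java
final class HammingSketch {
    /** Number of bit positions at which x and y differ. */
    static int hammingDistance(long x, long y) {
        // XOR leaves a 1 exactly where the two encodings disagree.
        return Long.bitCount(x ^ y);
    }
}
```

For the two example humans (encodings 3 and 5) the distance is 2: they disagree on "cinema" and "fishing" and agree on "cooking".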

Answer 2

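Hashing is generally the way to go. Instead of concatenating the month and country, you can cheat and simply add the hash codes of those two values together to form a combined hash code; that saves some processing effort and memory. You could also define .equals() for the record to implement the matching logic you have described, which would let a hash set check directly whether a matching entry already exists.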
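The hash-combining idea might look like the sketch below. The class and method names are hypothetical, and rows are plain {monthOfBirth, country} arrays for brevity. Different pairs can produce the same sum (swapping the two values is one guaranteed case), so the sketch keeps an exact field comparison for candidates drawn from the same bucket.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CombinedHashSketch {
    /** Combines the two field hashes by addition, with no concatenated String. */
    static int combinedHash(String monthOfBirth, String country) {
        return monthOfBirth.hashCode() + country.hashCode();
    }

    /** Buckets {monthOfBirth, country} rows by their combined hash. */
    static Map<Integer, List<String[]>> bucketByHash(List<String[]> rows) {
        Map<Integer, List<String[]>> buckets = new HashMap<>();
        for (String[] row : rows) {
            buckets.computeIfAbsent(combinedHash(row[0], row[1]), k -> new ArrayList<>())
                   .add(row);
        }
        return buckets;
    }

    /** Exact check still needed, because unequal pairs can share a hash. */
    static boolean sameMonthAndCountry(String[] a, String[] b) {
        return a[0].equals(b[0]) && a[1].equals(b[1]);
    }
}
```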

Answer 3

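This answer assumes you can write the brute-force method. There is room for optimisation, but in general this is the correct algorithm.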

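#include <map>
#include <string>
#include <utility>
#include <vector>

// OutputIt is any output iterator for matches, e.g. a std::back_insert_iterator.
// FindMatches_BruteForce is the per-bucket brute-force routine, assumed to exist.
template <typename OutputIt>
void FindMatches (std::vector <Human> const & input, OutputIt result)
{
  typedef std::pair <std::string, std::string> key_type;
  typedef std::vector <Human> Human_collection;

  typedef std::map <key_type, Human_collection> map_type;

  map_type my_map;

  // Group the humans by (monthOfBirth, country).
  for (std::vector <Human>::const_iterator ci = input.begin(); ci != input.end(); ++ci)
  {
    key_type my_key(ci->monthOfBirth, ci->country);

    my_map[my_key].push_back(*ci);
  }

  // Each value of my_map is now a collection of humans sharing the same birth statistics, which is the key.
  for (map_type::const_iterator ci = my_map.begin(); ci != my_map.end(); ++ci)
  {
    FindMatches_BruteForce (ci->second, result);
  }
}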

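There is plenty of room for efficiency gains here: you could copy pointers to the full objects, use some data structure other than a map, or just sort the input container in place. Algorithmically, though, I believe this is as good as it gets.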
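The sort-in-place alternative can be sketched as follows, in Java to match the first answer. It is an illustration under assumptions rather than a translation of the C++ above: rows are {monthOfBirth, country} String arrays, and SortedRunsSketch and sortAndFindRuns are made-up names. After sorting, each bucket is a contiguous run, so no map is needed.

```java
import java.util.ArrayList;
import java.util.List;

final class SortedRunsSketch {
    /**
     * Sorts rows of {monthOfBirth, country} in place, then returns the
     * [start, end) index range of each run of rows sharing both fields.
     */
    static List<int[]> sortAndFindRuns(List<String[]> rows) {
        rows.sort((a, b) -> {
            int byMonth = a[0].compareTo(b[0]);
            return byMonth != 0 ? byMonth : a[1].compareTo(b[1]);
        });
        List<int[]> runs = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= rows.size(); i++) {
            // A run ends at the end of the list or where either field changes.
            boolean endOfRun = i == rows.size()
                || !rows.get(i)[0].equals(rows.get(start)[0])
                || !rows.get(i)[1].equals(rows.get(start)[1]);
            if (endOfRun) {
                runs.add(new int[] {start, i});
                start = i;
            }
        }
        return runs;
    }
}
```

Each returned range can then be handed to the brute-force comparison. The trade-off is an O(n log n) sort in exchange for not holding a second copy of the data in a map.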

Efficiently finding matching pairs of objects
