A better way to compare strings

Date: 2012-05-24 18:46:32

Tags: java string apache-commons string-comparison levenshtein-distance

My question: what is the fastest way to compare two strings (quality matters too, but less so)?

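I'm looking for the most efficient way to compare two strings. Some of the strings I compare are over 5,000 characters long. I'm comparing a list of about 80 strings against another list of about 200 strings. It takes forever, even though I'm threading it. I use the StringUtils.getLevenshteinDistance(String s, String t) method from Apache Commons. My method is below. Is there a better way?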

private void compareMe() {
  List<String> compareStrings = MainController.getInstance().getCompareStrings();
  for (String compare : compareStrings) {
    int levenshteinDistance = StringUtils.getLevenshteinDistance(me, compare);
    if (bestScore > levenshteinDistance
          && levenshteinDistance > -1) {
      bestScore = levenshteinDistance; //global variable
      bestString = compare; //global variable
    }
  }
}

Here is a sample of two strings that should score well:

String 1:

SELECT 
CORP_VENDOR_NAME as "Corporate Vendor Name",
CORP_VENDOR_REF_ID as "Reference ID",
MERCHANT_ID as "Merchant ID",
VENDOR_CITY as "City",
VENDOR_STATE as "State",
VENDOR_ZIP as "Zip",
VENDOR_COUNTRY as "Country",
REMIT_VENDOR_NAME as "Remit Name",
REMIT_VENDOR_REF_ID as " Remit Reference ID",
VENDOR_PRI_UNSPSC_CODE as "Primary UNSPSC"
FROM DSS_FIN_USER.ACQ_VENDOR_DIM
WHERE VENDOR_REFERENCE_ID in 
(SELECT distinct CORP_VENDOR_REF_ID
FROM DSS_FIN_USER.ACQ_VENDOR_DIM
WHERE CORP_VENDOR_REF_ID = '${request.corp_vendor_id};')

String 2:

SELECT 
CORP_VENDOR_NAME as "Corporate Vendor Name",
CORP_VENDOR_REF_ID as "Reference ID",
MERCHANT_ID as "Merchant ID",
VENDOR_CITY as "City",
VENDOR_STATE as "State",
VENDOR_ZIP as "Zip",
VENDOR_COUNTRY as "Country",
REMIT_VENDOR_NAME as "Remit Name",
REMIT_VENDOR_REF_ID as " Remit Reference ID",
VENDOR_PRI_UNSPSC_CODE as "Primary UNSPSC"
FROM DSS_FIN_USER.ACQ_VENDOR_DIM
WHERE VENDOR_REFERENCE_ID in 
(SELECT distinct CORP_VENDOR_REF_ID
FROM DSS_FIN_USER.ACQ_VENDOR_DIM
WHERE CORP_VENDOR_REF_ID = 'ACQ-169013')

You will notice that the only difference is the '${request.corp_vendor_id};' at the end of the string. This gives it a score of 26 with the getLevenshteinDistance method.

1 answer:

Answer 0: (score: 2)

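You should look for shortcuts in your comparison logic that let you skip some computations. If you want the global minimum Levenshtein distance, you don't even need to compute the distance for a pair when the difference in the two strings' lengths is greater than your current best Levenshtein distance, because the length difference is a lower bound on the distance.

E.g., if your current best Levenshtein distance is 50, you can skip comparing two strings of lengths 100 and 180, because their Levenshtein distance is at least 80.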
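
The length-difference shortcut can be sketched roughly as follows. This is only an illustration, not the asker's actual code: the class name PrunedLevenshtein and the methods levenshtein and closest are made up for this example, and a hand-written two-row distance stands in for StringUtils.getLevenshteinDistance so the snippet runs without Apache Commons on the classpath.

```java
import java.util.Arrays;
import java.util.List;

public class PrunedLevenshtein {

    /** Plain two-row dynamic-programming Levenshtein distance (stand-in for the Apache Commons call). */
    static int levenshtein(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j; // distance from the empty prefix of s to the first j chars of t
        }
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[t.length()];
    }

    /** Returns the candidate closest to {@code me}, skipping pairs the length bound already rules out. */
    static String closest(String me, List<String> candidates) {
        int bestScore = Integer.MAX_VALUE;
        String bestString = null;
        for (String candidate : candidates) {
            // The distance can never be smaller than the difference in lengths.
            int lowerBound = Math.abs(me.length() - candidate.length());
            if (lowerBound >= bestScore) {
                continue; // cannot strictly beat the current best, so skip the expensive call
            }
            int distance = levenshtein(me, candidate);
            if (distance < bestScore) {
                bestScore = distance;
                bestString = candidate;
            }
        }
        return bestString;
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList("kitten", "sitting on the sofa", "mitten");
        System.out.println(closest("kitten!", candidates));
    }
}
```

The check uses >= rather than > because, as in the question's loop, a candidate only replaces the current best when its distance is strictly smaller, so a tie on the lower bound can be skipped too. Sorting the candidates by how close their length is to the reference string would probably let the bound kick in earlier, but that is a guess I have not measured.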