What is the fastest way to compare two strings (quality matters too, but less)?
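I am looking for the most efficient way to compare two strings. Some of the strings I compare are more than 5000 characters long. I am comparing a list of about 80 strings against another list of about 200 strings. It takes forever, even though I am threading it. I use the StringUtils.getLevenshteinDistance(String s, String t) method from Apache Commons. My method is shown below. Is there a better way?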
private void compareMe() {
    List<String> compareStrings = MainController.getInstance().getCompareStrings();
    for (String compare : compareStrings) {
        int levenshteinDistance = StringUtils.getLevenshteinDistance(me, compare);
        if (bestScore > levenshteinDistance
                && levenshteinDistance > -1) {
            bestScore = levenshteinDistance; // global variable
            bestString = compare; // global variable
        }
    }
}
Here is a sample of two strings that should be scored as highly similar:
String 1:
SELECT
CORP_VENDOR_NAME as "Corporate Vendor Name",
CORP_VENDOR_REF_ID as "Reference ID",
MERCHANT_ID as "Merchant ID",
VENDOR_CITY as "City",
VENDOR_STATE as "State",
VENDOR_ZIP as "Zip",
VENDOR_COUNTRY as "Country",
REMIT_VENDOR_NAME as "Remit Name",
REMIT_VENDOR_REF_ID as " Remit Reference ID",
VENDOR_PRI_UNSPSC_CODE as "Primary UNSPSC"
FROM DSS_FIN_USER.ACQ_VENDOR_DIM
WHERE VENDOR_REFERENCE_ID in
(SELECT distinct CORP_VENDOR_REF_ID
FROM DSS_FIN_USER.ACQ_VENDOR_DIM
WHERE CORP_VENDOR_REF_ID = '${request.corp_vendor_id};')
String 2:
SELECT
CORP_VENDOR_NAME as "Corporate Vendor Name",
CORP_VENDOR_REF_ID as "Reference ID",
MERCHANT_ID as "Merchant ID",
VENDOR_CITY as "City",
VENDOR_STATE as "State",
VENDOR_ZIP as "Zip",
VENDOR_COUNTRY as "Country",
REMIT_VENDOR_NAME as "Remit Name",
REMIT_VENDOR_REF_ID as " Remit Reference ID",
VENDOR_PRI_UNSPSC_CODE as "Primary UNSPSC"
FROM DSS_FIN_USER.ACQ_VENDOR_DIM
WHERE VENDOR_REFERENCE_ID in
(SELECT distinct CORP_VENDOR_REF_ID
FROM DSS_FIN_USER.ACQ_VENDOR_DIM
WHERE CORP_VENDOR_REF_ID = 'ACQ-169013')
You will notice that the only difference is the '${request.corp_vendor_id};' at the end of the string. This gives it a score of 26 with the LevenshteinDistance method.
Answer 0 (score: 2)
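You should look for shortcuts in your comparison logic that let you skip some of the computations. Since you want the globally minimal Levenshtein distance, you do not even need to compute the distance for a pair when the difference in string lengths is greater than your current best Levenshtein distance, because the distance between two strings can never be smaller than the difference in their lengths.
For example, if your current best Levenshtein distance is 50, you can skip comparing two strings of lengths 100 and 180, because their Levenshtein distance is at least 80.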
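The length-difference shortcut can be sketched roughly as follows. This is only an illustration under a few assumptions: the class name PrunedLevenshteinSearch and the methods levenshtein and findBest are hypothetical, and a plain two-row dynamic-programming distance stands in for StringUtils.getLevenshteinDistance so that the example has no external dependencies.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; class and method names are hypothetical.
class PrunedLevenshteinSearch {

    // Plain two-row dynamic-programming Levenshtein distance, used here as a
    // dependency-free stand-in for StringUtils.getLevenshteinDistance.
    static int levenshtein(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j; // distance from the empty prefix of s
        }
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[t.length()];
    }

    // Returns the candidate closest to `me`, skipping every candidate whose
    // length difference alone already rules it out.
    static String findBest(String me, List<String> candidates) {
        int bestScore = Integer.MAX_VALUE;
        String bestString = null;
        for (String candidate : candidates) {
            // The distance is never smaller than the length difference.
            int lowerBound = Math.abs(me.length() - candidate.length());
            if (lowerBound >= bestScore) {
                // Ties never replace the current best (strict comparison
                // below), so equality can be skipped as well.
                continue;
            }
            int distance = levenshtein(me, candidate);
            if (distance < bestScore) {
                bestScore = distance;
                bestString = candidate;
            }
        }
        return bestString;
    }

    public static void main(String[] args) {
        List<String> candidates =
                Arrays.asList("kitten", "sitting on the mat", "mitten");
        // The long candidate is skipped without computing its distance.
        System.out.println(findBest("sitten", candidates));
    }
}
```

Sorting the candidates so that those closest in length to the input come first would probably make the bound tighten sooner, although how much that helps depends on the data. Apache Commons Lang 3 also appears to offer a getLevenshteinDistance overload that takes a threshold and returns -1 when the distance exceeds it, which could be combined with the current best score; check the documentation of the version you use before relying on it.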