我想制作一个程序来检测具有3个严重性级别的潜在重复项。 让我认为我的数据只有两列,但有几千行。 第二列中的数据仅用逗号分隔。数据示例:
Number | Material
1 | helmet,valros,42
2 | helmet,iron,knight
3 | valros,helmet,42
4 | knight,helmet
5 | valros,helmet,42
6 | plain,helmet
7 | helmet, leather
我的3个等级是:
非常高: A,B,C vs A,B,C
高: A,B,C vs B,C,A
如此: A,B,C vs A,B
到目前为止我只能做第一级,我不知道如何做第二级和第三级。
我尝试过的。
Sub duplicates_separation()
Dim duplicate(), i As Long
Dim delrange As Range, cell As Long
Dim shtIn As Worksheet, shtOut As Worksheet
Set shtIn = ThisWorkbook.Sheets("input")
Set shtOut = ThisWorkbook.Sheets("output")
x = 2
y = 1
Set delrange = shtIn.Range("b1:b10000") 'set your range here
ReDim duplicate(0)
'search duplicates in 2nd column
For cell = 1 To delrange.Cells.Count
If Application.CountIf(delrange, delrange(cell)) > 1 Then
ReDim Preserve duplicate(i)
duplicate(i) = delrange(cell).Address
i = i + 1
End If
Next
'print duplicates
For i = UBound(duplicate) To LBound(duplicate) Step -1
shtOut.Cells(x, 1).EntireRow.Value = shtIn.Range(duplicate(i)).EntireRow.Value
End Sub
程序检测到的重复项:
3 | valros,helmet,42
5 | valros,helmet,42
我的期望:
Number | Material
1 | helmet,valros,42
3 | valros,helmet,42
5 | valros,helmet,42
4 | knight,helmet
2 | helmet,iron,knight
我有一个检测重复lv 2的想法,但我认为它会如此复杂并使程序变慢。
有没有办法检测第二个&第三级重复?
更新
昨天我去朋友家咨询了这个问题,但他的解决方案是在JAVA languange ..>我不明白
public class ali {
static void sPrint(String[] Printed) {
for (int iC = 0; iC < Printed.length; iC++) {
System.out.print(String.valueOf(Printed[iC]) + " | ");
}
System.out.println();
}
public static void main(String Args[]) {
int defaultLength = 10;
int indexID = 0;
int indexDesc = 1;
String[] DETECTORP1 = new String[defaultLength];
String[] DETECTORP2 = new String[defaultLength];
String[] DETECTORP3 = new String[defaultLength];
String[] DETECTORP4 = new String[defaultLength];
String[][] theString = new String[5][2];
theString[0] = new String[]{"1", "A, B, C, D"};
theString[1] = new String[]{"2", "A, B, C, D"};
theString[2] = new String[]{"3", "A, B, C, D, E"};
theString[3] = new String[]{"4", "A, B, D, C, E"};
theString[4] = new String[]{"5", "A, B, D, C, E, F"};
int P1 = 0;
int P2 = 0;
int P3 = 0;
int P4 = 0;
for (int iC = 0; iC < theString.length; iC++) {
System.out.println(theString[iC][indexID] + " -> " + theString[iC][indexDesc]);
}
for (int iC = 0; iC < theString.length; iC++) {
int LEX;
String theReference[] = theString[iC][indexDesc].replace(",", ";;").split(";;");
for (int iD = 0; iD < theString.length; iD++) {
if (iC != iD) {
String theCompare[] = theString[iD][1].replace(",", ";;").split(";;");
if (theReference.length == theCompare.length) {
LEX=0;
int theLength = theReference.length;
for (int iE = 0; iE < theLength; iE++) {
if (theReference[iE].equals(theCompare[iE])) {
LEX += 1;
}
}
if (LEX == theLength) {
DETECTORP1[P1] = theString[iC][indexID] + " WITH " + theString[iD][indexID];
P1 += 1;
} else {
LEX = 0;
for (int iF = 0; iF < theReference.length; iF++) {
for (int iG = 0; iG < theCompare.length; iG++) {
if (theReference[iF].equals(theCompare[iG])) {
LEX += 1;
break;
}
}
}
if (LEX == theReference.length) {
DETECTORP2[P2] = theString[iC][indexID] + " WITH " + theString[iD][indexID];
P2 += 1;
}
}
} else {
LEX = 0;
if (theReference.length > theCompare.length) {
for (int iF = 0; iF < theReference.length; iF++) {
for (int iG = 0; iG < theCompare.length; iG++) {
if (iG == iF) {
if (theReference[iF].equals(theCompare[iF])) {
LEX += 1;
break;
}
}
}
}
if (LEX <= theReference.length && LEX >= theCompare.length) {
DETECTORP3[P3] = theString[iC][indexID] + " WITH " + theString[iD][indexID];
P3 += 1;
}
} else {
LEX =0;
for (int iF = 0; iF < theCompare.length; iF++) {
for (int iG = 0; iG < theReference.length; iG++) {
if (iG == iF) {
if (theCompare[iF].equals(theReference[iF])) {
LEX += 1;
// System.out.println(theReference[iG] + "==" + theCompare[iG]);
break;
}
}
}
}
if (LEX <= theCompare.length && LEX >= theReference.length) {
DETECTORP3[P3] = theString[iC][indexID] + " WITH " + theString[iD][indexID];
P3 += 1;
}
}
}
}
}
}
sPrint(DETECTORP1);
sPrint(DETECTORP2);
sPrint(DETECTORP3);
}
}
如何在VBA中执行此操作?
答案 0 :(得分:1)
实际上,这取决于您希望如何定义“严重性级别”。这是一种方法,不一定是最好的:使用Levensthein距离。
通过单字符属性符号表示每个项目,例如
H helmet
K knight
I iron
$ Leather
^ Valros
╔ Plain
¢ Whatever
etc.
然后将您的材料列表转换为包含表示这些属性的字符序列的字符串:
HIK = helmet,iron,knight
¢H = plain,helmet
然后计算这两个字符串之间的Levenshtein距离。那将是你的“严重程度”。
Debug.Print LevenshteinDistance("HIK","¢H")
'returns 3
Levenshtein距离的两个实现显示在Wikipedia中。确实你很幸运:StackOverflow ported this to VBA上有人。
在下面的评论部分中,您说您不希望用单字符符号表示每个可能的属性。很公平;我同意这有点傻。解决方法:实际上,可以调整Levenshtein距离算法以查找字符串中的每个字符,而不是查找数组的每个元素,并根据它进行比较。我将在my answer中向您的后续问题展示如何进行此更改。