Question

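I'm developing a Spring Boot project and have to process a large amount of information stored in Solr. I have to compare every stored image with the image the user supplies and establish how similar they are. I started with a LinkedList of images and now use arrays together with a LinkedList, but it is still slow and sometimes doesn't work properly. I'm talking about 11,000,000 images that I have to process. This is my code: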

public LinkedList<Imagen> comparar(Imagen[] lista, Imagen imagen) throws NullPointerException {
    LinkedList<Imagen> resultado = new LinkedList<>();
    for (int i = 0; i < lista.length; i++) {
        if (lista[i].getFacesDetectedQuantity() == imagen.getFacesDetectedQuantity()) {
            lista[i].setSimilitud(3);
        }
        if (herramientas.rangoHue(imagen.getPredominantColor_hue()).equals(herramientas.rangoHue(lista[i].getPredominantColor_hue()))) {
            lista[i].setSimilitud(3);
        }
        if (lista[i].isTransparency() == imagen.isTransparency()) {
            lista[i].setSimilitud(4);
        }
        if (analizar.compareFeature(herramientas.image64ToImage(lista[i].getLarge_thumbnail()), herramientas.image64ToImage(imagen.getLarge_thumbnail())) > 400) {
            lista[i].setSimilitud(3);
        }
        if (analizar.compare_histogram(herramientas.image64ToImage(lista[i].getLarge_thumbnail()), herramientas.image64ToImage(imagen.getLarge_thumbnail())) > 90) {
            lista[i].setSimilitud(3);
        }
        if (lista[i].getSimilitud() > 7) {
            resultado.add(lista[i]);
        }
    }
    return ordenarLista(resultado);
}

public LinkedList<Imagen> ordenarLista(LinkedList<Imagen> lista) {
    LinkedList<Imagen> resultado = new LinkedList<>();
    for (int y = 0; y < lista.size(); y++) {
        Imagen imagen = lista.get(0);
        int posicion = 0;
        for (int x = 0; x < lista.size(); x++) {
            if (lista.get(x).getSimilitud() > imagen.getSimilitud()) {
                imagen = lista.get(x);
                posicion = x;
            }
        }
        resultado.add(imagen);
        lista.remove(posicion);
    }
    return resultado;
}

Which data structure could I use to speed up the processing? I'm also thinking about running each comparison in its own thread, but I don't know how to do that either. A lot of googling turned up nothing. Sorry for my English, and thanks!

Update: I solved the sorting problem. Just ignore my ordenarLista() method; I now sort the list in comparar() right before returning it.
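For readers wondering what such a sort can look like, here is a minimal sketch. It assumes getSimilitud() returns an int and that the most similar images should come first; the Imagen class below is a stripped-down, hypothetical stand-in, not the real class from the question.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class SortSketch {
    // Hypothetical stand-in for the question's Imagen class; only the score matters here.
    static final class Imagen {
        private final String id;
        private final int similitud;

        Imagen(String id, int similitud) {
            this.id = id;
            this.similitud = similitud;
        }

        String getId() { return id; }
        int getSimilitud() { return similitud; }
    }

    // Sorts in place, highest similarity first, using the library's O(n log n) sort.
    static void ordenarPorSimilitud(List<Imagen> resultado) {
        resultado.sort(Comparator.comparingInt(Imagen::getSimilitud).reversed());
    }

    public static void main(String[] args) {
        List<Imagen> resultado = new ArrayList<>();
        resultado.add(new Imagen("a", 10));
        resultado.add(new Imagen("b", 13));
        resultado.add(new Imagen("c", 8));
        ordenarPorSimilitud(resultado);
        for (Imagen img : resultado) {
            System.out.println(img.getId() + " " + img.getSimilitud());
        }
    }
}
```

List.sort works on both ArrayList and LinkedList, so the sketch does not depend on which list type comparar() builds.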

Still working on my algorithm!

Answer 1

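As a general rule, before trying to optimize any part at random, use a monitoring tool such as JVisualVM to detect exactly which calls are expensive. You have to put the effort in the right place.

It should also help to measure the time spent in the first big processing step (everything before ordenarLista()) and in the second one (ordenarLista() itself).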
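As a rough illustration of that kind of measurement, here is a minimal sketch built on System.nanoTime(). The timed helper and the two workloads are made up for the example; in the question's code the two tasks would be the comparison loop and the call to ordenarLista(). A profiler remains the more reliable tool, since hand-rolled timings are easily distorted by JIT warm-up.

```java
import java.util.function.Supplier;

final class TimingSketch {
    // Runs a task, prints how long it took, and returns the task's result.
    static <T> T timed(String label, Supplier<T> task) {
        long start = System.nanoTime();
        T result = task.get();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(label + " took " + elapsedMs + " ms");
        return result;
    }

    public static void main(String[] args) {
        // Stand-in workloads, not the real comparison or sorting code.
        Long sum = timed("comparison loop", () -> {
            long s = 0;
            for (int i = 0; i < 5_000_000; i++) {
                s += i % 7;
            }
            return s;
        });
        String status = timed("sorting", () -> "done");
        System.out.println(sum + " " + status);
    }
}
```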

That said, I noticed a few things:

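1) Very likely a problem: comparar() does a lot of duplicate processing, which can be expensive in terms of CPU.

Look at these two calls:

if (analizar.compareFeature(herramientas.image64ToImage(lista[i].getLarge_thumbnail()), herramientas.image64ToImage(imagen.getLarge_thumbnail())) > 400) {
    lista[i].setSimilitud(3);
}
if (analizar.compare_histogram(herramientas.image64ToImage(lista[i].getLarge_thumbnail()), herramientas.image64ToImage(imagen.getLarge_thumbnail())) > 90) {
    lista[i].setSimilitud(3);
}

For example, you call herramientas.image64ToImage() 4 times on every iteration.

This should be executed once, before the loop:

herramientas.image64ToImage(imagen.getLarge_thumbnail())

But you execute it millions of times inside the loop. Just store the result in a variable before the loop and use that. The same goes for:

herramientas.rangoHue(imagen.getPredominantColor_hue())

Every computation that depends only on the Imagen imagen parameter should be done before the loop rather than inside it, which spares millions of computations.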
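The idea of hoisting loop-invariant work can be sketched as follows. This is not the question's real code: decode and compare are hypothetical stand-ins for herramientas.image64ToImage() and the analizar comparisons, and images are modeled as plain strings so the example runs on its own.

```java
import java.util.ArrayList;
import java.util.List;

final class HoistSketch {
    static int decodeCalls = 0;

    // Hypothetical stand-in for herramientas.image64ToImage(); pretend it is expensive.
    static int[] decode(String thumbnail) {
        decodeCalls++;
        return thumbnail.chars().toArray();
    }

    // Hypothetical stand-in for a comparison such as analizar.compare_histogram().
    static int compare(int[] a, int[] b) {
        int n = Math.min(a.length, b.length);
        int matches = 0;
        for (int i = 0; i < n; i++) {
            if (a[i] == b[i]) {
                matches++;
            }
        }
        return matches;
    }

    static List<String> similar(String[] stored, String query, int threshold) {
        // Depends only on the query, so decode it once, outside the loop.
        int[] queryImage = decode(query);
        List<String> result = new ArrayList<>();
        for (String candidate : stored) {
            // Decode each stored thumbnail once per iteration and reuse the
            // result for every comparison that needs it.
            int[] candidateImage = decode(candidate);
            if (compare(candidateImage, queryImage) >= threshold) {
                result.add(candidate);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] stored = {"abcd", "abzz", "zzzz"};
        System.out.println(similar(stored, "abcd", 2));
        System.out.println("decode calls: " + decodeCalls);
    }
}
```

With n stored images this makes n + 1 decode calls, where the loop in the question makes 4n.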

2) ordenarLista() seems to have a problem: you hard-code the first index here:

Imagen imagen = lista.get(0);

3) ordenarLista() may iterate a great many times:

lista.size() + lista.size() + lista.size()-1 + lista.size() + lista.size()-2 + lista.size() + ... + 1 * lista.size()

Imagine that with 1,000,000 elements. It adds up to millions of iterations...

Answer 2

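If you are using get(int), you should definitely be using an ArrayList, not a LinkedList.

However, it isn't just your data structure, it's your terrible algorithm.

For example, in the ordenarLista() method, lista.get(0) should be lista.get(y), posicion = 0 should be posicion = y, and the inner loop should start at y+1, not zero.

Otherwise the outer loop isn't needed at all.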
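One way to read that suggestion is as an in-place selection sort, where the largest remaining element is swapped into position y instead of being moved to a second list. The sketch below is an interpretation under that assumption, with plain Integer scores standing in for getSimilitud() values. It is still O(n²), so for millions of elements the library sort is the better choice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class SelectionSortSketch {
    // In-place selection sort, largest value first.
    static void sortDescending(List<Integer> lista) {
        for (int y = 0; y < lista.size() - 1; y++) {
            int posicion = y; // index of the best candidate found so far
            for (int x = y + 1; x < lista.size(); x++) {
                if (lista.get(x) > lista.get(posicion)) {
                    posicion = x;
                }
            }
            // Move the maximum of the unsorted tail into slot y.
            Collections.swap(lista, y, posicion);
        }
    }

    public static void main(String[] args) {
        // ArrayList, so that get(int) is a constant-time operation.
        List<Integer> scores = new ArrayList<>(Arrays.asList(10, 13, 8, 13, 9));
        sortDescending(scores);
        System.out.println(scores);
    }
}
```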

Answer 3

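I don't really understand what you are doing, but I think you are searching linearly, and a handful of if statements won't make that any better. Sorting and searching with a BTree, the algorithm every database uses, might be a good idea. As you can see, databases generally do a good job of querying records.

BTree Java example.

In case you don't know what a BTree is: Wikipedia

But never use an actual database to store the images. Reason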

Answer 4

It seems you may need some help with this: java.util.concurrent.Future.

You could try splitting the for loop in public LinkedList<Imagen> comparar(Imagen[] lista, Imagen imagen) with java.util.concurrent.Future and see whether that shortens the processing time.

If it does, you can then do the same with java.util.concurrent.Future for the for loops in public LinkedList<Imagen> ordenarLista(LinkedList<Imagen> lista).
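A minimal sketch of that split, under stated assumptions: the array is cut into one chunk per thread, each chunk is submitted to an ExecutorService, and the Future results are collected in order. compareChunk is a hypothetical stand-in for the per-image work in comparar(), reduced to a threshold test on int scores so the example is self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class FutureSketch {
    // Hypothetical stand-in for the loop body of comparar(): keep scores above a threshold.
    static List<Integer> compareChunk(int[] scores, int from, int to, int threshold) {
        List<Integer> hits = new ArrayList<>();
        for (int i = from; i < to; i++) {
            if (scores[i] > threshold) {
                hits.add(scores[i]);
            }
        }
        return hits;
    }

    static List<Integer> compareInParallel(int[] scores, int threshold, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int chunk = (scores.length + threads - 1) / threads; // ceiling division
            List<Future<List<Integer>>> futures = new ArrayList<>();
            for (int start = 0; start < scores.length; start += chunk) {
                final int from = start;
                final int to = Math.min(start + chunk, scores.length);
                futures.add(pool.submit(() -> compareChunk(scores, from, to, threshold)));
            }
            List<Integer> result = new ArrayList<>();
            for (Future<List<Integer>> future : futures) {
                result.addAll(future.get()); // blocks until that chunk is finished
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        int[] scores = {1, 9, 3, 8, 7, 2};
        System.out.println(compareInParallel(scores, 5, 3));
    }
}
```

Each task only reads the shared array and writes to its own list, so no locking is needed. Note that the code in the question mutates lista[i] inside the loop; that stays safe only as long as every index is handled by exactly one task.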

Answer 5

I suspect a 'List' may not be a good choice for what you are doing. (This answer involves a fair amount of guesswork, as I'm not quite sure what your program is meant to do, but I hope it is still useful.)

If your program tries to detect similar images, there are already a number of algorithms and libraries for comparing image similarity, for example here.

If you don't want to change your approach too much, or if it's not about image similarity, a multi-dimensional index may be something you should look at.

For example, it looks like you calculate certain values for each image (hue, a number of histogram values, number of faces). You can pre-calculate all these values once for each image and then put them in a large vector/array:

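double[] calculateVector(Image image) {
    // Put all image characteristics into a single array.
    // The getters below are placeholders for however each value is computed.
    double[] vector = new double[] {
        image.getHue(),               // predominant hue
        image.getFaceCount(),         // number of faces
        image.getHistogramValue(0),   // histogram value 1
        image.getHistogramValue(1)    // histogram value 2, and so on
    };
    return vector;
}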

This gives you one vector/array per image with, say, 20 'double' values. Then you use a multi-dimensional index, such as a KD-Tree or R*Tree (there are some sample implementations in my own project).

KDTree kdtree = new KDTree(20);  // 20 = number of dimensions per vector
for (Image image : allImages) {
    double[] vector = calculateVector(image);
    kdtree.put(vector, image);
}

Now, if you have a new image and want to find the 5 most similar images, you calculate the vector/array for the new image and then perform a kNN query (k-nearest-neighbor query) on the index.

double[] newImageVector = calculateVector(newImage); 
List result = kdtree.queryKNN(newImageVector, 5);  //method names may vary between implementation

This gives you a list of the 5 most similar images. This is usually very fast: the complexity is about O(log n), and you should be able to execute it several thousand times per second. If you want to know more about multi-dimensional indexing, search the web for 'kNN query'.
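To make clear what a kNN query returns, here is a brute-force baseline. It is deliberately not a KD-Tree: it scans every vector, so each query is O(n), but it produces the same kind of answer an index would and can serve as a reference when testing one. The class and method names are made up for this sketch, and plain Euclidean distance is an assumption.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class KnnSketch {
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Returns the indices of the k vectors closest to the query, nearest first.
    static List<Integer> queryKnn(double[][] vectors, double[] query, int k) {
        Comparator<Integer> byDistance =
                Comparator.comparingDouble((Integer i) -> squaredDistance(vectors[i], query));
        // Max-heap on distance: the head is the farthest of the current best k.
        PriorityQueue<Integer> heap = new PriorityQueue<>(byDistance.reversed());
        for (int i = 0; i < vectors.length; i++) {
            heap.add(i);
            if (heap.size() > k) {
                heap.poll(); // drop the farthest candidate
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        result.sort(byDistance);
        return result;
    }

    public static void main(String[] args) {
        double[][] vectors = {{0, 0}, {1, 1}, {5, 5}, {0.5, 0.5}};
        System.out.println(queryKnn(vectors, new double[] {0, 0}, 2));
    }
}
```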

Which data structure is faster for large amounts of data?