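Find indexOf a byte array within another byte array

Date: 2014-01-24 19:36:17

Tags: java search bytearray

Given a byte array, how can I find within it the position of a (smaller) byte array?

This documentation looks promising, using ArrayUtils, but if I am correct it would only let me find an individual byte within the array to be searched.

(I don't see that it would matter, but just in case: sometimes the search byte array will be regular ASCII characters, other times it will be control characters or extended ASCII characters. So using String operations would not always be appropriate.)

The large array could be between 10 and about 10000 bytes, and the smaller array around 10. In some cases I will have several smaller arrays that I want found in the larger array in a single search. And I will at times want to find the last index of an instance rather than the first.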

10 answers:

Answer 0 (score: 35)

The simplest way would be to compare each element:

public int indexOf(byte[] outerArray, byte[] smallerArray) {
    for (int i = 0; i < outerArray.length - smallerArray.length + 1; ++i) {
        boolean found = true;
        for (int j = 0; j < smallerArray.length; ++j) {
            if (outerArray[i + j] != smallerArray[j]) {
                found = false;
                break;
            }
        }
        if (found) return i;
    }
    return -1;
}

Some tests:

@Test
public void testIndexOf() {
  byte[] outer = {1, 2, 3, 4};
  assertEquals(0, indexOf(outer, new byte[]{1, 2}));
  assertEquals(1, indexOf(outer, new byte[]{2, 3}));
  assertEquals(2, indexOf(outer, new byte[]{3, 4}));
  assertEquals(-1, indexOf(outer, new byte[]{4, 4}));
  assertEquals(-1, indexOf(outer, new byte[]{4, 5}));
  assertEquals(-1, indexOf(outer, new byte[]{4, 5, 6, 7, 8}));
}
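The question also asks for the last occurrence rather than the first. The same element-by-element comparison can be run from the right-hand end. The following is only a minimal sketch: the class name LastIndexOfSketch and the method name lastIndexOf are made up for illustration, and it assumes an empty inner array should match at outer.length (which mirrors String.lastIndexOf("")).

```java
public class LastIndexOfSketch {
    /** Returns the start index of the last occurrence of inner in outer, or -1. */
    public static int lastIndexOf(byte[] outer, byte[] inner) {
        // Start at the right-most position where inner still fits and walk left.
        for (int i = outer.length - inner.length; i >= 0; --i) {
            boolean found = true;
            for (int j = 0; j < inner.length; ++j) {
                if (outer[i + j] != inner[j]) {
                    found = false;
                    break;
                }
            }
            if (found) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] outer = {1, 2, 3, 1, 2};
        System.out.println(lastIndexOf(outer, new byte[]{1, 2})); // 3
        System.out.println(lastIndexOf(outer, new byte[]{2, 3})); // 1
        System.out.println(lastIndexOf(outer, new byte[]{4}));    // -1
    }
}
```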

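As you updated your question: Java strings are UTF-16 strings, they do not care about the extended ASCII set, so you could use string.indexOf(), provided the bytes are converted with a single-byte charset such as ISO-8859-1 so that every byte becomes exactly one char.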
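The string route only preserves positions when each byte maps to exactly one char. Here is a minimal sketch of that idea, assuming ISO-8859-1 as the charset; the class and helper names (Latin1SearchSketch, asLatin1) are invented for this example. It also covers the "last index" case via String.lastIndexOf.

```java
import java.nio.charset.StandardCharsets;

public class Latin1SearchSketch {
    // ISO-8859-1 maps each byte value 0x00..0xFF to exactly one char,
    // so char indices in the string equal byte indices in the array.
    static String asLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static int indexOf(byte[] outer, byte[] inner) {
        return asLatin1(outer).indexOf(asLatin1(inner));
    }

    public static int lastIndexOf(byte[] outer, byte[] inner) {
        return asLatin1(outer).lastIndexOf(asLatin1(inner));
    }

    public static void main(String[] args) {
        // Includes a control byte (0x00) and "extended" bytes (0xE9, 0xFF).
        byte[] outer = {0x41, (byte) 0xE9, 0x00, (byte) 0xFF, 0x41, (byte) 0xE9, 0x00};
        byte[] inner = {(byte) 0xE9, 0x00};
        System.out.println(indexOf(outer, inner));     // 1
        System.out.println(lastIndexOf(outer, inner)); // 5
    }
}
```

The trade-off is an extra copy of both arrays for every search, so for repeated searches over large inputs a direct byte comparison is likely the better fit.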

Answer 1 (score: 22)

Google's Guava provides Bytes.indexOf(byte[] array, byte[] target).

Answer 2 (score: 6)

Is this what you are looking for?

public class KPM {
    /**
     * Search the data byte array for the first occurrence of the byte array pattern within given boundaries.
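     * @param data The array to search in
     * @param start First index in data (inclusive)
     * @param stop End index in data (exclusive), so that stop - start = length
     * @param pattern What is being searched. '*' can be used as wildcard for "ANY character"
     * @return Index of the first match, or -1 if there is none (or if data or pattern is null)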
     */
    public static int indexOf( byte[] data, int start, int stop, byte[] pattern) {
        if( data == null || pattern == null) return -1;

        int[] failure = computeFailure(pattern);

        int j = 0;

        for( int i = start; i < stop; i++) {
            while (j > 0 && ( pattern[j] != '*' && pattern[j] != data[i])) {
                j = failure[j - 1];
            }
            if (pattern[j] == '*' || pattern[j] == data[i]) {
                j++;
            }
            if (j == pattern.length) {
                return i - pattern.length + 1;
            }
        }
        return -1;
    }

    /**
     * Computes the failure function using a boot-strapping process,
     * where the pattern is matched against itself.
     */
    private static int[] computeFailure(byte[] pattern) {
        int[] failure = new int[pattern.length];

        int j = 0;
        for (int i = 1; i < pattern.length; i++) {
            while (j>0 && pattern[j] != pattern[i]) {
                j = failure[j - 1];
            }
            if (pattern[j] == pattern[i]) {
                j++;
            }
            failure[i] = j;
        }

        return failure;
    }
}

Answer 3 (score: 5)

To save you time in testing:

http://helpdesk.objects.com.au/java/search-a-byte-array-for-a-byte-sequence

gives you code that works if you make computeFailure() static:

public class KPM {
    /**
     * Search the data byte array for the first occurrence 
     * of the byte array pattern.
     */
    public static int indexOf(byte[] data, byte[] pattern) {
        int[] failure = computeFailure(pattern);

        int j = 0;

        for (int i = 0; i < data.length; i++) {
            while (j > 0 && pattern[j] != data[i]) {
                j = failure[j - 1];
            }
            if (pattern[j] == data[i]) {
                j++;
            }
            if (j == pattern.length) {
                return i - pattern.length + 1;
            }
        }
        return -1;
    }

    /**
     * Computes the failure function using a boot-strapping process,
     * where the pattern is matched against itself.
     */
    private static int[] computeFailure(byte[] pattern) {
        int[] failure = new int[pattern.length];

        int j = 0;
        for (int i = 1; i < pattern.length; i++) {
            while (j > 0 && pattern[j] != pattern[i]) {
                j = failure[j - 1];
            }
            if (pattern[j] == pattern[i]) {
                j++;
            }
            failure[i] = j;
        }

        return failure;
    }
}

As it is always wise to test the code that you borrow, you may start with:

public class Test {
    public static void main(String[] args) {
        do_test1();
    }
    static void do_test1() {
      String[] ss = { "",
                    "\r\n\r\n",
                    "\n\n",
                    "\r\n\r\nthis is a test",
                    "this is a test\r\n\r\n",
                    "this is a test\r\n\r\nthis si a test",
                    "this is a test\r\n\r\nthis si a test\r\n\r\n",
                    "this is a test\n\r\nthis si a test",
                    "this is a test\r\nthis si a test\r\n\r\n",
                    "this is a test"
                };
      for (String s: ss) {
        System.out.println(""+KPM.indexOf(s.getBytes(), "\r\n\r\n".getBytes())+"in ["+s+"]");
      }

    }
}

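Answer 4 (score: 2)

Java strings are composed of 16-bit chars, not of 8-bit bytes. A char can hold a byte, so you can always turn your byte arrays into strings and use indexOf: ASCII characters, control characters, and even zero characters will work fine.

Here is a demo: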

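byte[] big = new byte[] {1,2,3,0,4,5,6,7,0,8,9,0,0,1,2,3,4};
byte[] small = new byte[] {7,0,8,9,0,0,1};
// ISO-8859-1 maps every byte to exactly one char; decoding as UTF-8 would
// mangle bytes >= 0x80 that are not valid UTF-8 and shift the indices.
String bigStr = new String(big, StandardCharsets.ISO_8859_1);
String smallStr = new String(small, StandardCharsets.ISO_8859_1);
System.out.println(bigStr.indexOf(smallStr));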

This prints 7.

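However, considering that your large array could be up to 10,000 bytes and the small array is only about ten bytes, this solution may not be the most efficient, for two reasons:

  • It requires copying your big array into an array that is twice as large (same capacity, but with char instead of byte). This triples your memory requirements.
  • Java's string search algorithm is not the fastest one available. You may get a noticeable speed-up if you implement one of the advanced algorithms (for example, Knuth–Morris–Pratt). This could potentially bring the execution time down by a factor of up to ten (the length of the small string), and it requires additional memory that is proportional to the length of the small string, not the big string.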

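Answer 5 (score: 2)

Using the Knuth–Morris–Pratt algorithm is the most efficient way: it runs in time linear in the combined length of the two arrays.

StreamSearcher.java is an implementation of it and is part of Twitter's elephant-bird project.

It is not recommended to include this library, since it is rather sizable for using just a single class.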
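For readers who would rather not pull in a whole library, the idea of running Knuth–Morris–Pratt over a stream can be sketched in a few lines. The code below is not the elephant-bird StreamSearcher; it is a hypothetical, simplified class (StreamKmpSketch, with invented method names search and indexOf) written under the assumption that the pattern is non-empty and that the caller only needs the first match.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public class StreamKmpSketch {
    private final byte[] pattern;
    private final int[] failure;

    public StreamKmpSketch(byte[] pattern) {
        if (pattern.length == 0) throw new IllegalArgumentException("empty pattern");
        this.pattern = pattern.clone();
        this.failure = new int[pattern.length];
        // failure[i] = length of the longest proper prefix of pattern[0..i]
        // that is also a suffix of it.
        int j = 0;
        for (int i = 1; i < pattern.length; i++) {
            while (j > 0 && pattern[j] != pattern[i]) j = failure[j - 1];
            if (pattern[j] == pattern[i]) j++;
            failure[i] = j;
        }
    }

    /**
     * Consumes the stream up to and including the first match.
     * Returns the number of bytes consumed, or -1 if the stream ends first.
     */
    public long search(InputStream in) throws IOException {
        long consumed = 0;
        int j = 0; // number of pattern bytes currently matched
        int b;
        while ((b = in.read()) != -1) {
            consumed++;
            while (j > 0 && pattern[j] != (byte) b) j = failure[j - 1];
            if (pattern[j] == (byte) b) j++;
            if (j == pattern.length) return consumed;
        }
        return -1;
    }

    /** Convenience wrapper for in-memory data; returns the match start or -1. */
    public static int indexOf(byte[] data, byte[] pattern) {
        try {
            long end = new StreamKmpSketch(pattern).search(new ByteArrayInputStream(data));
            return end < 0 ? -1 : (int) (end - pattern.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for in-memory streams
        }
    }

    public static void main(String[] args) {
        byte[] data = {1, 2, 1, 2, 1, 3, 4};
        System.out.println(indexOf(data, new byte[]{1, 2, 1, 3})); // 2
        System.out.println(indexOf(data, new byte[]{3, 3}));       // -1
    }
}
```

Because the matcher never needs to look back at earlier input, it works on data that cannot be rewound; the only state kept between bytes is the count of pattern bytes matched so far.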

Answer 6 (score: 1)

package org.example;

import java.util.List;

import org.riversun.finbin.BinarySearcher;

public class Sample2 {

    public static void main(String[] args) throws Exception {

        BinarySearcher bs = new BinarySearcher();

        // UTF-8 without BOM
        byte[] srcBytes = "Hello world.It's a small world.".getBytes("utf-8");

        byte[] searchBytes = "world".getBytes("utf-8");

        List<Integer> indexList = bs.searchBytes(srcBytes, searchBytes);

        System.out.println("indexList=" + indexList);
    }
 }

This results in:

indexList=[6, 25]

So you can find the indexes of a byte[] within a byte[].

Example on GitHub: https://github.com/riversun/finbin
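If adding a dependency is not an option, collecting every match position can also be sketched with the standard library alone. This is not how finbin is implemented; it is a plain element-by-element scan under two stated assumptions: overlapping matches are all reported, and an empty target yields an empty list. The names AllOccurrencesSketch and indexesOf are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class AllOccurrencesSketch {
    /** Returns the start index of every (possibly overlapping) occurrence. */
    public static List<Integer> indexesOf(byte[] source, byte[] target) {
        List<Integer> result = new ArrayList<>();
        if (target.length == 0) return result; // treat an empty target as "no match"
        for (int i = 0; i + target.length <= source.length; i++) {
            int j = 0;
            while (j < target.length && source[i + j] == target[j]) j++;
            if (j == target.length) result.add(i);
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] src = "Hello world.It's a small world.".getBytes(StandardCharsets.UTF_8);
        byte[] search = "world".getBytes(StandardCharsets.UTF_8);
        System.out.println("indexList=" + indexesOf(src, search)); // indexList=[6, 25]
    }
}
```

The first element of the list is the first index and the last element is the last index, so one pass answers both variants of the question.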

Answer 7 (score: 1)

Copied almost verbatim from java.lang.String:

indexOf(char[], int, int, char[], int, int, int)

static int indexOf(byte[] source, int sourceOffset, int sourceCount, byte[] target, int targetOffset, int targetCount, int fromIndex) {
    if (fromIndex >= sourceCount) {
        return (targetCount == 0 ? sourceCount : -1);
    }
    if (fromIndex < 0) {
        fromIndex = 0;
    }
    if (targetCount == 0) {
        return fromIndex;
    }

    byte first = target[targetOffset];
    int max = sourceOffset + (sourceCount - targetCount);

    for (int i = sourceOffset + fromIndex; i <= max; i++) {
        /* Look for first character. */
        if (source[i] != first) {
            while (++i <= max && source[i] != first)
                ;
        }

        /* Found first character, now look at the rest of v2 */
        if (i <= max) {
            int j = i + 1;
            int end = j + targetCount - 1;
            for (int k = targetOffset + 1; j < end && source[j] == target[k]; j++, k++)
                ;

            if (j == end) {
                /* Found whole string. */
                return i - sourceOffset;
            }
        }
    }
    return -1;
}

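Answer 8 (score: 1)

Several (or all?) of the examples posted here failed some unit tests, so I am posting my version here along with those tests. All unit tests are based on the requirement that Java's String.indexOf() always gives us the right answer!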

// The Knuth, Morris, and Pratt string searching algorithm remembers information about
// the past matched characters instead of matching a character with a different pattern
// character over and over again. It can search for a pattern in O(n) time as it never
// re-compares a text symbol that has matched a pattern symbol. But, it does use a partial
// match table to analyze the pattern structure. Construction of a partial match table
// takes O(m) time. Therefore, the overall time complexity of the KMP algorithm is O(m + n).

public class KMPSearch {

    public static int indexOf(byte[] haystack, byte[] needle)
    {
        // needle is null or empty
        if (needle == null || needle.length == 0)
            return 0;

        // haystack is null, or haystack's length is less than that of needle
        if (haystack == null || needle.length > haystack.length)
            return -1;

        // pre construct failure array for needle pattern
        int[] failure = new int[needle.length];
        int n = needle.length;
        failure[0] = -1;
        for (int j = 1; j < n; j++)
        {
            int i = failure[j - 1];
            while ((needle[j] != needle[i + 1]) && i >= 0)
                i = failure[i];
            if (needle[j] == needle[i + 1])
                failure[j] = i + 1;
            else
                failure[j] = -1;
        }

        // find match
        int i = 0, j = 0;
        int haystackLen = haystack.length;
        int needleLen = needle.length;
        while (i < haystackLen && j < needleLen)
        {
            if (haystack[i] == needle[j])
            {
                i++;
                j++;
            }
            else if (j == 0)
                i++;
            else
                j = failure[j - 1] + 1;
        }
        return ((j == needleLen) ? (i - needleLen) : -1);
    }
}



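import java.util.Random;

// The annotations and assertions below assume JUnit 4.
import org.junit.Test;
import static org.junit.Assert.assertEquals;

public class KMPSearchTest {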
    private static Random random = new Random();
    private static String alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";

    @Test
    public void testEmpty() {
        test("", "");
        test("", "ab");
    }

    @Test
    public void testOneChar() {
        test("a", "a");
        test("a", "b");
    }

    @Test
    public void testRepeat() {
        test("aaa", "aaaaa");
        test("aaa", "abaaba");
        test("abab", "abacababc");
        test("abab", "babacaba");
    }

    @Test
    public void testPartialRepeat() {
        test("aaacaaaaac", "aaacacaacaaacaaaacaaaaac");
        test("ababcababdabababcababdaba", "ababcababdabababcababdaba");
    }

    @Test
    public void testRandomly() {
        for (int i = 0; i < 1000; i++) {
            String pattern = randomPattern();
            for (int j = 0; j < 100; j++)
                test(pattern, randomText(pattern));
        }
    }

    /* Helper functions */
    private static String randomPattern() {
        StringBuilder sb = new StringBuilder();
        int steps = random.nextInt(10) + 1;
        for (int i = 0; i < steps; i++) {
            if (sb.length() == 0 || random.nextBoolean()) {  // Add literal
                int len = random.nextInt(5) + 1;
                for (int j = 0; j < len; j++)
                    sb.append(alphabet.charAt(random.nextInt(alphabet.length())));
            } else {  // Repeat prefix
                int len = random.nextInt(sb.length()) + 1;
                int reps = random.nextInt(3) + 1;
                if (sb.length() + len * reps > 1000)
                    break;
                for (int j = 0; j < reps; j++)
                    sb.append(sb.substring(0, len));
            }
        }
        return sb.toString();
    }

    private static String randomText(String pattern) {
        StringBuilder sb = new StringBuilder();
        int steps = random.nextInt(100);
        for (int i = 0; i < steps && sb.length() < 10000; i++) {
            if (random.nextDouble() < 0.7) {  // Add prefix of pattern
                int len = random.nextInt(pattern.length()) + 1;
                sb.append(pattern.substring(0, len));
            } else {  // Add literal
                int len = random.nextInt(30) + 1;
                for (int j = 0; j < len; j++)
                    sb.append(alphabet.charAt(random.nextInt(alphabet.length())));
            }
        }
        return sb.toString();
    }

    private static void test(String pattern, String text) {
        try {
            assertEquals(text.indexOf(pattern), KMPSearch.indexOf(text.getBytes(), pattern.getBytes()));
        } catch (AssertionError e) {
            System.out.println("FAILED -> Unable to find '" + pattern + "' in '" + text + "'");
            throw e; // rethrow so the failure is reported instead of silently swallowed
        }
    }
}

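Answer 9 (score: 0)

For a small HTTP server I am currently working on, I came up with the following code to find boundaries in a multipart/form-data request. I was hoping to find a better solution here, but I guess I'll stick with it. I think it is about as efficient as it can get (quite fast and uses little memory). It uses the input bytes as a ring buffer, reads the next byte as soon as the buffer does not match the boundary, and writes the data to the output stream after the first full cycle. Of course it can be changed to work on byte arrays instead of streams, as asked in the question.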

    private boolean multipartUploadParseOutput(InputStream is, OutputStream os, String boundary)
    {
        try
        {
            String n = "--"+boundary;
            byte[] bc = n.getBytes("UTF-8");
            int s = bc.length;
            byte[] b = new byte[s];
            int p = 0;
            long l = 0;
            int c;
            boolean r;
            while ((c = is.read()) != -1)
            {
                b[p] = (byte) c;
                l += 1;
                p = (int) (l % s);
                if (l>p)
                {
                    r = true;
                    for (int i = 0; i < s; i++)
                    {
                        if (b[(p + i) % s] != bc[i])
                        {
                            r = false;
                            break;
                        }
                    }
                    if (r)
                        break;
                    os.write(b[p]);
                }
            }
            os.flush();
            return true;
        } catch(IOException e) {e.printStackTrace();}
        return false;
    }