给定一个字节数组,如何在其中找到(较小的)字节数组的位置?
This documentation看起来很有希望,使用ArrayUtils
,但如果我是正确的,它只会让我在数组中找到要搜索的单个字节。
(我看不出它有关系,但以防万一:有时搜索字节数组将是常规ASCII字符,有时它将是控制字符或扩展的ASCII字符。因此使用字符串操作并不总是合适的)
大数组可能在10到10000个字节之间,较小的数组可能在10左右。在某些情况下,我会在单个搜索中在较大的数组中找到几个较小的数组。而且我有时想要找到实例的最后一个索引而不是第一个索引。
答案 0 :(得分:35)
最简单的方法是比较每个元素:
public int indexOf(byte[] outerArray, byte[] smallerArray) {
for(int i = 0; i < outerArray.length - smallerArray.length+1; ++i) {
boolean found = true;
for(int j = 0; j < smallerArray.length; ++j) {
if (outerArray[i+j] != smallerArray[j]) {
found = false;
break;
}
}
if (found) return i;
}
return -1;
}
一些测试:
@Test
public void testIndexOf() {
byte[] outer = {1, 2, 3, 4};
assertEquals(0, indexOf(outer, new byte[]{1, 2}));
assertEquals(1, indexOf(outer, new byte[]{2, 3}));
assertEquals(2, indexOf(outer, new byte[]{3, 4}));
assertEquals(-1, indexOf(outer, new byte[]{4, 4}));
assertEquals(-1, indexOf(outer, new byte[]{4, 5}));
assertEquals(-1, indexOf(outer, new byte[]{4, 5, 6, 7, 8}));
}
当您更新问题时:Java字符串是UTF-16字符串,它们不关心扩展的ASCII集,因此您可以使用string.indexOf()
答案 1 :(得分:22)
Google的Guava提供了Bytes.indexOf(byte []数组,byte []目标)。
答案 2 :(得分:6)
这是你在找什么?
public class KPM {
/**
* Search the data byte array for the first occurrence of the byte array pattern within given boundaries.
* @param data
* @param start First index in data
* @param stop Last index in data so that stop-start = length
* @param pattern What is being searched. '*' can be used as wildcard for "ANY character"
* @return
*/
public static int indexOf( byte[] data, int start, int stop, byte[] pattern) {
if( data == null || pattern == null) return -1;
int[] failure = computeFailure(pattern);
int j = 0;
for( int i = start; i < stop; i++) {
while (j > 0 && ( pattern[j] != '*' && pattern[j] != data[i])) {
j = failure[j - 1];
}
if (pattern[j] == '*' || pattern[j] == data[i]) {
j++;
}
if (j == pattern.length) {
return i - pattern.length + 1;
}
}
return -1;
}
/**
* Computes the failure function using a boot-strapping process,
* where the pattern is matched against itself.
*/
private static int[] computeFailure(byte[] pattern) {
int[] failure = new int[pattern.length];
int j = 0;
for (int i = 1; i < pattern.length; i++) {
while (j>0 && pattern[j] != pattern[i]) {
j = failure[j - 1];
}
if (pattern[j] == pattern[i]) {
j++;
}
failure[i] = j;
}
return failure;
}
}
答案 3 :(得分:5)
节省测试时间:
http://helpdesk.objects.com.au/java/search-a-byte-array-for-a-byte-sequence
为您提供的代码在您使 computeFailure()静态时有效:
public class KPM {
/**
* Search the data byte array for the first occurrence
* of the byte array pattern.
*/
public static int indexOf(byte[] data, byte[] pattern) {
int[] failure = computeFailure(pattern);
int j = 0;
for (int i = 0; i < data.length; i++) {
while (j > 0 && pattern[j] != data[i]) {
j = failure[j - 1];
}
if (pattern[j] == data[i]) {
j++;
}
if (j == pattern.length) {
return i - pattern.length + 1;
}
}
return -1;
}
/**
* Computes the failure function using a boot-strapping process,
* where the pattern is matched against itself.
*/
private static int[] computeFailure(byte[] pattern) {
int[] failure = new int[pattern.length];
int j = 0;
for (int i = 1; i < pattern.length; i++) {
while (j>0 && pattern[j] != pattern[i]) {
j = failure[j - 1];
}
if (pattern[j] == pattern[i]) {
j++;
}
failure[i] = j;
}
return failure;
}
}
由于测试您借用的代码总是明智的,您可以从:
开始public class Test {
public static void main(String[] args) {
do_test1();
}
static void do_test1() {
String[] ss = { "",
"\r\n\r\n",
"\n\n",
"\r\n\r\nthis is a test",
"this is a test\r\n\r\n",
"this is a test\r\n\r\nthis si a test",
"this is a test\r\n\r\nthis si a test\r\n\r\n",
"this is a test\n\r\nthis si a test",
"this is a test\r\nthis si a test\r\n\r\n",
"this is a test"
};
for (String s: ss) {
System.out.println(""+KPM.indexOf(s.getBytes(), "\r\n\r\n".getBytes())+"in ["+s+"]");
}
}
}
答案 4 :(得分:2)
Java字符串由16位char
组成,而不是由8位byte
组成。 char
可以保存byte
,因此您始终可以将字节数组转换为字符串,并使用indexOf
:ASCII字符,控制字符甚至零字符都可以正常工作。
这是一个演示:
byte[] big = new byte[] {1,2,3,0,4,5,6,7,0,8,9,0,0,1,2,3,4};
byte[] small = new byte[] {7,0,8,9,0,0,1};
String bigStr = new String(big, StandardCharsets.UTF_8);
String smallStr = new String(small, StandardCharsets.UTF_8);
System.out.println(bigStr.indexOf(smallStr));
但是,考虑到你的大型数组可能高达10,000个字节,而小数组只有10个字节,这个解决方案可能不是最有效的,原因有两个:
char
而不是byte
)。这会使你的记忆需求增加三倍。答案 5 :(得分:2)
使用Knuth–Morris–Pratt algorithm
是最有效的方法。
StreamSearcher.java
是它的一个实现,是Twitter
的{{1}}项目的一部分。
建议不要包含这个库,因为只使用一个类就相当大。
elephant-bird
答案 6 :(得分:1)
package org.example;
import java.util.List;
import org.riversun.finbin.BinarySearcher;
public class Sample2 {
public static void main(String[] args) throws Exception {
BinarySearcher bs = new BinarySearcher();
// UTF-8 without BOM
byte[] srcBytes = "Hello world.It's a small world.".getBytes("utf-8");
byte[] searchBytes = "world".getBytes("utf-8");
List<Integer> indexList = bs.searchBytes(srcBytes, searchBytes);
System.out.println("indexList=" + indexList);
}
}
因此导致
indexList=[6, 25]
所以,你可以在byte []
中找到byte []的索引Github上的示例:https://github.com/riversun/finbin
答案 7 :(得分:1)
从java.lang.String复制几乎相同的内容。
indexOf(char[],int,int,char[]int,int,int)
static int indexOf(byte[] source, int sourceOffset, int sourceCount, byte[] target, int targetOffset, int targetCount, int fromIndex) {
if (fromIndex >= sourceCount) {
return (targetCount == 0 ? sourceCount : -1);
}
if (fromIndex < 0) {
fromIndex = 0;
}
if (targetCount == 0) {
return fromIndex;
}
byte first = target[targetOffset];
int max = sourceOffset + (sourceCount - targetCount);
for (int i = sourceOffset + fromIndex; i <= max; i++) {
/* Look for first character. */
if (source[i] != first) {
while (++i <= max && source[i] != first)
;
}
/* Found first character, now look at the rest of v2 */
if (i <= max) {
int j = i + 1;
int end = j + targetCount - 1;
for (int k = targetOffset + 1; j < end && source[j] == target[k]; j++, k++)
;
if (j == end) {
/* Found whole string. */
return i - sourceOffset;
}
}
}
return -1;
}
答案 8 :(得分:1)
此处发布的几个(或全部?)示例未通过某些单元测试,因此我将我的版本与上述测试一起发布到此处。所有单元测试都基于 Java 的 String.indexOf() 总是给我们正确答案的要求!
// The Knuth, Morris, and Pratt string searching algorithm remembers information about
// the past matched characters instead of matching a character with a different pattern
// character over and over again. It can search for a pattern in O(n) time as it never
// re-compares a text symbol that has matched a pattern symbol. But, it does use a partial
// match table to analyze the pattern structure. Construction of a partial match table
// takes O(m) time. Therefore, the overall time complexity of the KMP algorithm is O(m + n).
public class KMPSearch {
public static int indexOf(byte[] haystack, byte[] needle)
{
// needle is null or empty
if (needle == null || needle.length == 0)
return 0;
// haystack is null, or haystack's length is less than that of needle
if (haystack == null || needle.length > haystack.length)
return -1;
// pre construct failure array for needle pattern
int[] failure = new int[needle.length];
int n = needle.length;
failure[0] = -1;
for (int j = 1; j < n; j++)
{
int i = failure[j - 1];
while ((needle[j] != needle[i + 1]) && i >= 0)
i = failure[i];
if (needle[j] == needle[i + 1])
failure[j] = i + 1;
else
failure[j] = -1;
}
// find match
int i = 0, j = 0;
int haystackLen = haystack.length;
int needleLen = needle.length;
while (i < haystackLen && j < needleLen)
{
if (haystack[i] == needle[j])
{
i++;
j++;
}
else if (j == 0)
i++;
else
j = failure[j - 1] + 1;
}
return ((j == needleLen) ? (i - needleLen) : -1);
}
}
import java.util.Random;
class KMPSearchTest {
private static Random random = new Random();
private static String alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
@Test
public void testEmpty() {
test("", "");
test("", "ab");
}
@Test
public void testOneChar() {
test("a", "a");
test("a", "b");
}
@Test
public void testRepeat() {
test("aaa", "aaaaa");
test("aaa", "abaaba");
test("abab", "abacababc");
test("abab", "babacaba");
}
@Test
public void testPartialRepeat() {
test("aaacaaaaac", "aaacacaacaaacaaaacaaaaac");
test("ababcababdabababcababdaba", "ababcababdabababcababdaba");
}
@Test
public void testRandomly() {
for (int i = 0; i < 1000; i++) {
String pattern = randomPattern();
for (int j = 0; j < 100; j++)
test(pattern, randomText(pattern));
}
}
/* Helper functions */
private static String randomPattern() {
StringBuilder sb = new StringBuilder();
int steps = random.nextInt(10) + 1;
for (int i = 0; i < steps; i++) {
if (sb.length() == 0 || random.nextBoolean()) { // Add literal
int len = random.nextInt(5) + 1;
for (int j = 0; j < len; j++)
sb.append(alphabet.charAt(random.nextInt(alphabet.length())));
} else { // Repeat prefix
int len = random.nextInt(sb.length()) + 1;
int reps = random.nextInt(3) + 1;
if (sb.length() + len * reps > 1000)
break;
for (int j = 0; j < reps; j++)
sb.append(sb.substring(0, len));
}
}
return sb.toString();
}
private static String randomText(String pattern) {
StringBuilder sb = new StringBuilder();
int steps = random.nextInt(100);
for (int i = 0; i < steps && sb.length() < 10000; i++) {
if (random.nextDouble() < 0.7) { // Add prefix of pattern
int len = random.nextInt(pattern.length()) + 1;
sb.append(pattern.substring(0, len));
} else { // Add literal
int len = random.nextInt(30) + 1;
for (int j = 0; j < len; j++)
sb.append(alphabet.charAt(random.nextInt(alphabet.length())));
}
}
return sb.toString();
}
private static void test(String pattern, String text) {
try {
assertEquals(text.indexOf(pattern), KMPSearch.indexOf(text.getBytes(), pattern.getBytes()));
} catch (AssertionError e) {
System.out.println("FAILED -> Unable to find '" + pattern + "' in '" + text + "'");
}
}
}
答案 9 :(得分:0)
对于我当前正在使用的一个小型HTTP服务器,我想出了以下代码来查找multipart / form-data请求中的边界。希望在这里找到更好的解决方案,但我想我会坚持下去。我认为这是可以达到的效果(相当快并且不使用太多内存)。它使用输入字节作为环形缓冲区,在边界不匹配时立即读取下一个字节,并将第一个完整周期后的数据写入输出流。当然,可以按照问题中的说明,将其更改为字节数组而不是流。
private boolean multipartUploadParseOutput(InputStream is, OutputStream os, String boundary)
{
try
{
String n = "--"+boundary;
byte[] bc = n.getBytes("UTF-8");
int s = bc.length;
byte[] b = new byte[s];
int p = 0;
long l = 0;
int c;
boolean r;
while ((c = is.read()) != -1)
{
b[p] = (byte) c;
l += 1;
p = (int) (l % s);
if (l>p)
{
r = true;
for (int i = 0; i < s; i++)
{
if (b[(p + i) % s] != bc[i])
{
r = false;
break;
}
}
if (r)
break;
os.write(b[p]);
}
}
os.flush();
return true;
} catch(IOException e) {e.printStackTrace();}
return false;
}