过滤(搜索和替换)InputStream中的字节数组

时间:2011-10-12 16:44:32

标签: java input bytearray

我有一个InputStream,它将html文件作为输入参数。我必须从输入流中获取字节。

我有一个字符串:"XYZ"。我想将此字符串转换为字节格式,并检查我从InputStream获取的字节序列中是否存在匹配项。如果有,那么我必须用匹配序列替换匹配的其他字符串。

有没有人可以帮我这个?我使用正则表达式来查找和替换。但是发现并替换字节流,我不知道。

以前,我使用jsoup来解析html并替换字符串,但由于某些utf编码问题,当我这样做时,该文件似乎已损坏。

TL; DR:我的问题是:

是一种在Java中的原始InputStream中以字节格式查找和替换字符串的方法吗?

6 个答案:

答案 0 :(得分:28)

不确定您是否选择了解决问题的最佳方法。

那就是说,我不喜欢(并且有政策不要)用“不要”来回答问题,所以这里就是......

查看FilterInputStream

来自文档:

  

FilterInputStream包含一些其他输入流,它用作其基本数据源,可能沿途转换数据或提供其他功能。


编写它是一项有趣的练习。以下是您的完整示例:

import java.io.*;
import java.util.*;

class ReplacingInputStream extends FilterInputStream {

    LinkedList<Integer> inQueue = new LinkedList<Integer>();
    LinkedList<Integer> outQueue = new LinkedList<Integer>();
    final byte[] search, replacement;

    protected ReplacingInputStream(InputStream in,
                                   byte[] search,
                                   byte[] replacement) {
        super(in);
        this.search = search;
        this.replacement = replacement;
    }

    private boolean isMatchFound() {
        Iterator<Integer> inIter = inQueue.iterator();
        for (int i = 0; i < search.length; i++)
            if (!inIter.hasNext() || search[i] != inIter.next())
                return false;
        return true;
    }

    private void readAhead() throws IOException {
        // Work up some look-ahead.
        while (inQueue.size() < search.length) {
            int next = super.read();
            inQueue.offer(next);
            if (next == -1)
                break;
        }
    }

    @Override
    public int read() throws IOException {    
        // Next byte already determined.
        if (outQueue.isEmpty()) {
            readAhead();

            if (isMatchFound()) {
                for (int i = 0; i < search.length; i++)
                    inQueue.remove();

                for (byte b : replacement)
                    outQueue.offer((int) b);
            } else
                outQueue.add(inQueue.remove());
        }

        return outQueue.remove();
    }

    // TODO: Override the other read methods.
}

使用示例

class Test {
    public static void main(String[] args) throws Exception {

        byte[] bytes = "hello xyz world.".getBytes("UTF-8");

        ByteArrayInputStream bis = new ByteArrayInputStream(bytes);

        byte[] search = "xyz".getBytes("UTF-8");
        byte[] replacement = "abc".getBytes("UTF-8");

        InputStream ris = new ReplacingInputStream(bis, search, replacement);

        ByteArrayOutputStream bos = new ByteArrayOutputStream();

        int b;
        while (-1 != (b = ris.read()))
            bos.write(b);

        System.out.println(new String(bos.toByteArray()));

    }
}

给定字符串"Hello xyz world"的字节,它打印:

Hello abc world

答案 1 :(得分:4)

以下方法可行,但我对性能的影响不大。

  1. 使用InputStream
  2. 包裹InputStreamReader
  3. InputStreamReader换成FilterReader替换字符串,然后
  4. 使用FilterReader
  5. 包裹ReaderInputStream

    选择适当的编码至关重要,否则流的内容将被破坏。

    如果您想使用正则表达式替换字符串,那么您可以使用我的工具Streamflyer,这是FilterReader的一种方便的替代方法。您将在Streamflyer的网页上找到字节流的示例。希望这会有所帮助。

答案 2 :(得分:4)

我也需要这样的东西,并决定推出自己的解决方案,而不是使用@aioobe上面的例子。看看code。您可以从maven中心提取库,或者只复制源代码。

这就是你如何使用它。在这种情况下,我使用嵌套实例替换两个模式,两个修复dos和mac行结尾。

new ReplacingInputStream(new ReplacingInputStream(is, "\n\r", "\n"), "\r", "\n");

这里是完整的源代码:

/**
 * Simple FilterInputStream that can replace occurrances of bytes with something else.
 */
public class ReplacingInputStream extends FilterInputStream {

    // while matching, this is where the bytes go.
    int[] buf=null;
    int matchedIndex=0;
    int unbufferIndex=0;
    int replacedIndex=0;

    private final byte[] pattern;
    private final byte[] replacement;
    private State state=State.NOT_MATCHED;

    // simple state machine for keeping track of what we are doing
    private enum State {
        NOT_MATCHED,
        MATCHING,
        REPLACING,
        UNBUFFER
    }

    /**
     * @param is input
     * @return nested replacing stream that replaces \n\r (DOS) and \r (MAC) line endings with UNIX ones "\n".
     */
    public static InputStream newLineNormalizingInputStream(InputStream is) {
        return new ReplacingInputStream(new ReplacingInputStream(is, "\n\r", "\n"), "\r", "\n");
    }

    /**
     * Replace occurances of pattern in the input. Note: input is assumed to be UTF-8 encoded. If not the case use byte[] based pattern and replacement.
     * @param in input
     * @param pattern pattern to replace.
     * @param replacement the replacement or null
     */
    public ReplacingInputStream(InputStream in, String pattern, String replacement) {
        this(in,pattern.getBytes(StandardCharsets.UTF_8), replacement==null ? null : replacement.getBytes(StandardCharsets.UTF_8));
    }

    /**
     * Replace occurances of pattern in the input.
     * @param in input
     * @param pattern pattern to replace
     * @param replacement the replacement or null
     */
    public ReplacingInputStream(InputStream in, byte[] pattern, byte[] replacement) {
        super(in);
        Validate.notNull(pattern);
        Validate.isTrue(pattern.length>0, "pattern length should be > 0", pattern.length);
        this.pattern = pattern;
        this.replacement = replacement;
        // we will never match more than the pattern length
        buf = new int[pattern.length];
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        // copy of parent logic; we need to call our own read() instead of super.read(), which delegates instead of calling our read
        if (b == null) {
            throw new NullPointerException();
        } else if (off < 0 || len < 0 || len > b.length - off) {
            throw new IndexOutOfBoundsException();
        } else if (len == 0) {
            return 0;
        }

        int c = read();
        if (c == -1) {
            return -1;
        }
        b[off] = (byte)c;

        int i = 1;
        try {
            for (; i < len ; i++) {
                c = read();
                if (c == -1) {
                    break;
                }
                b[off + i] = (byte)c;
            }
        } catch (IOException ee) {
        }
        return i;

    }

    @Override
    public int read(byte[] b) throws IOException {
        // call our own read
        return read(b, 0, b.length);
    }

    @Override
    public int read() throws IOException {
        // use a simple state machine to figure out what we are doing
        int next;
        switch (state) {
        case NOT_MATCHED:
            // we are not currently matching, replacing, or unbuffering
            next=super.read();
            if(pattern[0] == next) {
                // clear whatever was there
                buf=new int[pattern.length]; // clear whatever was there
                // make sure we start at 0
                matchedIndex=0;

                buf[matchedIndex++]=next;
                if(pattern.length == 1) {
                    // edgecase when the pattern length is 1 we go straight to replacing
                    state=State.REPLACING;
                    // reset replace counter
                    replacedIndex=0;
                } else {
                    // pattern of length 1
                    state=State.MATCHING;
                }
                // recurse to continue matching
                return read();
            } else {
                return next;
            }
        case MATCHING:
            // the previous bytes matched part of the pattern
            next=super.read();
            if(pattern[matchedIndex]==next) {
                buf[matchedIndex++]=next;
                if(matchedIndex==pattern.length) {
                    // we've found a full match!
                    if(replacement==null || replacement.length==0) {
                        // the replacement is empty, go straight to NOT_MATCHED
                        state=State.NOT_MATCHED;
                        matchedIndex=0;
                    } else {
                        // start replacing
                        state=State.REPLACING;
                        replacedIndex=0;
                    }
                }
            } else {
                // mismatch -> unbuffer
                buf[matchedIndex++]=next;
                state=State.UNBUFFER;
                unbufferIndex=0;
            }
            return read();
        case REPLACING:
            // we've fully matched the pattern and are returning bytes from the replacement
            next=replacement[replacedIndex++];
            if(replacedIndex==replacement.length) {
                state=State.NOT_MATCHED;
                replacedIndex=0;
            }
            return next;
        case UNBUFFER:
            // we partially matched the pattern before encountering a non matching byte
            // we need to serve up the buffered bytes before we go back to NOT_MATCHED
            next=buf[unbufferIndex++];
            if(unbufferIndex==matchedIndex) {
                state=State.NOT_MATCHED;
                matchedIndex=0;
            }
            return next;

        default:
            throw new IllegalStateException("no such state " + state);
        }
    }

    @Override
    public String toString() {
        return state.name() + " " + matchedIndex + " " + replacedIndex + " " + unbufferIndex;
    }

}

答案 3 :(得分:2)

字节流(InputStream)上没有任何内置的搜索和替换功能。

并且,有效且正确地完成此任务的方法并不是立即显而易见的。我已经为流实现了Boyer-Moore算法,它运行良好,但需要一些时间。如果没有这样的算法,你必须采用蛮力方法,look for the pattern starting at every position in the stream,可能会很慢。

即使您将HTML解码为文本using a regular expression to match patterns might be a bad idea,,因为HTML不是“常规”语言。

所以,即使你遇到了一些困难,我建议你采用原始方法将HTML解析为文档。虽然您在使用字符编码时遇到了问题,但从长远来看,修复正确的解决方案可能会比判断错误的解决方案更容易。

答案 4 :(得分:1)

我需要一个解决方案,但发现这里的答案导致过多的内存和/或CPU开销。根据简单的基准测试,以下解决方案在这些方面明显优于其他解决方案。

此解决方案特别节省内存,即使使用> GB流,也不会产生可衡量的成本。

也就是说,这不是零CPU成本的解决方案。除了最苛刻/对资源敏感的方案外,CPU /处理时间开销对于所有其他情况可能都是合理的,但是开销是真实的,在评估在给定上下文中采用此解决方案的价值时应考虑这些开销。

就我而言,我们正在处理的最大实际文件大小约为6MB,其中替换了44个URL,从而增加了约170ms的延迟。这适用于在具有单个CPU共享(1024)的AWS ECS上运行的基于Zuul的反向代理。对于大多数文件(小于100KB),增加的延迟不到1毫秒。在高并发性(并因此导致CPU争用)下,增加的延迟可能会增加,但是,我们目前能够在单个节点上同时处理数百个文件,而不会引起明显的延迟影响。

我们正在使用的解决方案:

import java.io.IOException;
import java.io.InputStream;

public class TokenReplacingStream extends InputStream {

    private final InputStream source;
    private final byte[] oldBytes;
    private final byte[] newBytes;
    private int tokenMatchIndex = 0;
    private int bytesIndex = 0;
    private boolean unwinding;
    private int mismatch;
    private int numberOfTokensReplaced = 0;

    public TokenReplacingStream(InputStream source, byte[] oldBytes, byte[] newBytes) {
        assert oldBytes.length > 0;
        this.source = source;
        this.oldBytes = oldBytes;
        this.newBytes = newBytes;
    }

    @Override
    public int read() throws IOException {

        if (unwinding) {
            if (bytesIndex < tokenMatchIndex) {
                return oldBytes[bytesIndex++];
            } else {
                bytesIndex = 0;
                tokenMatchIndex = 0;
                unwinding = false;
                return mismatch;
            }
        } else if (tokenMatchIndex == oldBytes.length) {
            if (bytesIndex == newBytes.length) {
                bytesIndex = 0;
                tokenMatchIndex = 0;
                numberOfTokensReplaced++;
            } else {
                return newBytes[bytesIndex++];
            }
        }

        int b = source.read();
        if (b == oldBytes[tokenMatchIndex]) {
            tokenMatchIndex++;
        } else if (tokenMatchIndex > 0) {
            mismatch = b;
            unwinding = true;
        } else {
            return b;
        }

        return read();

    }

    @Override
    public void close() throws IOException {
        source.close();
    }

    public int getNumberOfTokensReplaced() {
        return numberOfTokensReplaced;
    }

}

答案 5 :(得分:1)

当我需要在Servlet中为模板文件提供服务时,我想到了这段简单的代码,用值替换了某个关键字。它应该非常快并且内存不足。然后,我猜想使用管道流可以将其用于各种各样的事情。

/ JC

public static void replaceStream(InputStream in, OutputStream out, String search, String replace) throws IOException
{
    replaceStream(new InputStreamReader(in), new OutputStreamWriter(out), search, replace);
}

public static void replaceStream(Reader in, Writer out, String search, String replace) throws IOException
{
    char[] searchChars = search.toCharArray();
    int[] buffer = new int[searchChars.length];

    int x, r, si = 0, sm = searchChars.length;
    while ((r = in.read()) > 0) {

        if (searchChars[si] == r) {
            // The char matches our pattern
            buffer[si++] = r;

            if (si == sm) {
                // We have reached a matching string
                out.write(replace);
                si = 0;
            }
        } else if (si > 0) {
            // No match and buffered char(s), empty buffer and pass the char forward
            for (x = 0; x < si; x++) {
                out.write(buffer[x]);
            }
            si = 0;
            out.write(r);
        } else {
            // No match and nothing buffered, just pass the char forward
            out.write(r);
        }
    }

    // Empty buffer
    for (x = 0; x < si; x++) {
        out.write(buffer[x]);
    }
}