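I have an InputStream that is given an HTML file as input. I need to get the bytes from the input stream.
I have a string: "XYZ". I want to convert this string to bytes and check whether that byte sequence occurs in the bytes I read from the InputStream. If it does, I have to replace the match with the bytes of another string.
Can anyone help me with this? I have used regular expressions to find and replace, but I have no idea how to find and replace in a byte stream.
Previously I used jsoup to parse the HTML and replace the string, but because of some UTF encoding issue the file appeared to be corrupted when I did that.
TL;DR: my question is:
Is there a way to find and replace a string, in byte form, in a raw InputStream in Java?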
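Answer 0: (score: 28)
I'm not sure you have chosen the best approach to solving your problem.
That said, I don't like to (and have a policy not to) answer questions with "don't", so here goes...
From the FilterInputStream documentation:
A FilterInputStream contains some other input stream, which it uses as its basic source of data, possibly transforming the data along the way or providing additional functionality.
It was a fun exercise to write it up. Here is a complete example for you: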
import java.io.*;
import java.util.*;

class ReplacingInputStream extends FilterInputStream {

    // Look-ahead read from the wrapped stream (0-255, or -1 for end of stream).
    LinkedList<Integer> inQueue = new LinkedList<Integer>();
    // Values that are ready to be returned to the caller.
    LinkedList<Integer> outQueue = new LinkedList<Integer>();
    final byte[] search, replacement;

    protected ReplacingInputStream(InputStream in,
                                   byte[] search,
                                   byte[] replacement) {
        super(in);
        this.search = search;
        this.replacement = replacement;
    }

    private boolean isMatchFound() {
        Iterator<Integer> inIter = inQueue.iterator();
        for (int i = 0; i < search.length; i++)
            // Mask to 0-255: read() returns unsigned values, but byte is signed.
            if (!inIter.hasNext() || (search[i] & 0xFF) != inIter.next())
                return false;
        return true;
    }

    private void readAhead() throws IOException {
        // Work up some look-ahead.
        while (inQueue.size() < search.length) {
            int next = super.read();
            inQueue.offer(next);
            if (next == -1)
                break;
        }
    }

    @Override
    public int read() throws IOException {
        // Only work out the next byte(s) if none are pending already.
        // Looping (instead of a single if) also covers an empty replacement.
        while (outQueue.isEmpty()) {
            readAhead();
            if (isMatchFound()) {
                for (int i = 0; i < search.length; i++)
                    inQueue.remove();
                for (byte b : replacement)
                    outQueue.offer(b & 0xFF); // unsigned, so 0xFF is not mistaken for -1
            } else
                outQueue.add(inQueue.remove());
        }
        return outQueue.remove();
    }

    // TODO: Override the other read methods. FilterInputStream's
    // read(byte[], int, int) delegates to the wrapped stream and
    // would bypass the replacement.
}

class Test {
    public static void main(String[] args) throws Exception {

        byte[] bytes = "hello xyz world.".getBytes("UTF-8");

        ByteArrayInputStream bis = new ByteArrayInputStream(bytes);

        byte[] search = "xyz".getBytes("UTF-8");
        byte[] replacement = "abc".getBytes("UTF-8");

        InputStream ris = new ReplacingInputStream(bis, search, replacement);

        ByteArrayOutputStream bos = new ByteArrayOutputStream();

        int b;
        while (-1 != (b = ris.read()))
            bos.write(b);

        // Decode with the same charset that was used to encode.
        System.out.println(new String(bos.toByteArray(), "UTF-8"));
    }
}
Given the bytes of the string "hello xyz world." it prints:
hello abc world.
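The TODO in the class above matters in practice: the read(byte[], int, int) inherited from FilterInputStream hands the call straight to the wrapped stream, so a caller that reads in chunks would never see the replacement. Below is a minimal sketch of one way to close that gap, assuming it is acceptable to fill the buffer one byte at a time through read(). The names ByteWiseFilterInputStream, XToYStream and BulkReadDemo are made up for this illustration and are not part of the answer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical base class: routes bulk reads through read(), so a subclass
// only has to get the single-byte read() right.
abstract class ByteWiseFilterInputStream extends FilterInputStream {
    protected ByteWiseFilterInputStream(InputStream in) {
        super(in);
    }

    // read(byte[]) needs no override: FilterInputStream implements it as
    // read(b, 0, b.length), which lands here.
    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        if (off < 0 || len < 0 || len > b.length - off)
            throw new IndexOutOfBoundsException();
        if (len == 0)
            return 0;
        int count = 0;
        while (count < len) {
            int next = read();           // our own read(), not the wrapped stream's
            if (next == -1)
                break;
            b[off + count++] = (byte) next;
        }
        return count == 0 ? -1 : count;  // -1 only if nothing could be read
    }
}

// Toy subclass used only to exercise the base class: turns every 'x' into 'y'.
class XToYStream extends ByteWiseFilterInputStream {
    XToYStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int next = super.read();
        return next == 'x' ? 'y' : next;
    }
}

class BulkReadDemo {
    // Reads the stream in 4-byte chunks and returns the content as a string.
    static String readInChunks(InputStream in) {
        try {
            ByteArrayOutputStream collected = new ByteArrayOutputStream();
            byte[] chunk = new byte[4];
            int n;
            while ((n = in.read(chunk)) != -1)
                collected.write(chunk, 0, n);
            return new String(collected.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] input = "xyz xyz".getBytes(StandardCharsets.UTF_8);
        // prints "yyz yyz"
        System.out.println(readInChunks(new XToYStream(new ByteArrayInputStream(input))));
    }
}
```

Going byte by byte is simple but gives up the speed of real bulk reads; skip() and available() would still reflect the wrapped stream and may need similar care.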
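Answer 1: (score: 4)
The following approach will work, but I don't know how big the impact on performance is.
1. Wrap the InputStream in an InputStreamReader,
2. wrap the InputStreamReader in a FilterReader that replaces the strings, then
3. wrap the FilterReader in a ReaderInputStream.
It is crucial to choose the appropriate encoding; otherwise the content of the stream will be corrupted.
If you want to use regular expressions to replace the strings, you can use Streamflyer, a tool of mine, which is a convenient alternative to FilterReader. You will find an example for byte streams on Streamflyer's web page. Hope this helps.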
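To illustrate why the encoding matters in a decode, replace, re-encode chain, here is a small sketch that uses only JDK classes. It reads the whole stream into memory rather than chaining a FilterReader and a ReaderInputStream, so it only suits small inputs. The class name CharsetRoundTrip and its replace method are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class CharsetRoundTrip {
    // Decode the stream with the given charset, replace text,
    // then encode the result with the same charset.
    static byte[] replace(InputStream in, Charset charset, String search, String replacement) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] chunk = new char[1024];
            int n;
            while ((n = reader.read(chunk)) != -1)
                text.append(chunk, 0, n);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString().replace(search, replacement).getBytes(charset);
    }

    public static void main(String[] args) {
        // "caf\u00e9" contains one non-ASCII character (two bytes in UTF-8).
        byte[] original = "caf\u00e9 xyz".getBytes(StandardCharsets.UTF_8);
        byte[] expected = "caf\u00e9 abc".getBytes(StandardCharsets.UTF_8);

        // Decoding with the charset the data was written in: only "xyz" changes.
        byte[] good = replace(new ByteArrayInputStream(original),
                StandardCharsets.UTF_8, "xyz", "abc");
        System.out.println("UTF-8 round trip intact: " + Arrays.equals(good, expected));

        // Decoding as US-ASCII: the non-ASCII bytes cannot be decoded and are lost.
        byte[] bad = replace(new ByteArrayInputStream(original),
                StandardCharsets.US_ASCII, "xyz", "abc");
        System.out.println("US-ASCII round trip intact: " + Arrays.equals(bad, expected));
    }
}
```

The replacement itself succeeds in both runs; what differs is whether the bytes around it survive, which is the kind of corruption the question describes.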
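Answer 2: (score: 4)
I needed something like this as well and decided to roll my own solution instead of using @aioobe's example above. Have a look at the code. You can pull the library from Maven Central or just copy the source code.
This is how you use it. In this case I'm using nested instances to replace two patterns, fixing both DOS and Mac line endings.
new ReplacingInputStream(new ReplacingInputStream(is, "\r\n", "\n"), "\r", "\n");
Here is the full source code: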
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/**
 * Simple FilterInputStream that can replace occurrences of bytes with something else.
 */
public class ReplacingInputStream extends FilterInputStream {

    // while matching, this is where the bytes go.
    int[] buf = null;
    int matchedIndex = 0;
    int unbufferIndex = 0;
    int replacedIndex = 0;

    private final byte[] pattern;
    private final byte[] replacement;
    private State state = State.NOT_MATCHED;

    // simple state machine for keeping track of what we are doing
    private enum State {
        NOT_MATCHED,
        MATCHING,
        REPLACING,
        UNBUFFER
    }

    /**
     * @param is input
     * @return nested replacing stream that replaces \r\n (DOS) and \r (MAC) line endings with UNIX ones "\n".
     */
    public static InputStream newLineNormalizingInputStream(InputStream is) {
        return new ReplacingInputStream(new ReplacingInputStream(is, "\r\n", "\n"), "\r", "\n");
    }

    /**
     * Replace occurrences of pattern in the input. Note: input is assumed to be UTF-8 encoded. If not the case use byte[] based pattern and replacement.
     * @param in input
     * @param pattern pattern to replace.
     * @param replacement the replacement or null
     */
    public ReplacingInputStream(InputStream in, String pattern, String replacement) {
        this(in, pattern.getBytes(StandardCharsets.UTF_8), replacement == null ? null : replacement.getBytes(StandardCharsets.UTF_8));
    }

    /**
     * Replace occurrences of pattern in the input.
     * @param in input
     * @param pattern pattern to replace
     * @param replacement the replacement or null
     */
    public ReplacingInputStream(InputStream in, byte[] pattern, byte[] replacement) {
        super(in);
        // plain JDK checks, used here in place of the Validate helper calls
        if (pattern == null) {
            throw new NullPointerException("pattern");
        }
        if (pattern.length == 0) {
            throw new IllegalArgumentException("pattern length should be > 0");
        }
        this.pattern = pattern;
        this.replacement = replacement;
        // we will never match more than the pattern length
        buf = new int[pattern.length];
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        // copy of parent logic; we need to call our own read() instead of super.read(), which delegates instead of calling our read
        if (b == null) {
            throw new NullPointerException();
        } else if (off < 0 || len < 0 || len > b.length - off) {
            throw new IndexOutOfBoundsException();
        } else if (len == 0) {
            return 0;
        }
        int c = read();
        if (c == -1) {
            return -1;
        }
        b[off] = (byte) c;
        int i = 1;
        try {
            for (; i < len; i++) {
                c = read();
                if (c == -1) {
                    break;
                }
                b[off + i] = (byte) c;
            }
        } catch (IOException ee) {
            // as in InputStream.read(byte[], int, int): return what was read so far
        }
        return i;
    }

    @Override
    public int read(byte[] b) throws IOException {
        // call our own read
        return read(b, 0, b.length);
    }

    // a full match was found: serve the replacement next, or nothing at all if it is empty
    private void startReplacing() {
        if (replacement == null || replacement.length == 0) {
            state = State.NOT_MATCHED;
            matchedIndex = 0;
        } else {
            state = State.REPLACING;
            replacedIndex = 0;
        }
    }

    @Override
    public int read() throws IOException {
        // use a simple state machine to figure out what we are doing
        int next;
        switch (state) {
        case NOT_MATCHED:
            // we are not currently matching, replacing, or unbuffering
            next = super.read();
            // mask to 0-255: read() returns unsigned values, byte is signed
            if ((pattern[0] & 0xFF) == next) {
                // clear whatever was there
                buf = new int[pattern.length];
                // make sure we start at 0
                matchedIndex = 0;
                buf[matchedIndex++] = next;
                if (pattern.length == 1) {
                    // edge case: a pattern of length 1 is fully matched already
                    startReplacing();
                } else {
                    // pattern is longer than 1, keep matching
                    state = State.MATCHING;
                }
                // recurse to continue matching
                return read();
            } else {
                return next;
            }
        case MATCHING:
            // the previous bytes matched part of the pattern
            next = super.read();
            if ((pattern[matchedIndex] & 0xFF) == next) {
                buf[matchedIndex++] = next;
                if (matchedIndex == pattern.length) {
                    // we've found a full match!
                    startReplacing();
                }
            } else {
                // mismatch -> unbuffer
                // note: the buffered bytes are passed through as-is and are not
                // re-scanned for the start of a new match
                buf[matchedIndex++] = next;
                state = State.UNBUFFER;
                unbufferIndex = 0;
            }
            return read();
        case REPLACING:
            // we've fully matched the pattern and are returning bytes from the replacement
            next = replacement[replacedIndex++] & 0xFF;
            if (replacedIndex == replacement.length) {
                state = State.NOT_MATCHED;
                replacedIndex = 0;
            }
            return next;
        case UNBUFFER:
            // we partially matched the pattern before encountering a non matching byte
            // we need to serve up the buffered bytes before we go back to NOT_MATCHED
            next = buf[unbufferIndex++];
            if (unbufferIndex == matchedIndex) {
                state = State.NOT_MATCHED;
                matchedIndex = 0;
            }
            return next;
        default:
            throw new IllegalStateException("no such state " + state);
        }
    }

    @Override
    public String toString() {
        return state.name() + " " + matchedIndex + " " + replacedIndex + " " + unbufferIndex;
    }
}
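Answer 3: (score: 2)
There is no built-in search-and-replace functionality for byte streams (InputStream).
And a way to do this efficiently and correctly is not immediately obvious. I have implemented the Boyer-Moore algorithm for streams, and it works well, but it took some time. Without an algorithm like this, you have to fall back on a brute-force approach that looks for the pattern starting at every position in the stream, which can be slow.
Even if you decode the HTML as text, using a regular expression to match patterns might be a bad idea, since HTML is not a "regular" language.
So, even though you have run into some difficulties, I suggest you pursue your original approach of parsing the HTML as a document. You are having trouble with the character encoding, but in the long run it will probably be easier to fix the right solution than to jury-rig the wrong one.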
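For comparison with the brute-force scan, here is a sketch of Boyer-Moore-Horspool, a simplified relative of Boyer-Moore, searching an in-memory byte array. It is not the streaming implementation this answer mentions, which is not shown in the thread; a streaming version would additionally have to keep a window of recent bytes. HorspoolSearch and indexOf are names chosen for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class HorspoolSearch {
    // Returns the index of the first occurrence of pattern in data
    // at or after position from, or -1 if there is none.
    static int indexOf(byte[] data, byte[] pattern, int from) {
        int m = pattern.length;
        if (m == 0)
            throw new IllegalArgumentException("pattern must not be empty");

        // shift[b]: how far the window may slide when its last byte is b.
        // Bytes that do not occur in the pattern allow a full-length jump.
        int[] shift = new int[256];
        Arrays.fill(shift, m);
        for (int i = 0; i < m - 1; i++)
            shift[pattern[i] & 0xFF] = m - 1 - i;

        int pos = Math.max(from, 0);
        while (pos <= data.length - m) {
            // Compare the window against the pattern from right to left.
            int j = m - 1;
            while (j >= 0 && data[pos + j] == pattern[j])
                j--;
            if (j < 0)
                return pos;
            pos += shift[data[pos + m - 1] & 0xFF];
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] data = "hello xyz world".getBytes(StandardCharsets.UTF_8);
        byte[] pattern = "xyz".getBytes(StandardCharsets.UTF_8);
        System.out.println(indexOf(data, pattern, 0)); // prints 6
        System.out.println(indexOf(data, pattern, 7)); // prints -1
    }
}
```

The gain over brute force comes from the shift table: when the last byte of the window does not occur in the pattern, the search skips a whole pattern length instead of advancing by one.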
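Answer 4: (score: 1)
I needed a solution to this, but found that the answers here incurred too much memory and/or CPU overhead. In simple benchmarks, the solution below significantly outperformed the others here in those respects.
This solution is especially memory-efficient, incurring no measurable cost even with streams larger than a gigabyte.
That said, it is not a zero-CPU-cost solution. The CPU/processing-time overhead is probably reasonable for all but the most demanding or resource-sensitive scenarios, but the overhead is real and should be considered when evaluating whether this solution is worthwhile in a given context.
In my case, the largest real-world file we process is about 6 MB, with 44 URLs replaced, and there we see about 170 ms of added latency. This is for a Zuul-based reverse proxy running on AWS ECS with a single CPU share (1024). For most files (under 100 KB), the added latency is below 1 ms. Under high concurrency (and thus CPU contention) the added latency could increase; however, we are currently able to process hundreds of files concurrently on a single node with no noticeable latency impact.
The solution we are using: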
import java.io.IOException;
import java.io.InputStream;

public class TokenReplacingStream extends InputStream {

    private final InputStream source;
    private final byte[] oldBytes;
    private final byte[] newBytes;
    private int tokenMatchIndex = 0;  // number of oldBytes matched so far
    private int bytesIndex = 0;       // position within the bytes currently being emitted
    private boolean unwinding;        // true while replaying a failed partial match
    private int mismatch;             // the byte (or -1) that ended the partial match
    private int numberOfTokensReplaced = 0;

    public TokenReplacingStream(InputStream source, byte[] oldBytes, byte[] newBytes) {
        // explicit check: assert statements are disabled by default at runtime
        if (oldBytes.length == 0) {
            throw new IllegalArgumentException("oldBytes must not be empty");
        }
        this.source = source;
        this.oldBytes = oldBytes;
        this.newBytes = newBytes;
    }

    @Override
    public int read() throws IOException {

        if (unwinding) {
            // Note: the replayed bytes are not re-checked as the start of a new match.
            if (bytesIndex < tokenMatchIndex) {
                // mask to 0-255 so bytes >= 0x80 are not returned as negative values
                return oldBytes[bytesIndex++] & 0xFF;
            } else {
                bytesIndex = 0;
                tokenMatchIndex = 0;
                unwinding = false;
                return mismatch;
            }
        } else if (tokenMatchIndex == oldBytes.length) {
            // full match: emit the replacement, then resume normal reading
            if (bytesIndex == newBytes.length) {
                bytesIndex = 0;
                tokenMatchIndex = 0;
                numberOfTokensReplaced++;
            } else {
                return newBytes[bytesIndex++] & 0xFF;
            }
        }

        int b = source.read();
        if (b == (oldBytes[tokenMatchIndex] & 0xFF)) {
            tokenMatchIndex++;
        } else if (tokenMatchIndex > 0) {
            mismatch = b;
            unwinding = true;
        } else {
            return b;
        }

        return read();
    }

    @Override
    public void close() throws IOException {
        source.close();
    }

    public int getNumberOfTokensReplaced() {
        return numberOfTokensReplaced;
    }
}
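One limitation of a matcher that simply restarts after a failed partial match: the bytes it already consumed are passed through without being considered as the start of a new match. Searching for "aab" in "aaab", for example, matches "aa", fails on the third "a", replays all three bytes, and then misses the occurrence that starts at the second byte. The sketch below shows one possible way around this using the Knuth-Morris-Pratt failure table. It is a different technique from the class above, comes with no performance claims, and KmpReplacingInputStream and KmpReplaceDemo are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;

// Hypothetical class for illustration; not part of the answer above.
class KmpReplacingInputStream extends InputStream {
    private final InputStream source;
    private final byte[] pattern;
    private final byte[] replacement;
    private final int[] failure;  // KMP failure table for pattern
    private final ArrayDeque<Integer> pending = new ArrayDeque<>(); // ready to return
    private int matched = 0;      // pattern bytes matched and still held back
    private boolean eof = false;

    KmpReplacingInputStream(InputStream source, byte[] pattern, byte[] replacement) {
        if (pattern.length == 0)
            throw new IllegalArgumentException("pattern must not be empty");
        this.source = source;
        this.pattern = pattern;
        this.replacement = replacement;
        this.failure = new int[pattern.length];
        // failure[i]: length of the longest proper prefix of pattern[0..i]
        // that is also a suffix of it.
        for (int i = 1, k = 0; i < pattern.length; i++) {
            while (k > 0 && pattern[i] != pattern[k])
                k = failure[k - 1];
            if (pattern[i] == pattern[k])
                k++;
            failure[i] = k;
        }
    }

    // Queue the first n held-back bytes; they can no longer be part of a match.
    private void emitPrefix(int n) {
        for (int i = 0; i < n; i++)
            pending.add(pattern[i] & 0xFF);
    }

    @Override
    public int read() throws IOException {
        while (pending.isEmpty() && !eof) {
            int c = source.read();
            if (c == -1) {
                eof = true;
                emitPrefix(matched);  // flush a trailing partial match unchanged
                matched = 0;
                break;
            }
            // On a mismatch, fall back to the longest shorter prefix that still
            // fits and release the bytes that dropped out of the candidate match.
            while (matched > 0 && (pattern[matched] & 0xFF) != c) {
                int shorter = failure[matched - 1];
                emitPrefix(matched - shorter);
                matched = shorter;
            }
            if ((pattern[matched] & 0xFF) == c)
                matched++;
            else
                pending.add(c);
            if (matched == pattern.length) {
                for (byte b : replacement)
                    pending.add(b & 0xFF);
                matched = 0;
            }
        }
        return pending.isEmpty() ? -1 : pending.remove();
    }

    @Override
    public void close() throws IOException {
        source.close();
    }
}

class KmpReplaceDemo {
    // Convenience wrapper for trying the stream on small UTF-8 strings.
    static String replace(String input, String search, String replacement) {
        byte[] in = input.getBytes(StandardCharsets.UTF_8);
        try (InputStream stream = new KmpReplacingInputStream(new ByteArrayInputStream(in),
                search.getBytes(StandardCharsets.UTF_8),
                replacement.getBytes(StandardCharsets.UTF_8))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            int b;
            while ((b = stream.read()) != -1)
                out.write(b);
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(replace("aaab", "aab", "X"));              // prints aX
        System.out.println(replace("hello xyz world", "xyz", "abc")); // prints hello abc world
    }
}
```

Whether the extra bookkeeping is worth it depends on the tokens involved: for patterns such as URLs that rarely overlap with themselves, the restart approach will usually give the same output.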
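Answer 5: (score: 1)
I came up with this simple piece of code when I needed to serve a template file in a Servlet, replacing a certain keyword with a value. It should be pretty fast and light on memory. Then, using piped streams, I guess you could use it for all kinds of things.
/JC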
import java.io.*;
import java.nio.charset.StandardCharsets;

public static void replaceStream(InputStream in, OutputStream out, String search, String replace) throws IOException
{
    // Assumes UTF-8 on both sides; use the stream's real charset if it differs.
    Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
    replaceStream(new InputStreamReader(in, StandardCharsets.UTF_8), writer, search, replace);
    // Without a flush, characters buffered in the writer never reach out.
    writer.flush();
}

public static void replaceStream(Reader in, Writer out, String search, String replace) throws IOException
{
    char[] searchChars = search.toCharArray();
    int[] buffer = new int[searchChars.length];

    int x, r, si = 0, sm = searchChars.length;
    // read() returns -1 at end of stream; 0 is a valid character
    while ((r = in.read()) != -1) {

        if (searchChars[si] == r) {
            // The char matches our pattern
            buffer[si++] = r;

            if (si == sm) {
                // We have reached a matching string
                out.write(replace);
                si = 0;
            }
        } else if (si > 0) {
            // No match and buffered char(s), empty buffer and pass the char forward
            // (the flushed chars are not re-checked as the start of a new match)
            for (x = 0; x < si; x++) {
                out.write(buffer[x]);
            }
            si = 0;
            out.write(r);
        } else {
            // No match and nothing buffered, just pass the char forward
            out.write(r);
        }
    }

    // Empty buffer
    for (x = 0; x < si; x++) {
        out.write(buffer[x]);
    }
}
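The piped-stream idea mentioned in this answer can be sketched roughly as follows: a push-style method that copies from an input to an output runs on a background thread and writes into a PipedOutputStream, while the caller reads the result from the connected PipedInputStream. The Transformer interface, the asInputStream helper and the dash-to-plus demo are hypothetical names and stand-ins for this sketch; a method like replaceStream above would be plugged in as the transformer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class PipedTransform {
    // Hypothetical callback type: reads everything from in, writes the result to out.
    interface Transformer {
        void transform(InputStream in, OutputStream out) throws IOException;
    }

    // Runs the transformer on a background thread and exposes its output as an InputStream.
    static InputStream asInputStream(InputStream in, Transformer transformer) throws IOException {
        PipedInputStream result = new PipedInputStream();
        PipedOutputStream sink = new PipedOutputStream(result);
        Thread worker = new Thread(() -> {
            // Closing the sink is what signals end of stream to the reading side.
            try (OutputStream out = sink) {
                transformer.transform(in, out);
            } catch (IOException e) {
                // In this sketch a failure just ends the stream early;
                // real code should report it to the reader.
            }
        });
        worker.setDaemon(true);
        worker.start();
        return result;
    }

    // Demo: replaces every '-' with '+' through the pipe and returns the result.
    static String demo(String input) {
        Transformer dashToPlus = (src, dst) -> {
            int c;
            while ((c = src.read()) != -1)
                dst.write(c == '-' ? '+' : c);
        };
        try (InputStream piped = asInputStream(
                new ByteArrayInputStream(input.getBytes(StandardCharsets.UTF_8)), dashToPlus)) {
            ByteArrayOutputStream collected = new ByteArrayOutputStream();
            int b;
            while ((b = piped.read()) != -1)
                collected.write(b);
            return new String(collected.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo("a-b-c")); // prints a+b+c
    }
}
```

The cost of this design is an extra thread per stream, and the reader and writer must stay on different threads because a pipe has a bounded buffer.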