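I was wondering if anyone knows a good way to get all of the text on the current web page from a Java application.
I have tried two methods:
OCR: This was not accurate enough for me, as only about 60% of the text came out correct. It also only picks up the text visible in the screenshot, and I need all of the text on the page.
Robot class: The method I use now is to have the Robot class send Control-A, Control-C and then read the text from the clipboard. This has proven useful for getting the text. The only problem is that the user sees the text being highlighted, which I don't want them to see.
This may sound like some form of spyware, but it is a final-year university project for an anti-cyberbullying/child-grooming program, and it will only store information when foul play is detected.
Can anyone think of a better way to get the text out of the browser?
Many thanks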
Answer 0 (score: 2)
Get the URL and use an HTTP client class to read the page, e.g. Apache Commons HTTPGet.
For details, see here: http://hc.apache.org/httpclient-3.x/tutorial.html
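Answer 1 (score: 1)
You can use URLConnection or Apache's HTTPClient to get all of the HTML from a website. This question explains how to do it: Get html file Java
Of course, it won't give you the text inside binary files (e.g. Flash files), images and so on. For those, only OCR will work.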
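As a rough illustration of the URLConnection route, here is a minimal sketch. The class name `PageFetcher`, the method names, the 10-second timeouts, and the assumption that the page is UTF-8 encoded are all hypothetical choices for the example, not part of the answer.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class PageFetcher {

    // Reads the whole stream into a String using the given charset,
    // closing the stream when done.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[2048];
        try (Reader r = new InputStreamReader(in, charset)) {
            int n;
            while ((n = r.read(buf)) >= 0) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    // Fetches the raw HTML at the given address.
    // Assumption: the response is UTF-8; the Content-Type header is not inspected.
    static String fetch(String address) throws IOException {
        URLConnection conn = URI.create(address).toURL().openConnection();
        conn.setConnectTimeout(10_000); // milliseconds
        conn.setReadTimeout(10_000);
        return readAll(conn.getInputStream(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(fetch(args.length > 0 ? args[0] : "http://example.com/"));
    }
}
```

Note that this returns the HTML as the server sends it; anything the browser adds later via scripts will not be included.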
Answer 2 (score: 1)
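You could try something like this (Apache Commons HttpClient 3.x):
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;

HttpClient client = new HttpClient();
GetMethod get = new GetMethod("http://ThePage.com");
String htmlText;
try {
    // The request must be executed before the response body can be read
    client.executeMethod(get);
    InputStream in = get.getResponseBodyAsStream();
    htmlText = readString(in);
} finally {
    get.releaseConnection();
}

// Reads the whole stream into a String, decoding it as UTF-8
static String readString(InputStream is) throws IOException {
    char[] buf = new char[2048];
    Reader r = new InputStreamReader(is, "UTF-8");
    StringBuilder s = new StringBuilder();
    while (true) {
        int n = r.read(buf);
        if (n < 0)
            break;
        s.append(buf, 0, n);
    }
    return s.toString();
}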
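The string this produces is HTML markup, while the question asks for the text on the page. Below is a rough sketch of reducing markup to text with regular expressions. The class `HtmlText` and its method are hypothetical names, only a handful of entities are decoded, and regex stripping is an assumption made for brevity; a real HTML parser would be more robust.

```java
import java.util.regex.Pattern;

final class HtmlText {
    // <script>...</script> and <style>...</style> blocks, including their contents
    private static final Pattern SCRIPT_OR_STYLE =
        Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Returns an approximation of the visible text in the given HTML.
    static String toText(String html) {
        String s = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        s = COMMENT.matcher(s).replaceAll(" ");
        s = TAG.matcher(s).replaceAll(" ");
        // Decode a few common entities; "&amp;" goes last so that
        // "&amp;lt;" becomes "&lt;" rather than "<".
        s = s.replace("&nbsp;", " ")
             .replace("&lt;", "<")
             .replace("&gt;", ">")
             .replace("&quot;", "\"")
             .replace("&amp;", "&");
        // Collapse runs of whitespace left behind by removed markup
        return WHITESPACE.matcher(s).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(toText("<p>Hello <b>world</b></p>"));
    }
}
```

Usage would be along the lines of `String pageText = HtmlText.toText(htmlText);`.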
Answer 3 (score: 0)
The most general solution is a traffic sniffer.
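Answer 4 (score: -1)
Here is a utility class I created for this purpose. It has a version that throws only runtime exceptions and a version that throws checked exceptions, and it can also verify that the retrieved source contains an expected ending string.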
import java.io.BufferedInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.net.MalformedURLException;
import java.net.URL;

/**
   <P>Append the source-code from a web-page into a <CODE>java.lang.Appendable</CODE>.</P>

   <P>Demo: {@code java AppendWebPageSource}</P>
 **/
public class AppendWebPageSource {
    public static final void main(String[] igno_red) {
        String sHtml = AppendWebPageSource.get("http://usatoday.com", null);
        System.out.println(sHtml);

        //Alternative:
        AppendWebPageSource.append(System.out, "http://usatoday.com", null);
    }
    /**
       <P>Get the source-code from a web page, with runtime-errors only.</P>

       @return {@link #append(Appendable, String, String) append}{@code ((new StringBuilder()), s_httpUrl, s_endingString)}
     **/
    public static final String get(String s_httpUrl, String s_endingString) {
        return append((new StringBuilder()), s_httpUrl, s_endingString).toString();
    }
    /**
       <P>Append the source-code from a web page, with runtime-errors only.</P>

       @return {@link #appendX(Appendable, String, String) appendX}{@code (ap_bl, s_httpUrl, s_endingString)}
       @exception RuntimeException Whose {@link Throwable#getCause() getCause()} contains the original {@link java.io.IOException} or {@link java.net.MalformedURLException}.
     **/
    public static final Appendable append(Appendable ap_bl, String s_httpUrl, String s_endingString) {
        try {
            return appendX(ap_bl, s_httpUrl, s_endingString);
        } catch (IOException iox) {
            // MalformedURLException is a subclass of IOException, so it is wrapped here too
            throw new RuntimeException(iox);
        }
    }
    /**
       <P>Append the source-code from a web-page into a <CODE>java.lang.Appendable</CODE>.</P>

       <P><I>I got this from <A HREF="http://www.davidreilly.com/java/java_network_programming/">http://www.davidreilly.com/java/java_network_programming/</A>, item 2.3.</I></P>

       @param ap_bl May not be {@code null}.
       @param s_httpUrl May not be {@code null} or empty, and must be a valid url.
       @param s_endingString If non-{@code null}, reading stops as soon as this string is found, and an {@code EOFException} is thrown if the page ends without it. May not be empty.
       @see #get(String, String)
       @see #append(Appendable, String, String)
     **/
    public static final Appendable appendX(Appendable ap_bl, String s_httpUrl, String s_endingString) throws MalformedURLException, IOException {
        if (s_httpUrl == null || s_httpUrl.length() == 0) {
            throw new IllegalArgumentException("s_httpUrl (\"" + s_httpUrl + "\") is null or empty.");
        }
        if (s_endingString != null && s_endingString.length() == 0) {
            throw new IllegalArgumentException("s_endingString is non-null and empty.");
        }
        if (ap_bl == null) {
            throw new NullPointerException("ap_bl");
        }
        // Create an URL instance
        URL url = new URL(s_httpUrl);
        // Convert the ending string once, rather than once per character read
        char[] ac = (s_endingString == null) ? null : s_endingString.toCharArray();
        int ixEndStr = 0;
        // Open the stream, buffer it for efficiency, and decode it as UTF-8
        // (an assumption: the page's actual charset is not inspected).
        // try-with-resources closes the stream on every exit path.
        try (InputStream is = url.openStream();
             java.io.Reader rdr = new java.io.InputStreamReader(
                 new BufferedInputStream(is), java.nio.charset.StandardCharsets.UTF_8)) {
            // Repeat until end of file
            while (true) {
                int iChar = rdr.read();
                // Check for EOF
                if (iChar == -1) {
                    break;
                }
                char c = (char) iChar;
                ap_bl.append(c);
                if (ac != null) {
                    //There is an ending string
                    if (c == ac[ixEndStr]) {
                        //The character just retrieved is equal to the
                        //next character in the ending string.
                        if (ixEndStr == (ac.length - 1)) {
                            //The entire string has been found. Done.
                            return ap_bl;
                        }
                        ixEndStr++;
                    } else {
                        //Mismatch: restart, counting this character if it
                        //begins a new match.
                        ixEndStr = (c == ac[0]) ? 1 : 0;
                    }
                }
            }
        }
        if (s_endingString != null) {
            //Should have exited at the "return" above.
            throw new EOFException("s_endingString (\"" + s_endingString + "\") is non-null, and was not found in the web-page's source-code.");
        }
        return ap_bl;
    }
}