Question

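I was wondering if anyone knows a good way to get all the text on the current web page from a Java application.

I have tried two methods:

OCR: This was not accurate enough for me, as only about 60% of the text came out correct. It also only captured the text visible in the screenshot, and I need all the text on the page.
Robot class: The method I use now is to have the Robot class send Control-A and Control-C, then read the text from the clipboard. This has worked well for getting the text. The only problem is that the user sees the text being highlighted, which I don't want them to see.

This may sound like some form of spyware, but it is a final-year university project: an anti-cyberbullying/child-grooming program that will only store information when foul play has been detected.

Can anyone think of a better way of getting the text out of the browser?

Many thanks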

Answer 1

Get the URL and read the page with an HTTP client class, e.g. Apache Commons HttpClient.

See here for details: http://hc.apache.org/httpclient-3.x/tutorial.html
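The linked tutorial covers the Commons HttpClient 3.x line, which has since been superseded by Apache HttpComponents. If adding a library is not an option, the same idea can be sketched with the JDK's own java.net.http.HttpClient (Java 11+). This is a swapped-in technique, not the library this answer names; the class name PageFetcher, the helper buildRequest, and the User-Agent value are made up for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper class; the names here are illustrative only.
public class PageFetcher {

    // Builds a plain GET request for the given URL.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .GET()
                .header("User-Agent", "PageFetcher-sketch") // assumed value
                .build();
    }

    // Sends the request and returns the response body (the page's HTML) as a String.
    public static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java PageFetcher <url>");
            return;
        }
        System.out.println(fetch(args[0]));
    }
}
```

Note that this only sees what the server returns for that URL. It does not see the page as rendered in the user's browser (login state, content added by scripts).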

Answer 2

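You can use URLConnection or Apache's HttpClient to get all the HTML from the website. Here is a question that explains how to do it: Get html file Java

Of course, it won't give you the text inside binary content (e.g. Flash files), images, and so on. For those, only OCR will work.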
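For the URLConnection route, a minimal sketch might look like the following. The class name UrlConnectionFetch, the 10-second timeouts, and the UTF-8 assumption are mine, not from the linked question; a real page may declare a different charset in its Content-Type header.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URI;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; names and timeout values are illustrative only.
public class UrlConnectionFetch {

    // Opens a connection to the URL and reads the whole response as text.
    public static String fetch(URL url) throws IOException {
        URLConnection conn = url.openConnection();
        conn.setConnectTimeout(10_000); // assumed timeout, in milliseconds
        conn.setReadTimeout(10_000);

        StringBuilder sb = new StringBuilder();
        // Assumption: the response is UTF-8 encoded.
        try (InputStream in = conn.getInputStream();
             Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buf = new char[2048];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java UrlConnectionFetch <url>");
            return;
        }
        System.out.println(fetch(URI.create(args[0]).toURL()));
    }
}
```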

Answer 3

You could try something like this (it uses Apache Commons HttpClient 3.x):

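import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;

HttpClient client = new HttpClient();
GetMethod get = new GetMethod("http://ThePage.com");
String htmlText;
try {
    // The request must be executed before the response body can be read
    client.executeMethod(get);
    InputStream in = get.getResponseBodyAsStream();
    htmlText = readString(in);
} finally {
    get.releaseConnection();
}

// Reads the whole stream into a String, assuming the page is UTF-8 encoded
static String readString(InputStream is) throws IOException {
    char[] buf = new char[2048];
    Reader r = new InputStreamReader(is, "UTF-8");
    StringBuilder s = new StringBuilder();
    while (true) {
        int n = r.read(buf);
        if (n < 0)
            break;
        s.append(buf, 0, n);
    }
    return s.toString();
}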
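
Note that htmlText above is raw HTML, while the question asks for the text on the page. One possible way to strip the markup without extra libraries is the HTML parser that ships with Swing. This is only a rough sketch: the class name HtmlTextExtractor is made up, the Swing parser targets old HTML versions and may handle modern pages (scripts, styles, malformed markup) poorly, and a dedicated HTML parsing library would likely be more robust.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class; the name is illustrative only.
public class HtmlTextExtractor {

    // Returns the text chunks found in the HTML, joined by single spaces.
    public static String extractText(String html) throws IOException {
        final StringBuilder text = new StringBuilder();

        // The parser calls handleText once for each run of text between tags.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };

        try (Reader reader = new StringReader(html)) {
            // true = ignore any charset declared in the document; the input is already a String
            new ParserDelegator().parse(reader, callback, true);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><h1>Hello</h1><p>World</p></body></html>";
        System.out.println(extractText(html));
    }
}
```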

Answer 4

The most general solution is a traffic sniffer.

Answer 5

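Here is a utility class I created for this purpose. It has one version that throws only runtime exceptions and one that throws checked exceptions, and it can also verify the tail end of the retrieved source.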

import  java.io.BufferedInputStream;
import  java.io.EOFException;
import  java.io.IOException;
import  java.io.InputStream;
import  java.net.MalformedURLException;
import  java.net.URL;

/**
   <P>Append the source-code from a web-page into a <CODE>java.lang.Appendable</CODE>.</P>

   <P>Demo: {@code java AppendWebPageSource}</P>
 **/
public class AppendWebPageSource  {
   public static final void main(String[] igno_red)  {
      String sHtml = AppendWebPageSource.get("http://usatoday.com", null);
      System.out.println(sHtml);   

      //Alternative:
      AppendWebPageSource.append(System.out, "http://usatoday.com", null);
   }
   /**
      <P>Get the source-code from a web page, with runtime-errors only.</P>

      @return  {@link #append(Appendable, String, String) append}{@code ((new StringBuilder()), s_httpUrl, s_endingString)}
    **/
   public static final String get(String s_httpUrl, String s_endingString)  {
      return  append((new StringBuilder()), s_httpUrl, s_endingString).toString();
   }
   /**
      <P>Append the source-code from a web page, with runtime-errors only.</P>

      @return  {@link #appendX(Appendable, String, String) appendX}{@code (ap_bl, s_httpUrl, s_endingString)}
      @exception  RuntimeException  Whose {@link Throwable#getCause() getCause()} contains the original {@link java.io.IOException} or {@code java.net.MalformedURLException}.
    **/
   public static final Appendable append(Appendable ap_bl, String s_httpUrl, String s_endingString)  {
      try  {
         return  appendX(ap_bl, s_httpUrl, s_endingString);
      }  catch(IOException iox)  {
         throw  new RuntimeException(iox);
      }
   }
   /**
      <P>Append the source-code from a web-page into a <CODE>java.lang.Appendable</CODE>.</P>

      <P><I>I got this from {@code <A HREF="http://www.davidreilly.com/java/java_network_programming/">http://www.davidreilly.com/java/java_network_programming/</A>}, item 2.3.</I></P>

      @param  ap_bl  May not be {@code null}.
      @param  s_httpUrl  May not be {@code null}, and must be a valid url.
      @param  s_endingString  If non-{@code null}, the web-page's source-code must end with this. May not be empty.
      @see  #get(String, String)
      @see  #append(Appendable, String, String)
    **/
   public static final Appendable appendX(Appendable ap_bl, String s_httpUrl, String s_endingString)  throws MalformedURLException, IOException  {
      if(s_httpUrl == null  ||  s_httpUrl.length() == 0)  {
         throw  new IllegalArgumentException("s_httpUrl (\"" + s_httpUrl + "\") is null or empty.");
      }
      if(s_endingString != null  &&  s_endingString.length() == 0)  {
         throw  new IllegalArgumentException("s_endingString is non-null and empty.");
      }

      // Create an URL instance
      URL url = new URL(s_httpUrl);

      // Get an input stream for reading
      InputStream is = url.openStream();

      // Create a buffered input stream for efficiency
      BufferedInputStream bis = new BufferedInputStream(is);

      int ixEndStr = 0;

      // Repeat until end of file
      while(true)  {
         int iChar = bis.read();

         // Check for EOF
         if (iChar == -1)  {
            break;
         }

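         //Note: each byte is cast directly to a char, which is only
         //correct for single-byte encodings such as ISO-8859-1.
         char c = (char)iChar;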

         try  {
            ap_bl.append(c);
         }  catch(NullPointerException npx)  {
            throw  new NullPointerException("ap_bl");
         }

         if(s_endingString != null)  {
            //There is an ending string;
            char[] ac = s_endingString.toCharArray();

            if(c == ac[ixEndStr])  {
               //The character just retrieved is equal to the
               //next character in the ending string.

               if(ixEndStr == (ac.length - 1))  {
                  //The entire string has been found. Done.
                  return ap_bl;
               }

               ixEndStr++;
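            }  else  {
               //Mismatch: restart the match, but count this character if
               //it is itself the first character of the ending string.
               //(A simple restart, not a full KMP-style fallback.)
               ixEndStr = ((c == ac[0]) ? 1 : 0);
            }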
         }
      }

      if(s_endingString != null)  {
         //Should have exited at the "return" above.
         throw  new EOFException("s_endingString (\"" + s_endingString + "\") is non-null, and was not found at the end of the web-page's source-code.");
      }
      return  ap_bl;
   }
}

Java - How to get text from a web browser?

5 answers: