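I am trying to parse an HTML string from a page in my web browser application to find the HTML for the data I want to retrieve. (In short, I am doing some web scraping of Google Images results.)
My find() function seems to work fine on moderately sized strings, but when it is given the full string, the HTML of the page I am trying to parse, it fails with a StringIndexOutOfBoundsException. Here is my find() function, along with the function I am calling it from:
find():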
// helper functions
public static int find(String stringToFind, int startPos, String str) throws NullPointerException,
        IllegalArgumentException
{
    // make sure that neither argument is null and not an empty String
    if ((stringToFind == null) || (str == null))
        throw new NullPointerException("null arguments are not allowed.");
    if ((stringToFind.equals("")) || (str.equals("")))
        throw new IllegalArgumentException("String arguments must be non-empty.");
    int position = startPos;
    // while we are not at the end of the String and the stringToFind is not found
    while (position != str.length())
    {
        // find the first character in the string
        position = str.indexOf(stringToFind.charAt(0), position+1);
        // if found
        if (position != -1)
        {
            int j = 0;
            // search the other characters in str for the other characters in stringToFind
            // while there is a character in str that matches its respective character in
            // stringToFind and we are not at the end of either str,stringToFind
            int firstCharacterPosition = -1;
            while ((str.charAt(position) == stringToFind.charAt(j)) &&
                    ((position < str.length()) && (j < stringToFind.length())))
            {
                if (firstCharacterPosition == -1)
                    firstCharacterPosition = position;
                // compare the next character in str with the next character in stringToFind
                // if the characters match and the characters being matched is the last
                // character in stringToFind
                if ((str.charAt(++position) == stringToFind.charAt(++j)) &&
                        (j == stringToFind.length() - 1))
                    // we are done here
                    return firstCharacterPosition;
            }
        }
        else break;
    }
    return -1;
}
The function that uses find():
public String getUserQuery()
{
    // find the element in the HTML that starts with "<input id=\"gbqfq\"" and return it
    index = find("<input id=\"gbqfq\"", index, searchPageHTML);
    System.out.printf("index == %d", index);
    try
    {
        return searchPageHTML.substring(index,
                searchPageHTML.indexOf('>', index));
    }
    catch (IndexOutOfBoundsException outOfBounds)
    {
        return "";
    }
}
The whole class (I am passing it a large String of HTML code):
import javax.swing.JFrame;
import javax.swing.JEditorPane;

public class SearchResultsHTMLParser
{
    private String searchPageHTML;
    private int index = -1;

    public SearchResultsHTMLParser(String html)
    {
        this.searchPageHTML = html;
        // setup a test GUI
        JFrame frame = new JFrame("GoogleImageTest");
        JEditorPane editorPane = new JEditorPane("text/html",
                this.getImagesDiv());
        frame.add(editorPane);
        frame.pack();
        frame.setVisible(true);
        frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
    }

    // make methods that parse this.html for the user input and the images
    /* The user input has an HTML id of "gbqfq", and the images all belong to the HTML class
     * "rg_di". The function for the user input should return the value of the input field as a
     * String, and the function for the images should simply return the substring that has all
     * of the images in it. (This will be parsed further for each individual image.)
     */
    public String getUserQuery()
    {
        // find the element in the HTML that starts with "<input id=\"gbqfq\"" and return it
        index = find("<input id=\"gbqfq\"", index, searchPageHTML);
        System.out.printf("index == %d", index);
        try
        {
            return searchPageHTML.substring(index,
                    searchPageHTML.indexOf('>', index));
        }
        catch (IndexOutOfBoundsException outOfBounds)
        {
            return "";
        }
    }

    /* This function will get the div with id="rg_s", and will probably not be used */
    public String getImagesDiv()
    {
        System.out.println("index == " + index);
        index = find("<div id=\"rg_s\"", index, searchPageHTML);
        System.out.println("index == " + index);
        System.out.printf("charAt(%d) == %c", index, searchPageHTML.charAt(index));
        String startOfNextDiv = "<div jsl=";
        int nextDivPos = find(startOfNextDiv, index, searchPageHTML);
        // return the substring of searchPageHTML from the start of the found image div container
        // to the end of it (it's ok if there is whitespace, so we could go up until the start of
        // next div container)
        return searchPageHTML.substring(index, nextDivPos);
    }

    // helper functions
    public static int find(String stringToFind, int startPos, String str) throws NullPointerException,
            IllegalArgumentException
    {
        // make sure that neither argument is null and not an empty String
        if ((stringToFind == null) || (str == null))
            throw new NullPointerException("null arguments are not allowed.");
        if ((stringToFind.equals("")) || (str.equals("")))
            throw new IllegalArgumentException("String arguments must be non-empty.");
        int position = startPos;
        // while we are not at the end of the String and the stringToFind is not found
        while (position != str.length())
        {
            // find the first character in the string
            position = str.indexOf(stringToFind.charAt(0), position+1);
            // if found
            if (position != -1)
            {
                int j = 0;
                // search the other characters in str for the other characters in stringToFind
                // while there is a character in str that matches its respective character in
                // stringToFind and we are not at the end of either str,stringToFind
                int firstCharacterPosition = -1;
                while ((str.charAt(position) == stringToFind.charAt(j)) &&
                        ((position < str.length()) && (j < stringToFind.length())))
                {
                    if (firstCharacterPosition == -1)
                        firstCharacterPosition = position;
                    // compare the next character in str with the next character in stringToFind
                    // if the characters match and the characters being matched is the last
                    // character in stringToFind
                    if ((str.charAt(++position) == stringToFind.charAt(++j)) &&
                            (j == stringToFind.length() - 1))
                        // we are done here
                        return firstCharacterPosition;
                }
            }
            else break;
        }
        return -1;
    }

    public static String getSubstringOf(String str, String subString, int pos)
    {
        // first, make a call to find(subString, pos, str)
        int result = SearchResultsHTMLParser.find(subString, pos, str);
        // return the substring if it exists, that is if find() != -1
        return (result == -1) ? "" : subString;
    }
}
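What I am considering doing
I am considering converting the large String into a large char[] and then rewriting my function to work through it chunk by chunk. I am planning this because I think the error comes from the sheer size of the string: it is the HTML of a Google Images search results page, which is hundreds of thousands of characters long.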
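Before rewriting anything, the size hypothesis can be tested with a small experiment like the sketch below. It is not part of the class above: `LargeStringSearch` and `buildLargeString` are made-up names, and the filler text only stands in for real page HTML. A Java String can hold far more than a few hundred thousand characters, so if length alone were the cause, a search over a string this size should fail too.

```java
class LargeStringSearch {
    // Builds a string of filler characters with the marker appended at the very end.
    // (Hypothetical helper, used only to simulate a page of several hundred thousand characters.)
    static String buildLargeString(int fillerLength, String marker) {
        StringBuilder sb = new StringBuilder(fillerLength + marker.length());
        for (int i = 0; i < fillerLength; i++) {
            sb.append('x');
        }
        sb.append(marker);
        return sb.toString();
    }

    public static void main(String[] args) {
        String marker = "<div id=\"rg_s\"";
        String html = buildLargeString(500_000, marker);

        // The marker is present, so its position is printed.
        System.out.println(html.indexOf(marker));

        // This marker is absent, so indexOf returns -1 instead of throwing.
        System.out.println(html.indexOf("<div id=\"missing\""));
    }
}
```

If both searches complete without an exception, the length of the string is probably not the problem, and the index values being passed to charAt() and substring() are the next thing to check.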
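Answer 0 (score: 0)
I found the error: for some reason, the element I was searching for does not exist in the HTML that my Java code fetches. It had nothing to do with my function, and my function happens to do the same thing as an existing method that I did not know about (until now): http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#indexOf(java.lang.String,int)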
System.out.println("Thanks, Kick Buttowski!!");
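For anyone who hits the same exception, here is a minimal sketch of the same kind of lookup using String.indexOf(String, int). The class name `IndexOfSketch`, the helper `between`, and the sample HTML are made up for illustration and are not the original code. The part that matters is checking for -1 before using the result: as far as I can tell, calling charAt() or substring() with an index of -1 is what raises StringIndexOutOfBoundsException when the element is missing.

```java
class IndexOfSketch {
    // Returns the text from the first occurrence of startMarker (searching from fromIndex)
    // up to, but not including, the next occurrence of endMarker.
    // Returns "" when either marker is missing, instead of passing -1 on to substring().
    static String between(String html, String startMarker, String endMarker, int fromIndex) {
        // indexOf treats a negative fromIndex as 0, so a starting value of -1 is harmless here.
        int start = html.indexOf(startMarker, fromIndex);
        if (start == -1) {
            return "";
        }
        int end = html.indexOf(endMarker, start + startMarker.length());
        if (end == -1) {
            return "";
        }
        return html.substring(start, end);
    }

    public static void main(String[] args) {
        // Made-up sample markup; a real results page is much larger.
        String html = "<body><div id=\"rg_s\"><img src=\"a.png\"></div><div jsl=\"x\"></div></body>";

        // Marker present: prints the div up to the start of the next "<div jsl=" container.
        System.out.println(between(html, "<div id=\"rg_s\"", "<div jsl=", 0));

        // Marker absent: prints true, because the helper returns an empty string.
        System.out.println(between(html, "<div id=\"missing\"", "<div jsl=", 0).isEmpty());
    }
}
```

Returning an empty string mirrors what getUserQuery() does in its catch block. Returning null or throwing a more descriptive exception would work equally well, depending on how the caller wants to handle a missing element.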