StringIndexOutOfBoundsException on a very large String

Time: 2014-10-15 00:50:07

Tags: java html string

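I am trying to parse an HTML string from a page in my web browser application to find the HTML for the data I want to retrieve. (In short, I am doing some web scraping of Google Images results.)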

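My function find() seems to work fine for reasonably sized strings, but when it is given the full string, the HTML of the page I am trying to parse, it throws a StringIndexOutOfBoundsException. Here is my find() function, along with the function I am calling it from: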

find():

// helper functions
    public static int find(String stringToFind, int startPos, String str) throws NullPointerException,
        IllegalArgumentException
    {
        // make sure that neither argument is null and not an empty String
        if ((stringToFind == null) || (str == null))
            throw new NullPointerException("null arguments are not allowed.");
        if ((stringToFind.equals("")) || (str.equals("")))
            throw new IllegalArgumentException("String arguments must be non-empty.");
        int position = startPos;
        // while we are not at the end of the String and the stringToFind is not found
        while (position != str.length())
        {
            // find the first character in the string
            position = str.indexOf(stringToFind.charAt(0), position+1);
            // if found
            if (position != -1)
            {
                int j = 0;
                // search the other characters in str for the other characters in stringToFind
                // while there is a character in str that matches its respective character in 
                //  stringToFind and we are not at the end of either str,stringToFind
                int firstCharacterPosition = -1;
                while ((str.charAt(position) == stringToFind.charAt(j)) &&
                        ((position < str.length()) && (j < stringToFind.length())))
                {
                    if (firstCharacterPosition == -1)
                        firstCharacterPosition = position;
                    // compare the next character in str with the next character in stringToFind
                    // if the characters match and the characters being matched is the last 
                    //  character in stringToFind
                    if ((str.charAt(++position) == stringToFind.charAt(++j)) &&
                        (j == stringToFind.length() - 1))
                        // we are done here
                        return firstCharacterPosition;
                }
            }
            else break;
        }
        return -1;  
    }

The function that uses find():

public String getUserQuery()
    {
        // find the element in the HTML that starts with "<input id=\"gbqfq\"" and return it
        index = find("<input id=\"gbqfq\"", index, searchPageHTML);
        System.out.printf("index == %d", index);
        try
        {
            return searchPageHTML.substring(index, 
                searchPageHTML.indexOf('>', index));
        }
        catch (IndexOutOfBoundsException outOfBounds)
        {
            return "";
        }
    }

The entire class (what I pass in is a large String of HTML code):

import javax.swing.JFrame;
import javax.swing.JEditorPane;


public class SearchResultsHTMLParser
{
    private String searchPageHTML;
    private int index = -1;
    public SearchResultsHTMLParser(String html)
    {
        this.searchPageHTML = html;
        // setup a test GUI
        JFrame frame = new JFrame("GoogleImageTest");
        JEditorPane editorPane = new JEditorPane("text/html",
            this.getImagesDiv());
        frame.add(editorPane);
        frame.pack();
        frame.setVisible(true);
        frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
    }

    // make methods that parse this.html for the user input and the images
    /* The user input has an HTML id of "gbqfq", and the images all belong to the HTML class 
     * "rg_di". The function for the user input should return the value of the input field as a
     * String, and the function for the images should simply return the substring that has all 
     * of the images in it. (This will be parsed further for each individual image.)
     */
    public String getUserQuery()
    {
        // find the element in the HTML that starts with "<input id=\"gbqfq\"" and return it
        index = find("<input id=\"gbqfq\"", index, searchPageHTML);
        System.out.printf("index == %d", index);
        try
        {
            return searchPageHTML.substring(index, 
                searchPageHTML.indexOf('>', index));
        }
        catch (IndexOutOfBoundsException outOfBounds)
        {
            return "";
        }
    }

    /* This function will get the div with id="rg_s", and will probably not be used */
    public String getImagesDiv()
    {
        System.out.println("index == " + index);
        index = find("<div id=\"rg_s\"", index, searchPageHTML);
        System.out.println("index == " + index);
        System.out.printf("charAt(%d) == %c", index, searchPageHTML.charAt(index));
        String startOfNextDiv = "<div jsl="; 
        int nextDivPos = find(startOfNextDiv, index, searchPageHTML);
        // return the substring of searchPageHTML from the start of the found image div container
        //  to the end of it (it's ok if there is whitespace, so we could go up until the start of
        //  next div container)
        return searchPageHTML.substring(index, nextDivPos);
    }

    // helper functions
    public static int find(String stringToFind, int startPos, String str) throws NullPointerException,
        IllegalArgumentException
    {
        // make sure that neither argument is null and not an empty String
        if ((stringToFind == null) || (str == null))
            throw new NullPointerException("null arguments are not allowed.");
        if ((stringToFind.equals("")) || (str.equals("")))
            throw new IllegalArgumentException("String arguments must be non-empty.");
        int position = startPos;
        // while we are not at the end of the String and the stringToFind is not found
        while (position != str.length())
        {
            // find the first character in the string
            position = str.indexOf(stringToFind.charAt(0), position+1);
            // if found
            if (position != -1)
            {
                int j = 0;
                // search the other characters in str for the other characters in stringToFind
                // while there is a character in str that matches its respective character in 
                //  stringToFind and we are not at the end of either str,stringToFind
                int firstCharacterPosition = -1;
                while ((str.charAt(position) == stringToFind.charAt(j)) &&
                        ((position < str.length()) && (j < stringToFind.length())))
                {
                    if (firstCharacterPosition == -1)
                        firstCharacterPosition = position;
                    // compare the next character in str with the next character in stringToFind
                    // if the characters match and the characters being matched is the last 
                    //  character in stringToFind
                    if ((str.charAt(++position) == stringToFind.charAt(++j)) &&
                        (j == stringToFind.length() - 1))
                        // we are done here
                        return firstCharacterPosition;
                }
            }
            else break;
        }
        return -1;  
    }

    public static String getSubstringOf(String str, String subString, int pos)
    {
        // first, make a call to find(subString, pos, str)
        int result = SearchResultsHTMLParser.find(subString, pos, str);
        // return the substring if it exists, that is if find() != -1
        return (result == -1) ? "" : subString;
    }

}
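
A note on the class above: find() returns -1 when nothing matches, and getImagesDiv() then passes that value straight to charAt() and substring() without a guard, either of which throws StringIndexOutOfBoundsException for a negative index. The following is only a minimal sketch of that failure mode and one way to guard against it. The class name NotFoundGuardDemo, the helper between(), and the sample HTML are hypothetical and are not part of the original code.

```java
class NotFoundGuardDemo {
    // Hypothetical helper: returns the text from the first occurrence of
    // startMarker up to (not including) the next occurrence of endMarker,
    // or "" if either marker is missing.
    static String between(String html, String startMarker, String endMarker) {
        int start = html.indexOf(startMarker);
        if (start == -1) {
            return ""; // marker not present: do not index with -1
        }
        int end = html.indexOf(endMarker, start + startMarker.length());
        if (end == -1) {
            return "";
        }
        return html.substring(start, end);
    }

    public static void main(String[] args) {
        // Sample page that does NOT contain the element being searched for.
        String html = "<html><body><div id=\"other\">x</div></body></html>";

        int index = html.indexOf("<div id=\"rg_s\"");
        System.out.println("index == " + index);

        try {
            // Same pattern as the unguarded charAt(index) call in getImagesDiv().
            html.charAt(index);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("charAt with a not-found index threw: " + e.getClass().getSimpleName());
        }

        // The guarded helper returns an empty string instead of throwing.
        System.out.println("[" + between(html, "<div id=\"rg_s\"", "<div jsl=") + "]");
    }
}
```

Checking for -1 before indexing keeps a missing element from surfacing as an exception that looks like a size problem.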

What I am thinking of doing

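I am thinking of converting the large String to a large char[] and then rewriting my function to work through it chunk by chunk. I am planning this because I believe the error is caused by the sheer size of the String: it is the HTML of a Google Images search results page, which is hundreds of thousands of characters long.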
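
Before rewriting anything around char[], it may be worth checking whether size is really the problem. String.length() returns an int, so a String of a few hundred thousand characters is well within what the type can represent. The sketch below builds a synthetic 500,000-character string (the filler content and the class name LargeStringDemo are assumptions for illustration, not real Google HTML) and searches it with the standard indexOf.

```java
class LargeStringDemo {
    // Builds a string of fillerLength filler characters followed by marker.
    static String buildLargeHtml(int fillerLength, String marker) {
        StringBuilder sb = new StringBuilder(fillerLength + marker.length());
        for (int i = 0; i < fillerLength; i++) {
            sb.append('x');
        }
        sb.append(marker);
        return sb.toString();
    }

    public static void main(String[] args) {
        String marker = "<input id=\"gbqfq\"";
        String html = buildLargeHtml(500_000, marker);

        // Both calls work on the whole string; no chunking is needed.
        System.out.println("length == " + html.length());
        System.out.println("indexOf == " + html.indexOf(marker));
    }
}
```

If a search like this succeeds on a string of comparable length, the exception is more likely to come from an index that is out of range (for example a -1 "not found" result) than from the size of the input.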

1 answer:

Answer 0 (score: 0)

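I found the error: for some reason, the element I was searching for does not exist in the HTML that my Java code retrieves. It had nothing to do with my function, and my function happens to do the same thing as an existing method that I did not know about (until now): http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#indexOf(java.lang.String,int)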

System.out.println("Thanks, Kick Buttowski!!");
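
For readers who want to see the replacement the answer refers to, here is a minimal sketch of delegating to String.indexOf(String, int). The wrapper keeps the argument order of the original find(), but the class name IndexOfDemo and the sample HTML are hypothetical. Two assumptions to note: indexOf treats a negative start index as 0, so the initial index of -1 still searches from the beginning, and this sketch searches from startPos itself, whereas the hand-written find() started at startPos + 1.

```java
class IndexOfDemo {
    // Hypothetical stand-in for the hand-written find(): returns the index of
    // the first occurrence of stringToFind at or after startPos, or -1 if absent.
    static int find(String stringToFind, int startPos, String str) {
        return str.indexOf(stringToFind, startPos);
    }

    // Returns the opening tag that starts with tagStart, including the closing
    // '>', or "" if the tag or its closing '>' cannot be found.
    static String extractTag(String html, String tagStart) {
        int start = find(tagStart, -1, html);
        if (start == -1) {
            return "";
        }
        int end = html.indexOf('>', start);
        if (end == -1) {
            return "";
        }
        return html.substring(start, end + 1);
    }

    public static void main(String[] args) {
        // Made-up sample markup, not actual Google output.
        String html = "<form><input id=\"gbqfq\" value=\"cats\"></form>";

        System.out.println(extractTag(html, "<input id=\"gbqfq\""));
        System.out.println("missing element -> [" + extractTag(html, "<div id=\"rg_s\"") + "]");
    }
}
```

Returning an empty string for a missing element is one possible design; throwing a descriptive exception or returning an Optional would make the "not found" case harder to overlook.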