Question

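I have basic knowledge of C and Java. I have to create a Java project that reads HTML files of the following form.

The file is HTML, and the information I want is inside the <pre> tag. Every file has the same layout; the contents look like this: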

<html>
<pre>


Extraction of Roots by Repeated Subtractions for Digital Computers<-- I want to take this line as the title

CACM December, 1958

Sugai, I. <--- and this line

CA581202 JB March 22, 1978  8:29 PM

2   5   2
2   5   2
2   5   2

</pre>
</html>

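I want to extract only the title and the author, when the file contains them.

I wrote this code, but I can't get the author. I get useless information instead.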

StringBuilder builder = new StringBuilder();
Element link;
String text,str,name,title,name2=null; 
Document doc;
File in = new File("path");
doc = Jsoup.parse(in, null);
link = doc.select("pre").first();
text = doc.body().text();
String []lines = text.split("[\r\n]+");
for (String string : lines) {
    if (builder.length() > 0) {
        builder.append(" ");
    }
    builder.append(string);
}   
str = builder.toString();
String[] strings = str.split(",");
title=strings[0];
name=strings[2];

Answer 1

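If all of your files have the same format, you can do the following. After running getTxt, you can simply access the 3rd and 5th elements of the returned list. Alternatively, you can parse the file: grab everything between <pre> and the date line, then grab what lies between the date line and the line that starts with some form of CA (here, CA581202 JB March 22, 1978 8:29 PM).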

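import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;

// Reads the resource at urlString and returns its lines in order.
static public ArrayList<String> getTxt(String urlString) {
    ArrayList<String> list = new ArrayList<String>();
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader in = new BufferedReader(
            new InputStreamReader(new URL(urlString).openStream()))) {
        String str;
        while ((str = in.readLine()) != null) {
            // str is one line of text; readLine() strips the newline character(s)
            list.add(str);
        }
    } catch (IOException e) {
        // MalformedURLException is a subclass of IOException, so this covers both.
        // Report the problem instead of silently returning a partial list.
        e.printStackTrace();
    }
    return list;
}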
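
The second approach mentioned in this answer (take what lies between <pre> and the date line, then what lies between the date line and the CA line) could be sketched roughly as follows. This is only an illustration, not a tested solution for the whole collection: the class name TitleAuthorExtractor and the method extract are hypothetical, and the code assumes that the date line always starts with "CACM " and that the record line starts with "CA" followed by digits, as in the sample file. Any other non-blank line before the date line is treated as part of the title.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the original answer.
class TitleAuthorExtractor {

    // Returns {title, author}; an entry is "" when the file has no such line.
    static String[] extract(List<String> lines) {
        List<String> title = new ArrayList<>();
        List<String> author = new ArrayList<>();
        int state = 0; // 0 = before <pre>, 1 = collecting title, 2 = collecting author
        for (String raw : lines) {
            String line = raw.trim();
            if (state == 0) {
                // Skip everything up to and including the opening <pre> tag
                if (line.equalsIgnoreCase("<pre>")) {
                    state = 1;
                }
            } else if (line.matches("CA\\d+\\b.*") || line.equalsIgnoreCase("</pre>")) {
                // Assumed record line such as "CA581202 JB March 22, 1978  8:29 PM"
                break;
            } else if (state == 1 && line.startsWith("CACM ")) {
                // Assumed date line such as "CACM December, 1958"
                state = 2;
            } else if (!line.isEmpty()) {
                (state == 1 ? title : author).add(line);
            }
        }
        return new String[] { String.join(" ", title), String.join(" ", author) };
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "<html>", "<pre>", "", "",
            "Extraction of Roots by Repeated Subtractions for Digital Computers",
            "", "CACM December, 1958", "", "Sugai, I.", "",
            "CA581202 JB March 22, 1978  8:29 PM", "",
            "2   5   2", "</pre>", "</html>");
        String[] result = extract(lines);
        System.out.println("Title:  " + result[0]);
        System.out.println("Author: " + result[1]);
    }
}
```

In a real program the list passed to extract would be the one returned by getTxt. Matching on the marker lines rather than on fixed positions means extra blank lines, or a missing author, should not shift the result.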

I can't take only a part of the string.
