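I was recently advised to use JSoup to parse and modify HTML documents.
But if I have an HTML document that I want to modify (to send it, store it somewhere else, and so on), how can I do that without changing the original document?
Suppose I have an HTML file like this: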
<html>
<head></head>
<body>
<p></p>
<h2>Title: title</h2>
<p></p>
<p>Name: </p>
<p>Address: </p>
<p>Phone Number: </p>
</body>
</html>
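I want to fill in the name, address, phone number, and any other information I choose with the corresponding data, without modifying the original HTML file. How can I do this with JSoup?
Answer 0 (score: 1)
A possibly simpler solution is to modify the template so that it contains placeholders, like this: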
<html>
<head></head>
<body>
<p></p>
<h2>Title: title</h2>
<p></p>
<p>Name: <span id="name"></span></p>
<p>Address: <span id="address"></span></p>
<p>Phone Number: <span id="phone"></span></p>
</body>
</html>
Then load your document and fill in the placeholders this way:
Document doc = Jsoup.parse("" +
"<html>\n" +
" <head></head>\n" +
" <body> \n" +
" <p></p>\n" +
" <h2>Title: title</h2>\n" +
" <p></p>\n" +
" <p>Name: <span id=\"name\"></span></p>\n" +
" <p>Address: <span id=\"address\"></span></p>\n" +
" <p>Phone Number: <span id=\"phone\"></span></p>\n" +
" </body>\n" +
"</html>");
doc.getElementById("name").text("Andrey");
doc.getElementById("address").text("Stackoverflow.com");
doc.getElementById("phone").text("secret!");
System.out.println(doc.html());
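This fills in the form. Because Jsoup.parse builds the Document in memory, the template you parsed is left unchanged.
Answer 1 (score: 0)
@MarcoS has an excellent solution at https://stackoverflow.com/a/6594828/1861357 that uses NodeTraversor to build a list of nodes to change. I only modified his approach slightly, so that each node (a set of tags) is replaced with the node's own data plus whatever information you want to add.
To keep the HTML in memory as a String, I use a static StringBuilder.
First we read the HTML file (the path is specified manually and can be changed), then we run a series of checks to change any node that contains the data we are looking for.
One problem in MarcoS's solution that I did not solve is that it splits the text into individual words instead of looking at the whole line. I simply use '-' to join multi-word labels; otherwise the string would be inserted directly after the matching word.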
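To see why multi-word labels need the hyphen, it may help to look at what the trim-and-split step produces. The small sketch below uses only the standard library; the class name SplitDemo and its words helper are hypothetical and only mirror the tokenizing lines of buildElementForText in the implementation.

```java
import java.util.Arrays;
import java.util.List;

class SplitDemo {
    // Mirrors the tokenizing step: trim the text, then split on single spaces.
    static List<String> words(String text) {
        return Arrays.asList(text.trim().split(" "));
    }

    public static void main(String[] args) {
        System.out.println(words("Phone Number: "));  // prints [Phone, Number:]
        System.out.println(words("Phone-Number: "));  // prints [Phone-Number:]
    }
}
```

Neither "Phone" nor "Number:" contains "Phone-Number" or "PhoneNumber", so a template that says "Phone Number:" would be left unchanged. That is why the label is hyphenated in the template used here.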
So, the complete implementation:
import java.util.*;
import org.jsoup.Jsoup;
import org.jsoup.nodes.*;
import org.jsoup.select.*;
import java.io.*;
public class memoryHTML
{
static String htmlLocation = "C:\\Users\\User\\";
static String fileName = "blah"; // Just for demonstration, easily modified.
static StringBuilder buildTmpHTML = new StringBuilder();
static StringBuilder buildHTML = new StringBuilder();
static String name = "John Doe";
static String address = "42 University Dr., Somewhere, Someplace";
static String phoneNumber = "(123) 456-7890";
public static void main(String[] args)
{
// You can send it the full path with the filename. I split them up because I used this for multiple files.
readHTML(htmlLocation, fileName);
modifyHTML();
System.out.println(buildHTML.toString());
// You need to clear the StringBuilder Object or it will remain in memory and build on each run.
buildTmpHTML.setLength(0);
buildHTML.setLength(0);
System.exit(0);
}
// Simply parse and build a StringBuilder for a temporary HTML file that will be modified in modifyHTML()
public static void readHTML(String directory, String fileName)
{
try
{
BufferedReader br = new BufferedReader(new FileReader(directory + fileName + ".html"));
String line;
while((line = br.readLine()) != null)
{
buildTmpHTML.append(line);
}
br.close();
}
catch (Exception e)
{
e.printStackTrace();
System.exit(1);
}
}
// Excellent method of parsing and modifying nodes in HTML files by @MarcoS at https://stackoverflow.com/a/6594828/1861357
// It has its small problems, but it does the trick.
public static void modifyHTML()
{
String htmld = buildTmpHTML.toString();
Document doc = Jsoup.parse(htmld);
final List<TextNode> nodesToChange = new ArrayList<TextNode>();
NodeTraversor nd = new NodeTraversor(new NodeVisitor()
{
@Override
public void tail(Node node, int depth)
{
if (node instanceof TextNode)
{
TextNode textNode = (TextNode) node;
nodesToChange.add(textNode);
}
}
@Override
public void head(Node node, int depth)
{
}
});
nd.traverse(doc.body());
for (TextNode textNode : nodesToChange)
{
Node newNode = buildElementForText(textNode);
textNode.replaceWith(newNode);
}
buildHTML.append(doc.html());
}
private static Node buildElementForText(TextNode textNode)
{
String text = textNode.getWholeText();
String[] words = text.trim().split(" ");
Set<String> units = new HashSet<String>();
for (String word : words)
units.add(word);
String newText = text;
for (String rpl : units)
{
if(rpl.contains("Name"))
newText = newText.replaceAll(rpl, "" + rpl + " " + name);
if(rpl.contains("Address") || rpl.contains("Residence"))
newText = newText.replaceAll(rpl, "" + rpl + " " + address);
if(rpl.contains("Phone-Number") || rpl.contains("PhoneNumber"))
newText = newText.replaceAll(rpl, "" + rpl + " " + phoneNumber);
}
return new DataNode(newText, textNode.baseUri());
}
}
You will get this HTML (remember that I changed "Phone Number" to "Phone-Number"):
<html>
<head></head>
<body>
<p></p>
<h2>Title: title</h2>
<p></p>
<p>Name: John Doe </p>
<p>Address: 42 University Dr., Somewhere, Someplace</p>
<p>Phone-Number: (123) 456-7890</p>
</body>
</html>
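The program above only prints the result. If the filled-in HTML also needs to be stored somewhere else, one option is to write it to a separate path so that the template file is only ever read. This is a minimal sketch using only java.nio; TemplateCopy, fillToNewFile, and the edit parameter are hypothetical names, and the string replacement in main merely stands in for whatever jsoup-based change you actually apply.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.UnaryOperator;

class TemplateCopy {
    // Reads the template, applies `edit` to the in-memory text, and writes the
    // result to a separate output path. The template file is never written to.
    static String fillToNewFile(Path template, Path output, UnaryOperator<String> edit)
            throws IOException {
        // Reading all bytes keeps the original line breaks intact.
        String html = new String(Files.readAllBytes(template), StandardCharsets.UTF_8);
        String filled = edit.apply(html);
        Files.write(output, filled.getBytes(StandardCharsets.UTF_8));
        return filled;
    }

    public static void main(String[] args) throws IOException {
        // Temporary files are used here only so the sketch runs anywhere.
        Path template = Files.createTempFile("template", ".html");
        Path output = Files.createTempFile("filled", ".html");
        Files.write(template, "<p>Name: </p>\n".getBytes(StandardCharsets.UTF_8));

        // Hypothetical edit; a jsoup-based modification would go here instead.
        fillToNewFile(template, output, s -> s.replace("Name: ", "Name: John Doe"));

        System.out.print(new String(Files.readAllBytes(output), StandardCharsets.UTF_8));
    }
}
```

Passing the edit as a function keeps the file handling separate from the HTML manipulation, so the same helper can be reused with either of the approaches above.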