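I have the following HTML, which I am trying to parse into objects in Java using jsoup.
I want to iterate over the elements and extract every "class" as an object in order to build timetable data. Each "class" has a time, location, lecturer, description and so on, but that part is not the problem.
All of the elements sit under the tt_details class. There is no parent-child relationship between a day and its classes, although I can get the days with Elements dayNames = content.getElementsByClass("tt_day");
Each day can have a different number of "classes": as you can see, Monday has 3 and Tuesday has 4, so a plain fixed-size loop won't work. How can I achieve this?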
<div class='tt_details'>
<div class='tt_day'>Mon</div>
</div>
<div class='tt_details'>
<div class='tt_timeslot'>11:00 - 13:00
<div class='tt_day_small'> (Mon)</div>
</div>
<div class='tt_detail'>Internet of Things<br/>E1010 - MAC Lab <br/></div>
<div class='tt_lecturer'>Loftus, M</div>
</div>
<div class='tt_details'>
<div class='tt_timeslot'>13:00 - 14:00
<div class='tt_day_small'> (Mon)</div>
</div>
<div class='tt_detail'>Computer Systems & Networking<br/>A0004 - Tiered Lecture Theatre (132) <br/></div>
<div class='tt_lecturer'>Lang, D</div>
</div>
<div class='tt_details'>
<div class='tt_timeslot'>16:00 - 18:00
<div class='tt_day_small'> (Mon)</div>
</div>
<div class='tt_detail'>Intro.to Programming L8<br/>D2005 - Computer Laboratory (32) <br/></div>
<div class='tt_lecturer'>Kinsella,V</div>
</div>
<div class='tt_details'>
<div class='tt_day'>Tue</div>
</div>
<div class='tt_details'>
<div class='tt_timeslot'>09:00 - 10:00
<div class='tt_day_small'> (Tue)</div>
</div>
<div class='tt_detail'>Mathematics 2<br/>A0004 - Tiered Lecture Theatre (132) <br/></div>
<div class='tt_lecturer'>O'Regan,D</div>
</div>
<div class='tt_details'>
<div class='tt_timeslot'>10:00 - 11:00
<div class='tt_day_small'> (Tue)</div>
</div>
<div class='tt_detail'>Mathematics 2<br/>E0017 - Tiered Classroom (106) <br/></div>
<div class='tt_lecturer'>O'Regan,D</div>
</div>
<div class='tt_details'>
<div class='tt_timeslot'>11:00 - 12:00
<div class='tt_day_small'> (Tue)</div>
</div>
<div class='tt_detail'>Intro to Programming<br/>A0006 - Tiered Lecture Theatre (152) <br/></div>
<div class='tt_lecturer'>Kinsella,V</div>
</div>
<div class='tt_details'>
<div class='tt_timeslot'>16:00 - 17:00
<div class='tt_day_small'> (Tue)</div>
</div>
<div class='tt_detail'>Computer Systems & Networking<br/>A0006 - Tiered Lecture Theatre (152) <br/></div>
<div class='tt_lecturer'>Lang, D</div>
</div>
Answer 0 (score: 2)
Something like this might help:
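// Needs these imports at the top of the file:
// org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element, org.jsoup.select.Elements,
// java.util.ArrayList, java.util.HashMap, java.util.List, java.util.Map
String html = ""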
+"<div class='tt_details'>"
+" <div class='tt_day'>Mon</div>"
+"</div>"
+"<div class='tt_details'>"
+" <div class='tt_timeslot'>11:00 - 13:00"
+" <div class='tt_day_small'> (Mon)</div>"
+" </div>"
+" <div class='tt_detail'>Internet of Things<br/>E1010 - MAC Lab <br/></div>"
+" <div class='tt_lecturer'>Loftus, M</div>"
+"</div>"
+"<div class='tt_details'>"
+" <div class='tt_timeslot'>13:00 - 14:00"
+" <div class='tt_day_small'> (Mon)</div>"
+" </div>"
+" <div class='tt_detail'>Computer Systems & Networking<br/>A0004 - Tiered Lecture Theatre (132) <br/></div>"
+" <div class='tt_lecturer'>Lang, D</div>"
+"</div>"
+"<div class='tt_details'>"
+" <div class='tt_timeslot'>16:00 - 18:00"
+" <div class='tt_day_small'> (Mon)</div>"
+" </div>"
+" <div class='tt_detail'>Intro.to Programming L8<br/>D2005 - Computer Laboratory (32) <br/></div>"
+" <div class='tt_lecturer'>Kinsella,V</div>"
+"</div>"
+"<div class='tt_details'>"
+" <div class='tt_day'>Tue</div>"
+"</div>"
+"<div class='tt_details'>"
+" <div class='tt_timeslot'>09:00 - 10:00"
+" <div class='tt_day_small'> (Tue)</div>"
+" </div>"
+" <div class='tt_detail'>Mathematics 2<br/>A0004 - Tiered Lecture Theatre (132) <br/></div>"
+" <div class='tt_lecturer'>O'Regan,D</div>"
+"</div>"
+"<div class='tt_details'>"
+" <div class='tt_timeslot'>10:00 - 11:00"
+" <div class='tt_day_small'> (Tue)</div>"
+" </div>"
+" <div class='tt_detail'>Mathematics 2<br/>E0017 - Tiered Classroom (106) <br/></div>"
+" <div class='tt_lecturer'>O'Regan,D</div>"
+"</div>"
+"<div class='tt_details'>"
+" <div class='tt_timeslot'>11:00 - 12:00"
+" <div class='tt_day_small'> (Tue)</div>"
+" </div>"
+" <div class='tt_detail'>Intro to Programming<br/>A0006 - Tiered Lecture Theatre (152) <br/></div>"
+" <div class='tt_lecturer'>Kinsella,V</div>"
+"</div>"
+"<div class='tt_details'>"
+" <div class='tt_timeslot'>16:00 - 17:00"
+" <div class='tt_day_small'> (Tue)</div>"
+" </div>"
+" <div class='tt_detail'>Computer Systems & Networking<br/>A0006 - Tiered Lecture Theatre (152) <br/></div>"
+" <div class='tt_lecturer'>Lang, D</div>"
+"</div>"
;
Document doc = Jsoup.parse(html);
Elements courseEls = doc.select("div.tt_details:not(:has(div.tt_day))");
class Course{
public Course(String day, String time, String lecturer, String subject) {
super();
this.day = day;
this.time = time;
this.lecturer = lecturer;
this.subject = subject;
}
public String day;
public String time;
public String lecturer;
public String subject;
public String toString(){
return day + " : "+ time +" : "+ lecturer + " : "+ subject;
}
}
Map<String,List<Course>> coursesByDay = new HashMap<>();
for (Element courseEl : courseEls){
Element timeSlotEl = courseEl.select(".tt_timeslot").first();
String timeSlotStr = timeSlotEl.ownText();
String dayStr = timeSlotEl.select(".tt_day_small").first().text().trim().replace("(", "").replace(")", "");
String detailStr = courseEl.select(".tt_detail").first().text();
String lecturerStr = courseEl.select(".tt_lecturer").first().text();
Course course = new Course(dayStr, timeSlotStr, lecturerStr, detailStr);
List<Course> courses = coursesByDay.get(dayStr);
if (courses == null){
courses = new ArrayList<>();
coursesByDay.put(dayStr, courses);
}
courses.add(course);
}
//get all courses on Tue
List<Course> courses = coursesByDay.get("Tue");
for (Course c : courses){
System.out.println(c);
}
This builds a map of courses grouped by day: each key is a day name, and its value is the list of Course objects for that day.
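The null check used to fill the map can also be written with Map.computeIfAbsent (Java 8+). A minimal sketch of that idiom, with two assumptions that are not in the answer: Course is cut down to a hypothetical two-field version, and LinkedHashMap replaces HashMap so the days come back in the order they first appear.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GroupByDay {

    /** Hypothetical, cut-down stand-in for the Course class (day and subject only). */
    static final class Course {
        final String day;
        final String subject;

        Course(String day, String subject) {
            this.day = day;
            this.subject = subject;
        }

        @Override
        public String toString() {
            return day + " : " + subject;
        }
    }

    /** Groups courses by day, keeping days in the order they are first seen. */
    static Map<String, List<Course>> groupByDay(List<Course> courses) {
        // LinkedHashMap preserves insertion order; a plain HashMap does not.
        Map<String, List<Course>> byDay = new LinkedHashMap<>();
        for (Course c : courses) {
            // Creates the list the first time a day is seen, then appends to it.
            byDay.computeIfAbsent(c.day, k -> new ArrayList<>()).add(c);
        }
        return byDay;
    }

    public static void main(String[] args) {
        List<Course> courses = Arrays.asList(
                new Course("Mon", "Internet of Things"),
                new Course("Tue", "Mathematics 2"),
                new Course("Mon", "Intro.to Programming L8"));
        groupByDay(courses).forEach((day, list) -> System.out.println(day + " -> " + list));
    }
}
```

Because the map keeps insertion order, iterating it walks the days in timetable order rather than in hash order.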
A few remarks on this selector:
div.tt_details:not(:has(div.tt_day))
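It selects only the course divs and skips the day-header divs. That works because the day is repeated inside every course div (in tt_day_small).
Answer 1 (score: 1)
Try this: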
static final String[] DETAILS = { "tt_timeslot", "tt_day_small", "tt_detail", "tt_lecturer" };
and
Document doc = Jsoup.parse(html);
String day = null;
for (Element e : doc.select("div.tt_details")) {
Elements days = e.select("div.tt_day");
if (days.size() > 0) {
day = days.get(0).text();
System.out.printf(" *** %s ***%n", day);
} else {
System.out.printf(" --------%n");
for (String cls : DETAILS) {
Elements elements = e.select("div." + cls);
if (elements.size() > 0)
System.out.printf("%24s : %s%n", cls, elements.get(0).text());
}
}
}
Result:
*** Mon ***
--------
tt_timeslot : 11:00 - 13:00 (Mon)
tt_day_small : (Mon)
tt_detail : Internet of Things E1010 - MAC Lab
tt_lecturer : Loftus, M
--------
tt_timeslot : 13:00 - 14:00 (Mon)
tt_day_small : (Mon)
tt_detail : Computer Systems & Networking A0004 - Tiered Lecture Theatre (132)
tt_lecturer : Lang, D
--------
tt_timeslot : 16:00 - 18:00 (Mon)
tt_day_small : (Mon)
tt_detail : Intro.to Programming L8 D2005 - Computer Laboratory (32)
tt_lecturer : Kinsella,V
*** Tue ***
--------
tt_timeslot : 09:00 - 10:00 (Tue)
tt_day_small : (Tue)
tt_detail : Mathematics 2 A0004 - Tiered Lecture Theatre (132)
tt_lecturer : O'Regan,D
--------
tt_timeslot : 10:00 - 11:00 (Tue)
tt_day_small : (Tue)
tt_detail : Mathematics 2 E0017 - Tiered Classroom (106)
tt_lecturer : O'Regan,D
--------
tt_timeslot : 11:00 - 12:00 (Tue)
tt_day_small : (Tue)
tt_detail : Intro to Programming A0006 - Tiered Lecture Theatre (152)
tt_lecturer : Kinsella,V
--------
tt_timeslot : 16:00 - 17:00 (Tue)
tt_day_small : (Tue)
tt_detail : Computer Systems & Networking A0006 - Tiered Lecture Theatre (152)
tt_lecturer : Lang, D
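The loop above only prints, but the same "remember the last day header" idea can collect the entries per day. A rough sketch under stated assumptions: Block is a hypothetical flattened view of one tt_details div (either a day header or a class entry) that you would fill from jsoup inside the loop, and the jsoup calls themselves are left out so the example runs with the JDK alone.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DayCarryForward {

    /**
     * Hypothetical flattened view of one tt_details block: a day header
     * (dayHeader != null) or a class entry (dayHeader == null).
     */
    static final class Block {
        final String dayHeader;
        final String timeslot;
        final String detail;

        private Block(String dayHeader, String timeslot, String detail) {
            this.dayHeader = dayHeader;
            this.timeslot = timeslot;
            this.detail = detail;
        }

        static Block day(String name) {
            return new Block(name, null, null);
        }

        static Block entry(String timeslot, String detail) {
            return new Block(null, timeslot, detail);
        }
    }

    /** Assigns every class entry to the most recent day header seen before it. */
    static Map<String, List<String>> assignDays(List<Block> blocks) {
        Map<String, List<String>> byDay = new LinkedHashMap<>();
        String currentDay = null; // no header seen yet
        for (Block b : blocks) {
            if (b.dayHeader != null) {
                // A header switches the current day; a repeated header keeps its list.
                currentDay = b.dayHeader;
                byDay.putIfAbsent(currentDay, new ArrayList<>());
            } else if (currentDay != null) {
                byDay.get(currentDay).add(b.timeslot + " " + b.detail);
            }
            // Entries that appear before the first header are skipped in this sketch.
        }
        return byDay;
    }

    public static void main(String[] args) {
        List<Block> blocks = Arrays.asList(
                Block.day("Mon"),
                Block.entry("11:00 - 13:00", "Internet of Things"),
                Block.entry("13:00 - 14:00", "Computer Systems & Networking"),
                Block.day("Tue"),
                Block.entry("09:00 - 10:00", "Mathematics 2"));
        assignDays(blocks).forEach((day, entries) -> System.out.println(day + " -> " + entries));
    }
}
```

This variant does not rely on tt_day_small being present in each entry, so it would still work if only the tt_day headers carried the day name.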
Answer 2 (score: 0)
If this HTML comes from a live web page, you can use Selenium for this; you will need to add the Selenium jar to your project.
My suggestion:
String datentime = driver.findElement(By.className("tt_timeslot")).getText();
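If several elements share the same class name, use a unique id, a CSS selector, or an XPath expression instead; findElement returns only the first match, while findElements returns all of them.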