Parsing data from a web page when some divs are missing

I want to parse data from a web page that provides the following content:
<div class="InseratDaten">
    <div class="Art">Rent</div>
    <div class="Ort">TestCity 3., Roads Street</div>
    <div class="Preis"><span class='Label'>Miete:</span> 950 EUR</div>
    <div class="Groesse"><span class='Label'>Fläche:</span> 72 m²</div>
    <div class="Zimmer"><span class='Label'>Zimmer:</span> 3</div>
</div>
However, sometimes the structure looks different, for example:
<div class="InseratDaten">
    <div class="Art">Rent</div>
    <div class="Ort">Test 3., Road Street</div>
    <div class="Preis"><span class='Label'>Miete:</span> 919 EUR</div>
    <div class="Groesse"><span class='Label'>Fläche:</span> 84 m²</div>
    <div class="Zimmer"><span class='Label'>Zimmer:</span> 3</div>
    <div class="EigTitel">weitere Eigenschaften:</div>
    <div class='EigListe'>Shower, Balcony, Dog</div>
</div>
or
<div class="InseratDaten">
    <div class="Art">Sale</div>
    <div class="Ort">Test 4., Road Street</div>
    <div class="Preis"><span class='Label'>Miete:</span> 919 EUR</div>
    <div class="Groesse"><span class='Label'>Fläche:</span> 84 m²</div>
</div>
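As you can see, a listing is sometimes extended by extra elements such as <div class="EigTitel"> and <div class='EigListe'>, and sometimes elements such as <div class="Zimmer"> are missing.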
Currently I am parsing my data like this:
if (page.getParseData() instanceof HtmlParseData) {
    HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
    String html = htmlParseData.getHtml();
    Document doc = Jsoup.parseBodyFragment(html);
    Elements title = doc.select("div[class=Title]");
    Elements art = doc.select("div[class=Art]");
    Elements location = doc.select("div[class=Ort]");
    Elements price = doc.select("div[class=Preis]");
    Elements size = doc.select("div[class=Groesse]");
    Elements numberOfRooms = doc.select("div[class=Zimmer]");
    Elements furtherProperties = doc.select("div[class=EigListe]");
    /**
     * get each element as List
     */
    if (!(art.isEmpty()) && !(location.isEmpty()) && !(title.isEmpty()) && !(price.isEmpty())) {
        //iterate over art cause all elems have the same size
        titleList = new ArrayList<String>();
        artList = new ArrayList<String>();
        locationList = new ArrayList<String>();
        priceList = new ArrayList<String>();
        sizeList = new ArrayList<String>();
        numberOfRoomsList = new ArrayList<String>();
        furtherPropertiesList = new ArrayList<String>();
        //price
        for (Element element : price) {
            priceList.add(element.text().toString());
        }
        //size
        for (Element element : size) {
            sizeList.add(element.text().toString());
        }
        //numberOfRooms
        for (Element element : numberOfRooms) {
            numberOfRoomsList.add(element.text().toString());
        }
        //furtherProperties
        for (Element element : furtherProperties) {
            furtherPropertiesList.add(element.text().toString());
        }
        //location
        for (Element element : location) {
            locationList.add(element.text().toString());
        }
        //art
        for (Element element : art) {
            artList.add(element.text().toString());
        }
        //title
        for (Element element : title) {
            titleList.add(element.text().toString());
        }
        log.info(ListstoString());
        //add everything to the main domain List
        for (int i = 0; i < locationList.size(); i++) {
            Property prop = new Property();
            //price
            prop.setPrice(priceList.get(i));
            //size
            prop.setSize(sizeList.get(i));
            //number of rooms
            prop.setNumberOfRooms(numberOfRoomsList.get(i));
            //furtherProperties
            prop.setFurtherProperties(furtherPropertiesList.get(i));
            //location
            prop.setLocation(locationList.get(i));
            //art
            prop.setTransactionType(artList.get(i));
            //title
            prop.setTitle(titleList.get(i));
            //set date
            prop.setCrawlingDate(new Date());
            list.add(prop);
        }
        log.info(list.toString());
    }
}
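My problem is that in some cases my lists end up with different lengths, because data can be missing from a listing. As a result I get an error when I read them by index; the list sizes are: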
[sizeList=16, priceList=16, locationList=16, numberOfRoomsList=12, furtherPropertiesList=12]
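I would like to put null elements in the places where a div does not have these properties, so that my data stays consistent. I guess this has to be done with jsoup, by making it put null elements there? Any ideas how to implement this?
I really appreciate your answers!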
Answer 0 (score: -1)
You can create lists of a predefined size, pre-filled with null, like this:
titleList = Arrays.asList(new String[locationList.size()]);
Then use the index when setting the elements:
for (int i = 0; i < title.size(); i++) {
    titleList.set(i, title.get(i).text().toString());
}
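
One caveat with pre-sized lists: the values are still filled in document order, so a missing `Zimmer` div in one listing shifts every later value up by one position. A possible alternative, sketched below, is to loop over each `div.InseratDaten` container and query the fields inside that container only, so a missing div becomes null for exactly that listing. This assumes every listing is wrapped in its own `div.InseratDaten`, as in the question's markup, and that jsoup 1.11.1 or later is on the classpath (for `selectFirst`). The names `ListingParser`, `Listing` and `textOrNull` are hypothetical and only serve the illustration; in the question's code the `Property` class would take the place of `Listing`.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ListingParser {

    /** Hypothetical holder for one listing; a field stays null when its div is absent. */
    static class Listing {
        String art;
        String location;
        String price;
        String size;
        String numberOfRooms;
        String furtherProperties;
    }

    /** Text of the first match inside the given container, or null if nothing matches. */
    static String textOrNull(Element container, String cssQuery) {
        Element match = container.selectFirst(cssQuery);
        return match == null ? null : match.text();
    }

    /** Builds one Listing per div.InseratDaten, querying only inside that container. */
    static List<Listing> parse(String html) {
        Document doc = Jsoup.parse(html);
        List<Listing> result = new ArrayList<>();
        for (Element container : doc.select("div.InseratDaten")) {
            Listing listing = new Listing();
            listing.art = textOrNull(container, "div.Art");
            listing.location = textOrNull(container, "div.Ort");
            listing.price = textOrNull(container, "div.Preis");
            listing.size = textOrNull(container, "div.Groesse");
            listing.numberOfRooms = textOrNull(container, "div.Zimmer");
            listing.furtherProperties = textOrNull(container, "div.EigListe");
            result.add(listing);
        }
        return result;
    }

    public static void main(String[] args) {
        // Two listings: the second one has no Zimmer and no EigListe div.
        String html =
              "<div class=\"InseratDaten\">"
            + "<div class=\"Art\">Rent</div>"
            + "<div class=\"Zimmer\">3</div>"
            + "<div class='EigListe'>Shower, Balcony, Dog</div>"
            + "</div>"
            + "<div class=\"InseratDaten\">"
            + "<div class=\"Art\">Sale</div>"
            + "</div>";
        for (Listing listing : parse(html)) {
            System.out.println(listing.art + " | rooms=" + listing.numberOfRooms
                + " | extras=" + listing.furtherProperties);
        }
    }
}
```

With this approach there is one object per listing, so the parallel lists and the index-based loop are no longer needed, and the field lengths cannot drift apart.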