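I'm working with a web service hosted on an old IIS6 server, and all it returns is HTML: no JSON, no XML. Once I have the HTML, I need to parse the data out of it correctly. The only problem is that the HTML is very messy and not properly formatted.
Here is the service I'm calling, using GET: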
http://www.zefix.ch/WebServices/Zefix/Zefix.asmx/SearchFirm?name=Dedal
It returns HTML like this:
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:ino="http://namespaces.softwareag.com/tamino/response2" xmlns:xql="http://metalab.unc.edu/xql/" xmlns:xq="http://namespaces.softwareag.com/tamino/XQuery/result">
<head>
<META http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>Firmenname=dedal , suche_nach=-, Rechtsform=, Sitz=, Sitz Gemeinde=, Firmennummer=, language=1, phonetisch=no</title>
</head>
<body>
<font face="arial" size="2">
<b>Suche nach Firma: <i>dedal </i></b>
<br />
<b>(10 Suchresultate am 03.12.2015 um 08:30) [Stand: 03.12.2015 235/2015]</b>
<br />Zentraler Firmenindex - Eidgenössisches Amt für das Handelsregister<hr /><b>DEDAL FILMS, Albrecht</b><i> in <a target="_top" href="/info/ger/VS626.htm">Lens</a></i>, Einzelunt., <a target="result" href="/WebServices/Zefix/Zefix.asmx/ShowFirm?parId=1058678&parChnr=CH-626.1.014.253-3&language=1">+</a>, <a target="_blank" href="http://vs.powernet.ch/webservices/inet/HRG/HRG.asmx/getHRGHTML?chnr=6261014253&amt=626&toBeModified=0&validOnly=0&lang=1&sort=0">CHE-150.481.375</a><p />DEDAL TRADING SA in liquidazione<i> in <a target="_top" href="/info/ger/TI501.htm">Mendrisio</a></i>, AG, gelöscht: Publ.Dat. 29.07.2005,
<a target="result" href="/WebServices/Zefix/Zefix.asmx/ShowFirm?parId=537570&parChnr=CH-524.3.009.149-2&language=1">+</a>, <a target="_blank" href="http://ti.powernet.ch/webservices/inet/HRG/HRG.asmx/getHRGHTML?chnr=5243009149&amt=501&toBeModified=0&validOnly=0&lang=1&sort=0">CHE-101.054.476</a>, <a target="_blank" href="http://ti.powernet.ch/webservices/inet/HRG/HRG.asmx/getHRGPDF?chnr=5243009149&amt=501&toBeModified=0&validOnly=0&lang=1&sort=0">PDF</a><p /><b>DEDALE SA</b><i> in <a target="_top" href="/info/ger/VS626.htm">Chermignon</a></i>, AG, <a target="result" href="/WebServices/Zefix/Zefix.asmx/ShowFirm?parId=1139492&parChnr=CH-626.3.014.970-6&language=1">+</a>, <a target="_blank" href="http://vs.powernet.ch/webservices/inet/HRG/HRG.asmx/getHRGHTML?chnr=6263014970&amt=626&toBeModified=0&validOnly=0&lang=1&sort=0">CHE-196.615.628</a><p /><b>Dedale Solutions, Putallaz & Co</b><i> in <a target="_top" href="/info/ger/GE660.htm">Genève</a></i>, Kommanditgesell., <a target="result" href="/WebServices/Zefix/Zefix.asmx/ShowFirm?parId=1049329&parChnr=CH-660.0.412.012-4&language=1">+</a>, <a target="_blank" href="http://ge.ch/hrcintapp/externalCompanyReport.action?companyOfrcId13=CH-660-0412012-4&ofrcLanguage=1">CHE-416.967.677</a><p />Dedalo Promotion Limited Liability Company, Cheyenne, Wyoming USA, succursale di Paradiso<i> in <a target="_top" href="/info/ger/TI501.htm">Paradiso</a></i>, Ausl. ZN, gelöscht: Publ.Dat. 25.05.2010,
<a target="result" href="/WebServices/Zefix/Zefix.asmx/ShowFirm?parId=349506&parChnr=CH-514.9.009.263-7&language=1">+</a>, <a target="_blank" href="http://ti.powernet.ch/webservices/inet/HRG/HRG.asmx/getHRGHTML?chnr=5149009263&amt=501&toBeModified=0&validOnly=0&lang=1&sort=0">CHE-104.147.677</a>, <a target="_blank" href="http://ti.powernet.ch/webservices/inet/HRG/HRG.asmx/getHRGPDF?chnr=5149009263&amt=501&toBeModified=0&validOnly=0&lang=1&sort=0">PDF</a><p /><b>Dedalo SA</b><i> in <a target="_top" href="/info/ger/TI501.htm">Chiasso</a></i>, AG, <a target="result" href="/WebServices/Zefix/Zefix.asmx/ShowFirm?parId=1144906&parChnr=CH-501.3.017.898-0&language=1">+++</a>, <a target="_blank" href="http://ti.powernet.ch/webservices/inet/HRG/HRG.asmx/getHRGHTML?chnr=5013017898&amt=501&toBeModified=0&validOnly=0&lang=1&sort=0">CHE-226.878.749</a>, <a target="_blank" href="http://ti.powernet.ch/webservices/inet/HRG/HRG.asmx/getHRGPDF?chnr=5013017898&amt=501&toBeModified=0&validOnly=0&lang=1&sort=0">PDF</a><p /><b>Dedalos R&D</b><i> in <a target="_top" href="/info/ger/TI501.htm">Bellinzona</a></i>, Verein, <a target="result" href="/WebServices/Zefix/Zefix.asmx/ShowFirm?parId=431256&parChnr=CH-500.6.004.353-6&language=1">+</a>, <a target="_blank" href="http://ti.powernet.ch/webservices/inet/HRG/HRG.asmx/getHRGHTML?chnr=5006004353&amt=501&toBeModified=0&validOnly=0&lang=1&sort=0">CHE-104.771.605</a>, <a target="_blank" href="http://ti.powernet.ch/webservices/inet/HRG/HRG.asmx/getHRGPDF?chnr=5006004353&amt=501&toBeModified=0&validOnly=0&lang=1&sort=0">PDF</a><p /><b>DEDALUS DIVERS Sagl</b><i> in <a target="_top" href="/info/ger/TI501.htm">Gordola</a></i>, GmbH, <a target="result" href="/WebServices/Zefix/Zefix.asmx/ShowFirm?parId=1107221&parChnr=CH-501.4.016.642-1&language=1">+</a>, <a target="_blank" href="http://ti.powernet.ch/webservices/inet/HRG/HRG.asmx/getHRGHTML?chnr=5014016642&amt=501&toBeModified=0&validOnly=0&lang=1&sort=0">CHE-167.108.200</a>, <a target="_blank" 
href="http://ti.powernet.ch/webservices/inet/HRG/HRG.asmx/getHRGPDF?chnr=5014016642&amt=501&toBeModified=0&validOnly=0&lang=1&sort=0">PDF</a><p /><b>Dedalus SA</b><i> in <a target="_top" href="/info/ger/TI501.htm">Breggia</a></i>, AG, <a target="result" href="/WebServices/Zefix/Zefix.asmx/ShowFirm?parId=404462&parChnr=CH-524.3.006.007-5&language=1">+++</a>, <a target="_blank" href="http://ti.powernet.ch/webservices/inet/HRG/HRG.asmx/getHRGHTML?chnr=5243006007&amt=501&toBeModified=0&validOnly=0&lang=1&sort=0">CHE-106.145.979</a>, <a target="_blank" href="http://ti.powernet.ch/webservices/inet/HRG/HRG.asmx/getHRGPDF?chnr=5243006007&amt=501&toBeModified=0&validOnly=0&lang=1&sort=0">PDF</a><p /><b>EDIL DEDALO S.A.G.L.</b><i> in <a target="_top" href="/info/ger/TI501.htm">Balerna</a></i>, GmbH, <a target="result" href="/WebServices/Zefix/Zefix.asmx/ShowFirm?parId=1150282&parChnr=CH-501.4.017.854-1&language=1">+++</a>, <a target="_blank" href="http://ti.powernet.ch/webservices/inet/HRG/HRG.asmx/getHRGHTML?chnr=5014017854&amt=501&toBeModified=0&validOnly=0&lang=1&sort=0">CHE-232.905.567</a>, <a target="_blank" href="http://ti.powernet.ch/webservices/inet/HRG/HRG.asmx/getHRGPDF?chnr=5014017854&amt=501&toBeModified=0&validOnly=0&lang=1&sort=0">PDF</a><p /><hr size="5" /></font>
<script type="text/javascript">
var _paq = _paq || [];
_paq.push(['trackPageView']);
_paq.push(['enableLinkTracking']);
(function() {
var u = (("https:" == document.location.protocol) ? "https" : "http") + "://www.e-service.admin.ch/analytics/";
_paq.push(['setTrackerUrl', u + 'piwik.php']);
_paq.push(['setSiteId', 4]);
var d = document,
g = d.createElement('script'),
s = d.getElementsByTagName('script')[0];
g.type = 'text/javascript';
g.defer = true;
g.async = true;
g.src = u + 'piwik.js';
s.parentNode.insertBefore(g, s);
})();
</script>
<noscript>
<p>
<img src="http://www.e-service.admin.ch/analytics/piwik.php?idsite=4" style="border:0;" alt="" />
</p>
</noscript>
</body>
</html>
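But I don't need all of that data. I'm using Simple_html_dom https://github.com/samacs/simple_html_dom
Fetching the response is something I can handle; the only problem is parsing that string. From each result I need to produce HTML like this: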
<p>COMPANY NAME</p>
<a class="che" href="LINK CHE">CHE</a>
<a class="pdf" href="PDF LINK">PDF</a>
The problem is that sometimes there is no PDF link, and I don't know what to anchor the parsing on :(
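
To make the question concrete, here is a rough sketch of the kind of logic I think is needed, written in Python with only the standard library rather than with Simple_html_dom. It assumes that every result in the response ends with `<p />`, that the company name is whatever text comes before the first `<i>`, and that the CHE and PDF links can be told apart by their link text (`CHE-...` versus `PDF`). The names `parse_results`, `render`, and `SAMPLE` are made up for illustration, and `SAMPLE` is a trimmed copy of two results from the response above with the query strings shortened.

```python
import re
from html import escape

# Matches one <a ... href="...">text</a>; the negated classes also span newlines.
ANCHOR = re.compile(r'<a\b[^>]*?href="([^"]*)"[^>]*>(.*?)</a>', re.S)
TAG = re.compile(r'<[^>]+>')

# Two results trimmed from the response in the question: the first has no PDF link.
SAMPLE = (
    '<br />Zentraler Firmenindex<hr /><b>DEDAL FILMS, Albrecht</b>'
    '<i> in <a target="_top" href="/info/ger/VS626.htm">Lens</a></i>, Einzelunt., '
    '<a target="result" href="/WebServices/Zefix/Zefix.asmx/ShowFirm?parId=1058678">+</a>, '
    '<a target="_blank" href="http://vs.powernet.ch/getHRGHTML?chnr=6261014253">CHE-150.481.375</a><p />'
    '<b>Dedalo SA</b><i> in <a target="_top" href="/info/ger/TI501.htm">Chiasso</a></i>, AG, '
    '<a target="result" href="/WebServices/Zefix/Zefix.asmx/ShowFirm?parId=1144906">+++</a>, '
    '<a target="_blank" href="http://ti.powernet.ch/getHRGHTML?chnr=5013017898">CHE-226.878.749</a>, '
    '<a target="_blank" href="http://ti.powernet.ch/getHRGPDF?chnr=5013017898">PDF</a><p />'
    '<hr size="5" /></font>'
)


def parse_results(html):
    """Return one dict per result; 'pdf_href' is None when no PDF link exists."""
    # Assumption: results sit between the first <hr /> and the final <hr size="5" />.
    match = re.search(r'<hr\s*/>(.*?)<hr\s+size=', html, re.S)
    if match is None:
        return []
    records = []
    # Assumption: each result is terminated by an empty <p /> element.
    for chunk in re.split(r'<p\s*/>', match.group(1)):
        if not chunk.strip():
            continue
        # The name is the text before "<i> in ..."; deleted firms have no <b> around it.
        name = TAG.sub('', chunk.split('<i>', 1)[0]).strip()
        record = {'name': name, 'che': None, 'che_href': None, 'pdf_href': None}
        for href, text in ANCHOR.findall(chunk):
            label = text.strip()
            if label.startswith('CHE-'):
                record['che'] = label
                record['che_href'] = href
            elif label == 'PDF':
                record['pdf_href'] = href
        records.append(record)
    return records


def render(record):
    """Build the target markup, leaving out the PDF anchor when it is missing."""
    lines = ['<p>%s</p>' % escape(record['name'])]
    if record['che_href'] is not None:
        lines.append('<a class="che" href="%s">%s</a>'
                     % (escape(record['che_href']), escape(record['che'])))
    if record['pdf_href'] is not None:
        lines.append('<a class="pdf" href="%s">PDF</a>' % escape(record['pdf_href']))
    return '\n'.join(lines)


if __name__ == '__main__':
    for rec in parse_results(SAMPLE):
        print(render(rec))
```

The idea is to split on the record separator first and only then look for links inside each piece, so a missing PDF link just leaves that field empty instead of shifting the following results. I would expect the same split-then-search approach to carry over to PHP, but I have not tried it there.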