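I have been trying different ways to extract data from an HTML table, such as using XPath. The table has no classes, so I don't know how to write an XPath expression without a class or id to target. The data is retrieved from an RSS XML file, and I am currently using DOM. Once the data is extracted, I want to sort the table by job title.
Here is my PHP code: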
$html='';
$xml= simplexml_load_file($url) or die("ERROR: Cannot connect to url\n check if report still exist in the Gradleaders system");
/*What we do here in this loop is retrieve all content inside the encoded content,
*which includes the CDATA information. This is where the HTML and styling is included.
*/
foreach($xml->channel->item as $cont){
$html=''.$cont->children('content',true)->encoded.'<br>'; //actual tag name is encoded
}
$htmlParser= new DOMDocument(); //to parse html using DOMDocument
libxml_use_internal_errors(true); // your HTML gives parser warnings, keep them internal
$htmlParser->loadHTML($html); //Loaded the html string we took from simple xml
$htmlParser->preserveWhiteSpace = false;
$tables= $htmlParser->getElementsByTagName('table');
$rows= $tables->item(0)->getElementsByTagName('tr');
foreach($rows as $row){
$cols = $row->getElementsByTagName('td');
echo $cols;
}
This is the HTML I am extracting the information from:
<table cellpadding='1' cellspacing='2'>
<tr>
<td><b>Job Title:</b></td>
<td>Job Example </td>
</tr>
<tr>
<td><b>Job ID:</b></td>
<td>23992</td>
</tr>
<tr>
<td><b>Job Description:</b></td>
<td>Just a job example </td>
</tr>
<tr>
<td><b>Job Category:</b></td>
<td>Work-study Position</td>
</tr>
<tr>
<td><b>Position Type:</b></td>
<td>Work-study</td>
</tr>
<tr>
<td><b>Applicant Type:</b></td>
<td>Work-study</td>
</tr>
<tr>
<td><b>Status:</b></td>
<td>Active</td>
</tr>
<tr>
<td colspan='2'><b><a href='https://www.myjobs.com/tuemp/job_view.aspx?token=I1iBwstbTs2pau+SjrYfWA%3d%3d'>Click to View More</a></b></td>
</tr>
</table>
Answer 0 (score: 5)
You can use XPath with query('//td') and retrieve each td's HTML with C14N(), for example:
$dom = new DOMDocument();
$dom->loadHtml($html);
$x = new DOMXpath($dom);
foreach($x->query('//td') as $td){
echo $td->C14N();
//if just need the text use:
//echo $td->textContent;
}
Output:
<td><b>Job Title:</b></td>
<td>Job Example </td>
<td><b>Job ID:</b></td>
...
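C14N() returns the canonicalized node as a string, or FALSE on failure.
Update
Another question: how do I grab a single table cell? For example, the Job ID.
Use the XPath contains() function, i.e.: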
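Since C14N() can return FALSE, a defensive version of the loop might skip cells that fail to serialize. This is only a minimal sketch: the one-row `$html` string is a shortened stand-in for the question's table, and `$cells` is a name introduced here for illustration.

```php
<?php
// Shortened stand-in for the table in the question (assumption).
$html = "<table><tr><td><b>Job ID:</b></td><td>23992</td></tr></table>";

$dom = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings internal
$dom->loadHTML($html);
libxml_clear_errors();

$x = new DOMXPath($dom);
$cells = [];
foreach ($x->query('//td') as $td) {
    $markup = $td->C14N();
    if ($markup === false) {
        continue; // canonicalization failed, skip this cell
    }
    $cells[] = $markup;
}

echo implode("\n", $cells), "\n";
```

If canonical XML is not required, `$dom->saveHTML($td)` is an alternative that serializes the node as HTML.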
foreach($x->query('//td[contains(., "Job ID:")]') as $td){
echo $td->textContent;
}
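Note that contains() matches any cell whose text includes the label, so a value cell that happens to quote "Job ID:" would match too. A stricter variant, sketched here under the assumption that labels are always the full text of their cell, compares with normalize-space() instead. The second table row is invented purely to show the difference.

```php
<?php
// Hypothetical markup: the second row's value cell also mentions "Job ID:".
$html = "<table>"
      . "<tr><td><b>Job ID:</b></td><td>23992</td></tr>"
      . "<tr><td><b>Job Description:</b></td><td>Quote Job ID: 23992 when applying</td></tr>"
      . "</table>";

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$x = new DOMXPath($dom);

// Substring match: finds the label cell and the description cell.
$loose = $x->query('//td[contains(., "Job ID:")]');

// Exact match on the trimmed cell text: finds only the label cell.
$strict = $x->query('//td[normalize-space(.) = "Job ID:"]');

echo $loose->length, " loose match(es), ", $strict->length, " strict match(es)\n";
```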
Update V2:
How do I get the next table cell after this one (to get the actual Job ID)?
Use following-sibling::*[1], i.e.:
echo $x->query('//td[contains(*, "Job ID:")]/following-sibling::*[1]')->item(0)->textContent;
//23992
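Putting these pieces together for the original goal (sorting by job title), one possible approach is to turn each item's table into a label => value array and then sort the collected arrays with usort(). The sketch below is not the asker's actual feed: the inline `$rss` string is a made-up two-item sample, and `tableToFields()` is a hypothetical helper named here for illustration.

```php
<?php
// Made-up two-item feed standing in for the real RSS document (assumption).
$rss = <<<XML
<?xml version="1.0"?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
<item><content:encoded><![CDATA[<table>
<tr><td><b>Job Title:</b></td><td>Tutor </td></tr>
<tr><td><b>Job ID:</b></td><td>24010</td></tr>
</table>]]></content:encoded></item>
<item><content:encoded><![CDATA[<table>
<tr><td><b>Job Title:</b></td><td>Job Example </td></tr>
<tr><td><b>Job ID:</b></td><td>23992</td></tr>
<tr><td colspan='2'><b><a href='#'>Click to View More</a></b></td></tr>
</table>]]></content:encoded></item>
</channel>
</rss>
XML;

// Hypothetical helper: maps each two-cell row to "label => value".
function tableToFields(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $fields = [];
    // Rows with exactly two cells are label/value pairs; the link row is skipped.
    foreach ($xpath->query('//tr[count(td) = 2]') as $row) {
        $cells = $xpath->query('td', $row);
        $label = rtrim(trim($cells->item(0)->textContent), ':');
        $fields[$label] = trim($cells->item(1)->textContent);
    }
    return $fields;
}

$xml = simplexml_load_string($rss);
$jobs = [];
foreach ($xml->channel->item as $item) {
    // Parse each item separately so no item's table is overwritten.
    $encoded = (string) $item->children('content', true)->encoded;
    $jobs[] = tableToFields($encoded);
}

// Case-insensitive sort by job title.
usort($jobs, function (array $a, array $b): int {
    return strcasecmp($a['Job Title'] ?? '', $b['Job Title'] ?? '');
});

foreach ($jobs as $job) {
    echo $job['Job Title'], ' (', $job['Job ID'], ")\n";
}
```

Parsing each item on its own, rather than reassigning a single `$html` string inside the loop, keeps every job in the result instead of only the last one.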
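Answer 1 (score: -2)
$xpathParser = new DOMXPath($htmlParser);
$tableDataNodes = $xpathParser->evaluate("//table/tr/td");
// DOMNodeList exposes ->length and ->item(); nodes cannot be echoed directly
for ($x = 0; $x < $tableDataNodes->length; $x++) {
echo $tableDataNodes->item($x)->textContent, "\n";
}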
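One thing evaluate() offers over query() is that it can return a typed result when the expression produces one, for instance via the XPath string() or count() functions. A minimal sketch, assuming the same `$htmlParser` variable name as above and a shortened one-row stand-in for the question's table:

```php
<?php
// Shortened stand-in for the table in the question (assumption).
$html = "<table><tr><td><b>Job ID:</b></td><td>23992</td></tr></table>";

$htmlParser = new DOMDocument();
libxml_use_internal_errors(true);
$htmlParser->loadHTML($html);
libxml_clear_errors();

$xpathParser = new DOMXPath($htmlParser);

// string(...) makes evaluate() return a PHP string instead of a node list.
$jobId = trim($xpathParser->evaluate(
    'string(//td[contains(., "Job ID:")]/following-sibling::td[1])'
));

// count(...) makes evaluate() return a number (a float in PHP).
$cellCount = (int) $xpathParser->evaluate('count(//td)');

echo "Job ID: ", $jobId, ", cells: ", $cellCount, "\n";
```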