I have received a very irregular HTML file.
I need to extract the text inside every TD of this file. Here is a sample of it:
<tr class="" rel="30887721">
<td class="leftborder timestamp" rel="1472298782">
<span class="updatets "> 9mins </span>
</td>
<td>
<span>
<style>
.NFK2{display:none}
.gPwA{display:inline}
.Zb70{display:none}
.vFY2{display:inline}
</style>
<span style="display:none">54</span>
<span class="NFK2">54</span>
<div style="display:none">54</div>
<span class="vFY2">124</span>
<span style="display: inline">.</span>
<span class="7">240</span>
<span class="235">.</span>
<div style="display:none">17</div>
<span class="NFK2">62</span>
<span></span>
<span style="display:none">121</span>
<span></span>
<span style="display: inline">187</span>
<span style="display:none">190</span>
<span class="Zb70">190</span>
<span class="NFK2">197</span>
<span></span>
<span style="display: inline">.</span>
<span class="248">80</span>
<div style="display:none">152</div>
<span style="display:none">166</span>
<div style="display:none">166</div>
</span>
</td>
<td> 80 </td>
<td style="text-align:left" class="country" rel="cn">
<span style="white-space:nowrap;">
<img src="/images/1x1.png" style="width: 16px; height: 11px; margin-right: 5px;" class="flags-cn" alt="flag "/>
China
</span>
</td>
<td>
<div class="progress-indicator response_time" style="width: 114px" value="1314" levels="speed" rel="1314">
<div class="indicator" style="width: 87%; background-color: rgb(0, 173, 173)"></div>
</div>
</td>
<td>
<div class="progress-indicator connection_time" style="width: 114px" title="" rel="427" value="427" levels="speed">
<div class="indicator" style="width: 91%; background-color: rgb(0, 173, 173)"></div>
</div>
</td>
<td> HTTP </td>
<td nowrap> High +KA </td>
</tr>
<tr class="altshade" rel="30887719">
<td class="leftborder timestamp" rel="1472298723">
<span class="updatets "> 10mins </span>
</td>
<td>
<span>
<style>
.ZQOg{display:none}
.hAKN{display:inline}
.sZYH{display:none}
.euLE{display:inline}
.pnDV{display:none}
.yf2r{display:inline}
</style>
<span style="display:none">30</span>
<div style="display:none">30</div>
<span class="yf2r">124</span>
<span style="display: inline">.</span>
<span style="display:none">62</span>
<span style="display: inline">244</span>
<span style="display: inline">.</span>
<span class="pnDV">6</span>
<div style="display:none">6</div>
<span class="ZQOg">39</span>
<div style="display:none">39</div>
<span style="display:none">71</span>
<div style="display:none">71</div>
<span style="display:none">103</span>
<span class="sZYH">103</span>
<span></span>
<span class="euLE">157</span>
<span style="display:none">188</span>
<div style="display:none">188</div>
<div style="display:none">208</div>
<span style="display:none">220</span>
<div style="display:none">220</div>
<span class="sZYH">231</span>
<span style="display:none">241</span>
<span class="hAKN">.</span>
<span class="sZYH">26</span>
<span></span>
<span class="sZYH">31</span>
<span></span>
<span style="display:none">66</span>
<div style="display:none">66</div>
<span style="display:none">84</span>
<span class="pnDV">84</span>
<span></span>
<span style="display:none">166</span>
<span class="sZYH">166</span>
<div style="display:none">166</div>
<span style="display:none">207</span>
<span></span>
<span style="display: inline">209</span>
<span class="sZYH">212</span>
<div style="display:none">212</div>
<span style="display:none">241</span>
<span class="pnDV">241</span>
</span>
</td>
<td> 80 </td>
<td style="text-align:left" class="country" rel="hk">
<span style="white-space:nowrap;">
<img src="/images/1x1.png" style="width: 16px; height: 11px; margin-right: 5px;" class="flags-hk" alt="flag "/>
Hong Kong
</span>
</td>
<td>
<div class="progress-indicator response_time" style="width: 114px" value="1165" levels="speed" rel="1165">
<div class="indicator" style="width: 88%; background-color: rgb(0, 173, 173)"></div>
</div>
</td>
<td>
<div class="progress-indicator connection_time" style="width: 114px" title="" rel="287" value="287" levels="speed">
<div class="indicator" style="width: 94%; background-color: rgb(0, 173, 173)"></div>
</div>
</td>
<td> HTTP </td>
<td nowrap> High +KA </td>
</tr>
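The extracted result should look like this:
9mins 124.240.187.80 80 China HTTP High +KA
10mins 124.244.157.209 80 Hong Kong HTTP High +KA
I am running into a lot of problems getting this result.
The first is the invalid markup: span inside span, style inside span, and so on.
The second is that the file needs some on-the-fly parsing in order to evaluate the <style> tags inside it.
The <style> tags and the style attributes determine which elements are displayed and which are not.
I am using C# + CsQuery to extract this result, but so far without success:
CQ dom = CQ.Create(text);
CQ tr = dom.Select("table tr");
foreach(var item in tr)
{
    string lastCheck = tr.Select("td:eq(0)").Text(); //9mins
    string ip = tr.Select("td:eq(1)").Text();
    string port = tr.Select("td:eq(2)").Text(); //80
    string country = tr.Select("td:eq(3)").Text(); //China
    string protocol = tr.Select("td:eq(6)").Text(); //HTTP
    string anonymity = tr.Select("td:eq(7)").Text(); //High + KA
}
The ip variable returns something like:
".Yj0s{display:none}\n.YSE7{display:inline}\n.zURn{display:none}\n.odWZ{display:inline}637891919292106106137183183183188245245254.85135.166.117177214214225"
If I change the ip variable to get the HTML instead:
string ip = tr.Select("td:eq(1)").Html();
it returns the raw markup of that cell instead.
How can I get the IP to show the correct value?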
Answer 0 (score: 1)
I think you need to do a few things:
Remove from the DOM any element that has style="display:none". This is easy to do in CsQuery:
dom.Select("*:hidden").Remove();
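Parse the contents of the <style> elements and remove the elements that are hidden by the declarations inside them. Rather than reaching for regular expressions, let's do this properly and parse the CSS with ExCSS. Here is a method that takes a CsQuery selector, uses ExCSS to parse all of its <style> elements, and removes every element whose style is set to display: none: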
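// Namespaces this method relies on (place at the top of the file):
// using System.Linq;
// using CsQuery;
// using ExCSS;
void RemoveElementsHiddenByStyles(CQ selector)
{
    var parser = new Parser();
    foreach (IDomElement style in selector.Select("style"))
    {
        // Parse the CSS text of this <style> element.
        StyleSheet stylesheet = parser.Parse(style.InnerText);
        foreach (StyleRule styleRule in stylesheet.StyleRules)
        {
            // Remove everything matched by a rule that declares display: none.
            if (styleRule.Declarations.Any(d => d.Name == "display" && d.Term.ToString() == "none"))
            {
                selector.Select(styleRule.Selector.ToString()).Remove();
            }
        }
    }
}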
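Once the contents of each <style> element have been parsed, the element itself can be removed.
Take care to do this row by row, because a style declaration in one row may conflict with one in another row.
Remove all whitespace from the resulting element text. I'll leave it to you to write a suitable RemoveAllWhitespace method. This answer may be helpful.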
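As one possible take on the RemoveAllWhitespace helper mentioned above, here is a minimal sketch using only the .NET base class library. The TextCleanup and Program class names and the sample input string are assumptions made for illustration; they do not come from the original post.

```csharp
using System;
using System.Linq;

static class TextCleanup
{
    // Hypothetical helper: returns the input with every Unicode whitespace
    // character (spaces, tabs, newlines, ...) removed.
    public static string RemoveAllWhitespace(string input)
    {
        if (string.IsNullOrEmpty(input))
        {
            return string.Empty;
        }
        return new string(input.Where(c => !char.IsWhiteSpace(c)).ToArray());
    }
}

static class Program
{
    static void Main()
    {
        // Illustrative text, shaped like the IP cell once hidden elements are gone.
        string raw = "\n 124 \n . \n 240 \n . \n 187 \n . \n 80 \n";
        Console.WriteLine(TextCleanup.RemoveAllWhitespace(raw)); // prints 124.240.187.80
    }
}
```

Filtering with char.IsWhiteSpace avoids a regular expression and also covers tabs and line breaks, which a plain Replace(" ", "") would miss.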
Putting it all together, we have the following:
CQ dom = CQ.Create(text);
dom.Select("*:hidden").Remove();
CQ rows = dom.Select("table tr");
foreach (var item in rows)
{
    CQ row = CQ.Create(item);
    RemoveElementsHiddenByStyles(row);
    row.Select("style").Remove();
    string lastCheck = row.Select("td:eq(0)").Text().Trim(); // 9mins
    string ip = RemoveAllWhitespace(row.Select("td:eq(1)").Text()); // 124.240.187.80
    string port = row.Select("td:eq(2)").Text().Trim(); // 80
    string country = row.Select("td:eq(3)").Text().Trim(); // China
    string protocol = row.Select("td:eq(6)").Text().Trim(); // HTTP
    string anonymity = row.Select("td:eq(7)").Text().Trim(); // High +KA
}
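Also note that I have avoided using tr as a variable name: in your code tr holds the list of all rows, yet inside the loop body it looks as though you are using it for a single row.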