I have HTML like the following, which I'm scraping with Node.js.
<div id="content_column">
<CENTER>01/23/2014</CENTER>
<BR> <B>Name : </B> GLUCK MARTIN <BR> <B>Address : </B> <BR>
<B>Profession : </B> MEDICINE <BR> <B>License No: </B> 077798 <BR>
<B>Date of Licensure : </B> 05/05/56 <BR> <B>Additional
Qualification : </B> <BR> <B> <A
href="http://www.op.nysed.gov/help.htm#status"> Status :</A></B> DECEASED
11/24/13 <BR> <B>Registered through last day of : </B> <BR> <B>Medical
School: </B> UNIVERSITY OF GENEVA <B> Degree Date :
</B> Not on file <BR>
<HR>
<div class="note">
(Use your browser's back key to return to licensee list.)<BR> <BR>
* Use of this online verification service signifies that you have read
and agree to the <A href="http://www.op.nysed.gov/usage.htm">terms
and conditions of use</A>. See <A href="http://www.op.nysed.gov/help.htm">HELP
glossary</A> for further explanations of terms used on this page. <BR>
<BR> <B>Note: </B> The Board of Regents does not discipline <i>physicians(medicine),
physician assistants,</i> or <i>specialist assistants.</i> The status of
individuals in these professions may be impacted by information
provided by the NYS Department of Health. To search for the latest
discipline actions against individuals in these professions, please
check the New York State Department of Health's <A
href="http://www.health.state.ny.us/nysdoh/opmc/main.htm"> Office
of Professional Medical Conduct</A> homepage.
</UL>
</div>
<HR>
<div class="note">
Further information on physicians may be found on the following
external sites (The State Education Department is not responsible for
the accuracy or completeness of information located on external
Internet addresses.): <BR> <BR> <a
href="http://www.abms.org/">American Board of Medical Specialties</a>
<BR> <BR> <a href="http://www.ama-assn.org/">American
Medical Association:</a> <BR> - For the general public: <a
href="http://www.ama-assn.org/aps/amahg.htm">AMA Physician
Select, On-line Doctor Finder</a><BR> <BR> -
For organizations that verify physician credentials: <a
href="http://www.ama-assn.org/physdata/physrel/physrel.htm">AMA
Physician Profiles</a> <BR> <BR> <a
href="http://www.aoa-net.org/">American Osteopathic Association,
AOA-Net</a> <BR> <BR> <a href="http://www.docboard.org/">Association
of State Medical Board Executive Directors-(A.I.M."DOCFINDER")</a> <BR>
<BR> <a href="http://www.nydoctorprofile.com/welcome.jsp">New
York State Department of Health Physician Profiles</a><BR> <BR>The
following sites provide additional information concerning the medical
profession: <BR> <BR> <a href="http://www.clearhq.org/">CLEAR
(Council on Licensure, Enforcement and Regulation)</a> <BR> <BR>
<a href="http://www.fsmb.org/">Federation of State Medical Boards</a><BR>
<BR>
</div>
<CENTER>
<BR> <IMG SRC="http://www.op.nysed.gov/Sedseal.jpg" WIDTH="100"
HEIGHT="101" ALT="Seal of the State Education Department"><BR>
<BR>
</CENTER>
</div>
How can I find the values that don't belong to any element? In this case they are GLUCK MARTIN, MEDICINE, 077798, 05/05/56, and so on.
Answer 0 (score: 0)
This is easy with jQuery: combine .not() with :contains:
$("#content_column").not(":contains('GLUCK MARTIN')")
Answer 1 (score: 0)
Answer 2 (score: 0)
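In Node, I'd suggest working with a DOM rather than scraping with something like regex. jsdom is a good choice, and it lets you build a DOM from an HTML fragment. From there you can query document.documentElement (my example uses jQuery) and extract any direct text nodes that aren't wrapped in a tag.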
// Collect every text node in the document (nodeType 3 is a text node)
// Note: jsdom.env() is the legacy jsdom API; it was removed in jsdom 10
var jsdom = require("jsdom");
jsdom.env(
  "URL OR YOUR HTML STRING HERE",
  ["http://code.jquery.com/jquery.js"],
  function (errors, window) {
    var textNodes = window.$(window.document.documentElement)
      .find(":not(iframe)")
      .addBack()
      .contents()
      .filter(function () {
        return this.nodeType === 3;
      });
    // do something with textNodes
  }
);
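To go from a flat list of text nodes to the actual field values, one option is to walk the direct children of #content_column and pair each <B> label with the bare text that follows it. The following is only a minimal sketch under some assumptions: extractLabeledValues and normalize are hypothetical helper names, not part of jsdom or jQuery, and the code assumes labels are always <B> elements and that any element other than <B> or <BR> ends the field list, as in the HTML from the question. It uses only standard DOM node properties (childNodes, nodeType, nodeName, textContent), so it should work on a jsdom or browser node alike.

```javascript
// Sketch: map each <B> label to the bare text node(s) that follow it.
// extractLabeledValues and normalize are hypothetical helpers for illustration.

const ELEMENT_NODE = 1; // standard DOM nodeType for elements
const TEXT_NODE = 3;    // standard DOM nodeType for text nodes

// Collapse runs of whitespace; the source HTML wraps values across lines.
function normalize(text) {
  return text.replace(/\s+/g, " ").trim();
}

function extractLabeledValues(container) {
  const result = {};
  let label = null; // label of the field currently being filled, if any

  for (const node of Array.from(container.childNodes)) {
    const name = node.nodeName.toUpperCase();

    if (node.nodeType === ELEMENT_NODE && name === "B") {
      // A <B> element starts a new field; drop the trailing colon.
      label = normalize(node.textContent).replace(/\s*:$/, "");
      result[label] = "";
    } else if (node.nodeType === ELEMENT_NODE && name !== "BR") {
      // Assumption: any other element (<HR>, <div class="note">, ...)
      // ends the field list, so later bare text is ignored.
      label = null;
    } else if (node.nodeType === TEXT_NODE && label !== null) {
      // Bare text directly inside the container belongs to the last label.
      const value = normalize(node.textContent);
      if (value) {
        result[label] = result[label] ? result[label] + " " + value : value;
      }
    }
  }
  return result;
}
```

Inside the jsdom callback above, this could be called as extractLabeledValues(window.document.getElementById("content_column")). For the HTML in the question it should map "Name" to "GLUCK MARTIN" and "Profession" to "MEDICINE", and leave fields with no value, such as "Address", as empty strings.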