Using JavaScript to find text that is not inside any element

Asked: 2014-01-23 16:56:21

Tags: javascript jquery

I have HTML like the following. I am scraping the page with Node.js.

<div id="content_column">

<CENTER>01/23/2014</CENTER>
<BR> <B>Name : </B> GLUCK MARTIN <BR> <B>Address : </B> <BR>
<B>Profession : </B> MEDICINE <BR> <B>License No: </B> 077798 <BR>
<B>Date of Licensure : </B> 05/05/56 <BR> <B>Additional
    Qualification : </B> &nbsp; <BR> <B> <A
    href="http://www.op.nysed.gov/help.htm#status"> Status :</A></B> DECEASED
11/24/13 <BR> <B>Registered through last day of : </B> <BR> <B>Medical
    School: </B> UNIVERSITY OF GENEVA <B>&nbsp;&nbsp;&nbsp; Degree Date :
</B> Not on file <BR>
<HR>
<div class="note">
    (Use your browser's back key to return to licensee list.)<BR> <BR>
    * Use of this online verification service signifies that you have read
    and agree to the <A href="http://www.op.nysed.gov/usage.htm">terms
        and conditions of use</A>. See <A href="http://www.op.nysed.gov/help.htm">HELP
        glossary</A> for further explanations of terms used on this page. <BR>
    <BR> <B>Note: </B> The Board of Regents does not discipline <i>physicians(medicine),
        physician assistants,</i> or <i>specialist assistants.</i> The status of
    individuals in these professions may be impacted by information
    provided by the NYS Department of Health. To search for the latest
    discipline actions against individuals in these professions, please
    check the New York State Department of Health's <A
        href="http://www.health.state.ny.us/nysdoh/opmc/main.htm"> Office
        of Professional Medical Conduct</A> homepage.
    </UL>
</div>
<HR>
<div class="note">
    Further information on physicians may be found on the following
    external sites (The State Education Department is not responsible for
    the accuracy or completeness of information located on external
    Internet addresses.): <BR> <BR> <a
        href="http://www.abms.org/">American Board of Medical Specialties</a>
    <BR> <BR> <a href="http://www.ama-assn.org/">American
        Medical Association:</a> <BR> - For the general public: <a
        href="http://www.ama-assn.org/aps/amahg.htm">AMA Physician
        Select, On-line Doctor Finder</a><BR>&nbsp;&nbsp;&nbsp; <BR> -
    For organizations that verify physician credentials: <a
        href="http://www.ama-assn.org/physdata/physrel/physrel.htm">AMA
        Physician Profiles</a> <BR> <BR> <a
        href="http://www.aoa-net.org/">American Osteopathic Association,
        AOA-Net</a> <BR> <BR> <a href="http://www.docboard.org/">Association
        of State Medical Board Executive Directors-(A.I.M."DOCFINDER")</a> <BR>
    <BR> <a href="http://www.nydoctorprofile.com/welcome.jsp">New
        York State Department of Health Physician Profiles</a><BR> <BR>The
    following sites provide additional information concerning the medical
    profession: <BR> <BR> <a href="http://www.clearhq.org/">CLEAR
        (Council on Licensure, Enforcement and Regulation)</a> <BR> <BR>
    <a href="http://www.fsmb.org/">Federation of State Medical Boards</a><BR>
    <BR>
</div>
<CENTER>
    <BR> <IMG SRC="http://www.op.nysed.gov/Sedseal.jpg" WIDTH="100"
        HEIGHT="101" ALT="Seal of the State Education Department"><BR>
    <BR>
</CENTER>
</div>

How can I find the values that do not belong to any element? In this case they are GLUCK MARTIN, MEDICINE, 077798, 05/05/56, and so on.

3 answers:

Answer 0 (score: 0)

This is easy with jQuery: combine .not() with :contains:

$("#content_column").not(":contains('GLUCK MARTIN')")

Answer 1 (score: 0)

See this answer:

$('#content_column').clone().children().remove().end().text()

Here is a fiddle with sample markup.
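
For reference, the jQuery chain in this answer clones the container, removes every child element from the clone, and reads the text that remains. The same idea can be sketched without jQuery, assuming any standard DOM implementation is available. directText is a hypothetical helper name, not part of jQuery or the DOM.

```javascript
// Hypothetical helper: return the text of the direct text-node children of
// `element`, skipping whitespace-only nodes. It relies only on the standard
// childNodes / nodeType / nodeValue properties.
function directText(element) {
  var TEXT_NODE = 3; // same value as Node.TEXT_NODE
  var pieces = [];
  for (var i = 0; i < element.childNodes.length; i++) {
    var node = element.childNodes[i];
    if (node.nodeType !== TEXT_NODE) continue;
    // Collapse whitespace runs (\s also matches the non-breaking space
    // produced by &nbsp;) and trim the ends.
    var text = node.nodeValue.replace(/\s+/g, " ").trim();
    if (text) pieces.push(text);
  }
  return pieces;
}
```

Called on the content_column div from the question, this should return only the strings that sit directly under the div, such as GLUCK MARTIN and MEDICINE. Unlike .text(), it keeps the values as separate array entries instead of one concatenated string.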

Answer 2 (score: 0)

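In Node, I would recommend working with a DOM rather than scraping with something like regex. jsdom is a good option, and it lets you build a DOM from a fragment. From there you can query document.documentElement (in my example I will use jQuery) and extract any direct text nodes that are not wrapped in a tag.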

// Collect every text node in the document.
// Note: jsdom.env is the legacy jsdom API; newer releases use the JSDOM constructor.
var jsdom = require("jsdom");

jsdom.env(
  "URL OR YOUR HTML STRING HERE",
  ["http://code.jquery.com/jquery.js"],
  function (errors, window) {
    var textNodes = window.$(window.document.documentElement)
        .find(":not(iframe)")
        .addBack()
        .contents()
        .filter(function() {
            return this.nodeType === 3; // 3 is Node.TEXT_NODE
        });
    // do something with textNodes
  }
);
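
Once a DOM is available, one possible next step is to pair each bold label with the text that follows it. The following is a minimal sketch and not part of the original answer. labelledValues is a hypothetical helper, and it assumes each value is the text node immediately after its B label, as in the question's markup.

```javascript
// Hypothetical helper: map each <B> label that is a direct child of
// `container` to the text node immediately following it.
function labelledValues(container) {
  var result = {};
  var children = container.childNodes;
  for (var i = 0; i < children.length; i++) {
    var node = children[i];
    // Only element nodes (nodeType 1) named B are treated as labels.
    if (node.nodeType !== 1 || node.nodeName.toUpperCase() !== "B") continue;
    // Normalise whitespace and drop the trailing colon from the label.
    var label = node.textContent
      .replace(/\s+/g, " ")
      .replace(/\s*:\s*$/, "")
      .trim();
    var next = node.nextSibling;
    // A value counts only if the very next sibling is a text node (nodeType 3).
    var value = next && next.nodeType === 3
      ? next.nodeValue.replace(/\s+/g, " ").trim()
      : "";
    if (label) result[label] = value;
  }
  return result;
}
```

Inside the callback above this could be called as labelledValues(window.document.getElementById("content_column")). Labels with nothing after them, such as Address in the question, map to an empty string.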