Extracting data from a mix of code (HTML, CSS, JavaScript) and data

Date: 2015-02-14 16:04:05

Tags: web-scraping screen-scraping text-parsing text-extraction data-extraction

I have some code with data embedded in it. Here is a sample:

<div class='clear' ></div>
    </div> <!-- findResultListing -->














    <div class='findResultListing ' id='result_listing_7_0' onclick='examMapManagerHandle.clickMarker(7,0);'>





    <a href='javascript:examMapManagerHandle.clickMarker(7,0);'>
        <img class='balloon' src='/system/themes/asp/img/gmarkerH.png' border='0' />
    </a>


        <div class='findResultInfo'>
                        <div class="nextStep">
                <a href="/system/modules/shibboleth/secure_find/shib_gateway.php?url=%2Fexams%2Fschedule.php%3Fnav%3Dexams%2Cstucourses%2Cexams%2Csched_exam%26amp%3Badd_locid%3D1672">
                    <img height="16" border="0" align="left" width="16" src="/system/themes/asp/img/schedule.png"/>Schedule&nbsp;Exam 
                </a>
            </div>

            <a href='javascript:examMapManagerHandle.clickMarker(7,0);' >


                    SJSU Testing

                    <img class='userType' border='0' src="/system/themes/asp/img/org.png" alt='Testing Site' title='Testing Site'/>




            </a>
            <br />


                                One Washington Square<br />

                                Industrial Studies Building 228<br />

                                San Jose, CA  95112<br />



                                Phone: (408) 924-5980<br />

                                Email: <span id="_smarty_mailto_span_2096382943_1423929156_8">&nbsp;</span>
            <noscript>To see email address, enable javascript</noscript>
            <script type="text/javascript">var mailto=document.getElementById("_smarty_mailto_span_2096382943_1423929156_8");            
               mailto.innerHTML='<a href="mailto:testing-office@sjsu.edu" >testing-office@sjsu.edu</a>';</script><br />




                    Fee for two hour exam: 

        $40.00      












                                <a class="helpBtn" onmouseover="asp_toolTip(this,' &lt;strong&gt;Fee Details:&lt;\/strong&gt; We charge $20 for the first hour and $10 for each half hour after... &lt;br /&gt;  &lt;strong&gt;Miscellaneous Fees:&lt;\/strong&gt; Test emailed in pdf/Word Doc., we will charge an administrative fee of $15 for 10 or more test pages &lt;br /&gt;  &lt;strong&gt;Parking Fee Details:&lt;\/strong&gt; Its $8.00 to park in the 10th St. garage on the corner of 9th &amp; E. San Fernando Sts.', 'findResultsToolTip', 'fit_west', 'map_results_pane');"></a>

            <br />



                            </div><!-- findResultInfo -->

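I would like to extract the following from the code above:

SJSU Testing Testing Site

One Washington Square

Industrial Studies Building 228

San Jose, CA 95112

Phone: (408) 924-5980

Email: testing-office@sjsu.edu

Fee for two hour exam: $40.00

What methods are there for extracting this data from the code automatically?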

1 answer:

Answer 0: (score: 2)

With XPath I would use this expression:

//*/text()
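
As a rough illustration of the text-node idea, here is a minimal Python sketch. It does not run XPath; it swaps in the standard library's `html.parser` to collect text nodes, which is only an approximation of what `//*/text()` selects. Two details of the sample matter: `//*/text()` would also match whitespace-only nodes and the contents of `<script>`, and the email address exists only as a string inside the script, so the sketch skips `script`/`noscript` text and pulls the address out with a regular expression. The names `TextNodeCollector`, `extract_fields`, and `SAMPLE` are made up for this example, `SAMPLE` is a trimmed copy of the markup in the question, and reading the `alt` attribute to get "Testing Site" is an assumption about where that label should come from. On the full markup, link labels such as "Schedule Exam" would come through as well and would need to be filtered.

```python
import re
from html.parser import HTMLParser

# Trimmed copy of the markup from the question (illustration only).
SAMPLE = """
<div class='findResultInfo'>
  <a href='javascript:examMapManagerHandle.clickMarker(7,0);' >
      SJSU Testing
      <img class='userType' border='0' src="/system/themes/asp/img/org.png" alt='Testing Site' title='Testing Site'/>
  </a>
  <br />
  One Washington Square<br />
  Industrial Studies Building 228<br />
  San Jose, CA  95112<br />
  Phone: (408) 924-5980<br />
  Email: <span id="_smarty_mailto_span_2096382943_1423929156_8">&nbsp;</span>
  <noscript>To see email address, enable javascript</noscript>
  <script type="text/javascript">var mailto=document.getElementById("_smarty_mailto_span_2096382943_1423929156_8");
     mailto.innerHTML='<a href="mailto:testing-office@sjsu.edu" >testing-office@sjsu.edu</a>';</script><br />
  Fee for two hour exam:

  $40.00
  <br />
</div><!-- findResultInfo -->
"""


class TextNodeCollector(HTMLParser):
    """Collect non-empty text nodes, skipping script/noscript/style content."""

    SKIP = {"script", "noscript", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.texts = []    # visible text, whitespace-normalized
        self.scripts = []  # raw script chunks, kept for the email lookup
        self._stack = []   # currently open SKIP elements

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._stack.append(tag)
        elif tag == "img":
            # Assumption: the "Testing Site" label is taken from the alt text.
            alt = dict(attrs).get("alt")
            if alt and not self._stack:
                self.texts.append(alt)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        if self._stack:
            if self._stack[-1] == "script":
                self.scripts.append(data)
            return
        text = " ".join(data.split())  # also collapses &nbsp; (U+00A0)
        if text:
            self.texts.append(text)


def extract_fields(html):
    """Return the visible text lines, with the email filled in from the script."""
    parser = TextNodeCollector()
    parser.feed(html)
    parser.close()
    match = re.search(r"mailto:([^\"'\s>]+)", "".join(parser.scripts))
    email = match.group(1) if match else ""
    # Attach the address to the bare "Email:" label when both were found.
    return [f"{t} {email}" if t == "Email:" and email else t
            for t in parser.texts]


for line in extract_fields(SAMPLE):
    print(line)
```

If a real XPath engine is available, the same filtering idea should still apply: drop whitespace-only nodes, exclude script text, and handle the email separately, since it is never a plain text node in the static markup.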