Finding elements in HTML using Python

Date: 2018-05-07 15:14:32

Tags: python beautifulsoup

I need to extract these values from the HTML code:

2018-04-01
1,500,552
7,211
3,710

I have used find_all before, but my problem is that I don't know how to locate these elements in this particular HTML.

This is my code:

from bs4 import BeautifulSoup as Soup
import requests

print 'Fecha Inicio ej:2018-04-01'
start = raw_input()
print 'Fecha Fin ej:2018-04-01'
end = raw_input()
glob2 = []

urls = ['http://url.com/rtbpartners/report.php?partner=id&date_from={}&date_to={}&interval=daily'.format(start, end)]
for item in urls:
    data = requests.get(item)
    data = data.text
    print data
    soup = Soup(data, "html.parser")
    print soup.find_all('tr')

Sample HTML:

<!DOCTYPE html>
<html>
<head>
<link rel="icon" href="../images/favicon.ico" type="image/x-icon">
<title>AdMedia Online Ad Network | Affiliate Advertising Solutions</title>
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<!-- Bootstrap -->
<link href="../css/admedia_styles.css" rel="stylesheet" media="screen">
<link href="../css/admedia_content_styles.css" rel="stylesheet" media="screen">
<link href="../css/chosen.css" rel="stylesheet" media="screen">

<style type="text/css">
<!--
.style1 {color: #00FF00}
-->
</style>
<link rel="stylesheet" href="http://code.jquery.com/ui/1.10.2/themes/smoothness/jquery-ui.css" />
<script src="http://code.jquery.com/jquery-1.9.1.js"></script>
<script src="http://code.jquery.com/ui/1.10.2/jquery-ui.js"></script>
<script language="javascript">
$(function() {
    //restricting min date to 2012-09-25 to use the new report table
    $( "#datepicker1" ).datepicker({ dateFormat: 'yy-mm-dd', minDate: (new Date(2012, 09-1, 25)) });
    $( "#datepicker2" ).datepicker({ dateFormat: 'yy-mm-dd', minDate: (new Date(2012, 09-1, 25)) });
});

</script>

<!--[if IE]>
<script type="text/javascript">
document.createElement("article");
document.createElement("nav");
document.createElement("section");
document.createElement("header");
document.createElement("aside");
document.createElement("figure");
document.createElement("legend");
document.createElement("footer");
</script>
<![endif] -->

</head>

<body  >


<a name="top"></a>

  <header id="main-header">
    <div class="container">
      <a href="http://admedia.com" class="admedia-logo-wrapper"><span class="admedia-logo"><span class="admedia-logo-text">a</span><span class="admedia-logo-dot">d</span></span></a>
      <!--
      <ul class="top-right-links clearfix">
        <li class="first"><a class="call-link" href="tel:18002967104"><span class="admedia-icon icon-phone" aria-hidden="true"></span><span class="text">Call: (800) 296-7104</span></a></li>
        <li class="hidden-phone">&nbsp;&nbsp;|&nbsp;&nbsp;</li>
        <li>
          <a href="/contact-us/" class="contact-link"><span class="admedia-icon icon-bubbles" aria-hidden="true"></span><span class="text">Contact Us</span></a>
          <ul class="main-sub-navigation">
            <li><a href="/contact-us/">Contact</a></li>
            <li><a href="/contact-us/support_ticket/">Support Ticket</a></li>
            <li><a href="http://help.admedia.com">Help Center</a></li>
          </ul>
        </li>
      </ul>-->


      <a id="main-navigation-dropdown-toggle" href="#">
        <span class="icon-navigation" aria-hidden="true"></span>
      </a>

      <!--<div class=" scroll-hint scroll-hint-main"></div>-->

      <a href="#" class="scroll-hint main-scroll-hint scroll-hint-main-top-arrow"><span class="scroll-hint-icon icon-chevron-sign-up" aria-hidden="true"></span></a>  
      <a href="#" class="scroll-hint main-scroll-hint scroll-hint-main-bottom-arrow"><span class="scroll-hint-icon icon-chevron-sign-down" aria-hidden="true"></span></a>


    </div>
  </header>
  <div style="margin-top: 35px; margin-left: 20px;">
    <h2>RTB DSP Stats</h2>
    <br>
    <form name="stats" method="get" action="/rtbpartners/report.php">
        <input type="hidden" name="partner" value="empresa">
        <input type="hidden" name="key" value="key">
        <table border='0' cellpadding='15' cellspacing='10'>
        <tr>
            <td>Date: </td>
            <td><input type="text" style="width:80px" name="date_from" id="datepicker1" value="2018-04-01"> to <input type="text" style="width:80px" name="date_to" id="datepicker2" value="2018-04-01">&nbsp;&nbsp;&nbsp;</td>
        </tr>
        <tr>
            <td>Select Interval: </td>
            <td>
                <select name="interval">
                    <option value="daily" selected>Daily</option>
                    <option value="hourly" >Hourly</option>
                </select>
            </td>
        </tr>
        <tr>
            <td colspan="2"><input type="submit" value="Update"></td>
        </tr>
        </table>        
    </form>
    <br><br>
            <table width="100%" class="sortable">
        <thead>
            <tr>
              <th style="padding-left: 5px; padding-right: 5px; border-bottom: 1px solid #CDCDCD; background-color: #A9C0C2;" align="left">
                <b>Date</b>
              </th>

              <th style="padding-left: 5px; padding-right: 5px; border-bottom: 1px solid #CDCDCD; background-color: #A9C0C2;" align="left">
                <b>Requests</b>
              </th>
              <th style="padding-left: 5px; padding-right: 5px; border-bottom: 1px solid #CDCDCD; background-color: #A9C0C2;" align="left">
                <b>Responses</b>
              </th>
              <th style="padding-left: 5px; padding-right: 5px; border-bottom: 1px solid #CDCDCD; background-color: #A9C0C2;" align="left">
                <b>Impressions</b>
              </th>
              <th style="padding-left: 5px; padding-right: 5px; border-bottom: 1px solid #CDCDCD; background-color: #A9C0C2;" align="left">
                <b>Spend</b>
              </th>
            </tr>
        </thead>
        <tbody>
                    <tr align=center>
              <td style="padding: 2px; border-bottom: 1px solid #CDCDCD; background-color: #;" align="left">
              2018-04-01              </td>
              <td style="padding: 2px; border-bottom: 1px solid #CDCDCD; background-color: #;" align="left">
                1,500,552             </td>
              <td style="padding: 2px; border-bottom: 1px solid #CDCDCD; background-color: #;" align="left">
                7,211             </td>
              <td style="padding: 2px; border-bottom: 1px solid #CDCDCD; background-color: #;" align="left">
                3,710             </td>
              <td style="padding: 2px; border-bottom: 1px solid #CDCDCD; background-color: #;" align="left">
                1.43              </td>

            </tr>
                    </tbody>
            <tfoot>
            <tr>
                <td style="padding-left: 5px; padding-right: 20px; border-bottom: 1px solid #CDCDCD; background-color: #A9C0C2;"
                    align="right">
                    <b>Total:</b>
                </td>
                <td style="padding-left: 5px; padding-right: 5px; border-bottom: 1px solid #CDCDCD; background-color: #A9C0C2;"
                    align="left">
                    <b>1,500,552</b>
                </td>
                <td style="padding-left: 5px; padding-right: 5px; border-bottom: 1px solid #CDCDCD; background-color: #A9C0C2;"
                    align="left">
                    <b>7,211</b>
                </td>
                <td style="padding-left: 5px; padding-right: 5px; border-bottom: 1px solid #CDCDCD; background-color: #A9C0C2;"
                    align="left">
                    <b>3,710</b>
                </td>
                <td style="padding-left: 5px; padding-right: 5px; border-bottom: 1px solid #CDCDCD; background-color: #A9C0C2;"
                    align="left">
                    <b>1.43</b>
                </td>
            </tr>
            </tfoot>
        </table>
      </div>

</body>
</html>

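2 answers:

Answer 0 (score: 0)

You never retrieve any elements before the for loop, so the loop has nothing to search through. I suggest calling find_all() before the for loop and keeping its result. Then add further for loops to walk through all the tags and find the specific ones you are looking for. Include some if statements, such as: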

if tag.name == "td":
    pass  # (code here)
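
A minimal sketch of the approach described above, assuming the page HTML is already held in a string. The `html` value here is a shortened, hypothetical stand-in for the report page, and the variable names are illustrative only. It retrieves every tag once with find_all() and then filters with an if statement:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the report page (assumption: same table layout).
html = """
<table class="sortable">
  <thead><tr><th><b>Date</b></th><th><b>Requests</b></th></tr></thead>
  <tbody><tr><td> 2018-04-01 </td><td> 1,500,552 </td></tr></tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Retrieve all tags first, then walk them and keep only the <td> cells.
cells = []
for tag in soup.find_all(True):
    if tag.name == "td":
        cells.append(tag.get_text(strip=True))

print(cells)
```

Calling soup.find_all("td") directly would give the same cells without the if statement; the explicit loop is shown only because it mirrors the suggestion above.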

I also suggest looking at lxml, which lets you use XPath to find specific items on a web page.
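
As a rough illustration of the lxml suggestion, the sketch below selects the body cells with an XPath expression. It assumes lxml is installed; the `page` string is a shortened, hypothetical stand-in for the real report, and the XPath `//tbody/tr/td` is one possible choice rather than the only one:

```python
from lxml import html as lxml_html

# Shortened stand-in for the report page (assumption: same table layout).
page = """
<html><body>
<table class="sortable">
  <thead><tr><th><b>Date</b></th><th><b>Requests</b></th></tr></thead>
  <tbody><tr><td> 2018-04-01 </td><td> 1,500,552 </td></tr></tbody>
</table>
</body></html>
"""

tree = lxml_html.fromstring(page)

# Select every <td> that sits in a row directly under a <tbody>.
values = [td.text_content().strip() for td in tree.xpath("//tbody/tr/td")]

print(values)
```

Because the expression requires a tbody parent, header cells (in thead) are skipped.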

Answer 1 (score: 0)

Something like this should work:

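import bs4

# `content` is assumed to hold the page HTML, e.g. requests.get(url).text
soup = bs4.BeautifulSoup(content, 'lxml')

# Keep only rows inside <tbody>, which skips the header and footer rows.
for table_row in soup.find_all(name="tr"):
    if table_row.parent.name == "tbody":
        for cell in table_row.find_all("td"):
            print(cell.getText().strip())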

It uses BeautifulSoup4 on Python 3, but you should have no trouble running it on Python 2.

Result:

2018-04-01
1,500,552
7,211
3,710
1.43
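
The question asks only for the date and the three counts, while the output above also includes the Spend column. A possible variation, sketched under the assumption that the columns always appear in the order shown in the sample HTML (Date, Requests, Responses, Impressions, Spend), keeps the first four cells of each body row and converts the counts to integers. The `html` string is a shortened, hypothetical stand-in for the real page:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the report page (assumption: same column order).
html = """
<table width="100%" class="sortable">
  <thead><tr><th><b>Date</b></th><th><b>Requests</b></th>
  <th><b>Responses</b></th><th><b>Impressions</b></th><th><b>Spend</b></th></tr></thead>
  <tbody><tr align=center>
    <td> 2018-04-01 </td><td> 1,500,552 </td><td> 7,211 </td>
    <td> 3,710 </td><td> 1.43 </td>
  </tr></tbody>
  <tfoot><tr><td><b>Total:</b></td><td><b>1,500,552</b></td></tr></tfoot>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
# The CSS selector limits the search to body rows of the stats table.
for tr in soup.select("table.sortable tbody tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    date, req, resp, impr = cells[:4]
    # Strip thousands separators before converting the counts.
    rows.append((date,) + tuple(int(v.replace(",", "")) for v in (req, resp, impr)))

print(rows)
```

Selecting by the table's class also avoids picking up rows from the date-picker form table earlier in the page.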