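Scraping a table generated by a script

Asked: 2018-05-27 12:29:59

Tags: web-scraping beautifulsoup

I have been trying to scrape a table from a website using Python and Beautiful Soup. The problem I am running into is that the table is generated by a script, so it looks like this: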

<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<!DOCTYPE html>
    <html>
    <head>
    <script src="http://ajax.googleapis.com/ajax/libs/jquery/1.10.1/jquery.min.js"></script>
    <meta charset=utf-8 />
    <title>JS Bin</title>
     </head>
    <body>
    <select id="ddlCategory">
                        <option>Choose</option>
                        <option value="Value1">Value1</option>
                        <option value="Value2">Value 2</option>
                        <option value="Others">Others</option>
                    </select>
    <table>
            <tr>
                <td>
                    <asp:Label ID="Label1" runat="server" Text="Category: " />
                </td>
                <td>
                    
                </td>
            </tr>
    
            <tr id="Other" style="display: none">
              <table>

              <tbody>
              <tr>
                <td>
                    <input id="txtOthers" type="text" runat="server" />
                </td>
              </tr>
                <tr>
                <td>
                    <input id="txtOthers" type="text" runat="server" />
                </td>
              </tr>
               <tr>
                <td>
                    <input id="txtOthers" type="text" runat="server" />
                </td>
              </tr>
          </tbody>      
       </table>

             
            </tr>
        </table>
    </body>
    </html>

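Does anyone know whether it is still possible to scrape this table? There is a script tag right before the table, and I am wondering whether it could be of any use.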

<table class="table table-compact table-striped table-topics">
            <thead>
                <tr>
                    <th data-intro="Clicking a topic will allow you to view and ask general technical questions about the topic through SITIS." data-position="bottom">Topic #</th>
                    <th>Program</th>
                    <th>Component</th>
                    <th>Technology Area</th>
                    <th>Title</th>
                    <th data-intro="If there is SITIS activity for a topic a clickable 'QA' will appear in this column." data-position="bottom">SITIS</th>
                </tr>
            </thead>
            <tbody>
                {{#each this.Results}}
                <tr>
                    <td><a href="/topics?topicId={{this.TopicId}}" target="_blank" data-topicid="{{this.TopicId}}">{{this.TopicNumber}}</a></td>
                    <td>{{this.ProgramTypeName}}</td>
                    <td>{{this.AgencyName}}</td>
                    <td>

                      <div class="icons">
                        {{#if this.TechAreaAirPlatform}}
                          <i class="glyph-icon flaticon-air-platform" data-toggle="tooltip" title="Technology Area: Air Platform"></i>
                        {{/if}}
                        {{#if this.TechAreaChemBioDefense }}
                          <i class="glyph-icon flaticon-chem-bio-defense" data-toggle="tooltip" title="Technology Area: Chem Bio Defense"></i>
                        {{/if}}
                        {{#if this.TechAreaInfoSystems}}
                          <i class="glyph-icon flaticon-info-systems" data-toggle="tooltip" title="Technology Area: Info Systems"></i>
                        {{/if}}
                        {{#if this.TechAreaGroundSea }}
                          <i class="glyph-icon flaticon-ground-sea" data-toggle="tooltip" title="Technology Area: Ground Sea"></i>
                        {{/if}}
                        {{#if this.TechAreaMaterials}}
                          <i class="glyph-icon flaticon-materials" data-toggle="tooltip" title="Technology Area: Materials"></i>
                        {{/if}}
                        {{#if this.TechAreaBioMedical }}
                          <i class="glyph-icon flaticon-bio-med" data-toggle="tooltip" title="Technology Area: Bio Medical"></i>
                        {{/if}}
                        {{#if this.TechAreaSensors }}
                          <i class="glyph-icon flaticon-sensors" data-toggle="tooltip" title="Technology Area: Sensors"></i>
                        {{/if}}
                        {{#if this.TechAreaElectronics }}
                          <i class="glyph-icon flaticon-electronics" data-toggle="tooltip" title="Technology Area: Electronics"></i>
                        {{/if}}
                        {{#if this.TechAreaBattlespace }}
                          <i class="glyph-icon flaticon-battlespace" data-toggle="tooltip" title="Technology Area: Battlespace"></i>
                        {{/if}}
                        {{#if this.TechAreaSpacePlatforms }}
                          <i class="glyph-icon flaticon-space-platform" data-toggle="tooltip" title="Technology Area: Space Platforms"></i>
                        {{/if}}
                          {{#if this.TechAreaHumanSystems }}
                          <i class="glyph-icon flaticon-human-systems" data-toggle="tooltip" title="Technology Area: Human Systems"></i>
                        {{/if}}
                        {{#if this.TechAreaWeapons }} 
                          <i class="glyph-icon flaticon-weapons" data-toggle="tooltip" title="Technology Area: Weapons"></i>
                        {{/if}}
                        {{#if this.TechAreaNuclear }}
                          <i class="glyph-icon flaticon-nuclear" data-toggle="tooltip" title="Technology Area: Nuclear"></i>
                        {{/if}}
                      </div>
                    </td>
                    <td><a href="/topics?topicId={{this.TopicId}}" target="_blank" data-topicid="{{this.TopicId}}">{{this.TopicTitle}}</a></td>
                    <td>{{#if this.PublishedQuestionCount}}<a href="/topics?topicId={{this.TopicId}}" target="_blank" data-topicid="{{this.TopicId}}">Q&A</a>{{/if}}</td>
                </tr>
                {{else}}
                <tr>
                    <td colspan="6"><div class="alert alert-warning">No topics were found.</div></td>
                </tr>
                {{/each}}
            </tbody>
        </table>

Thanks in advance!

1 answer:

Answer 0: (score: 0)

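The suggestion in the comments to use Selenium WebDriver is probably the easiest way to solve this. It looks like you are trying to scrape a site that builds its content dynamically from a template: the {{#each}} and {{#if}} blocks in your table look like Handlebars or something similar, which the browser fills in after the page loads.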
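A quick way to confirm that your current script is receiving the unrendered template rather than the finished table is to look for leftover {{...}} placeholders in the HTML it downloads. This is only a minimal sketch; the helper name find_placeholders is made up for illustration.

```python
import re

# Matches template placeholders such as {{this.TopicId}}, {{#each ...}} or {{/if}}.
PLACEHOLDER = re.compile(r"\{\{[^{}]+\}\}")


def find_placeholders(html):
    """Return every raw {{...}} placeholder still present in the HTML text."""
    return PLACEHOLDER.findall(html)


if __name__ == "__main__":
    sample = '<td>{{this.ProgramTypeName}}</td>'
    # A non-empty list means the template has not been rendered yet.
    print(find_placeholders(sample))
```

If the list comes back empty for the HTML you fetched, the table was already rendered and plain Beautiful Soup should be enough.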

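To get the finished table, you need to simulate a browser that actually loads everything on the page, because right now you are only getting the static HTML. You can install selenium with a package manager, and then you need to install a driver for the browser you want to simulate: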

pip install selenium
pip install chromedriver

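Note: as far as I know, not every web driver can be installed with a package manager, so you may have to download the driver manually instead.

Now you can use something like this function I wrote to scrape the page you want: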

# purpose: a function which takes a url and returns the rendered page as a string
# depends on selenium webdriver to run the page's scripts, plus time and BeautifulSoup
# signature: pull_html_page(url: string, write: optional boolean) -> string
import time

from bs4 import BeautifulSoup
from selenium import webdriver


def pull_html_page(url, write=False):
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # crude fixed wait to give the page's scripts time to fill in the table
        time.sleep(1)
        content = driver.page_source
    finally:
        # always close the browser, even if loading the page fails
        driver.quit()

    if write:
        # save a pretty-printed copy as out.html in the current directory
        clean = BeautifulSoup(content, "html.parser").prettify()
        with open("out.html", "w", encoding="utf-8") as f:
            f.write(clean)

    return content
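Once the function above has returned the rendered HTML, Beautiful Soup can read the table as usual. The following is a rough sketch that assumes the rendered table keeps the table-topics class and the six-column layout shown in the question; the function name extract_topics and the dictionary keys are my own choices.

```python
from bs4 import BeautifulSoup


def extract_topics(html):
    """Return one dict per data row of the rendered topics table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="table-topics")
    if table is None:
        return []

    topics = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        # Skip the header row (th cells only) and the single-cell
        # "No topics were found." row.
        if len(cells) < 6:
            continue
        link = cells[0].find("a")
        # Technology areas are rendered as icons; the name is in the title attribute.
        areas = [
            icon["title"].replace("Technology Area: ", "")
            for icon in cells[3].find_all("i", title=True)
        ]
        topics.append({
            "topic_number": cells[0].get_text(strip=True),
            "topic_id": link.get("data-topicid") if link else None,
            "program": cells[1].get_text(strip=True),
            "component": cells[2].get_text(strip=True),
            "technology_areas": areas,
            "title": cells[4].get_text(strip=True),
        })
    return topics
```

It would be called as extract_topics(pull_html_page(url)), where url is the address of the page you are scraping.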

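If this solution is not efficient enough for you, or if you only need the dynamically generated data and not the static HTML, you can usually use the browser's inspector (I prefer the one in Chrome) to watch the network traffic. Sometimes you will see a URL that returns a JSON response. Requesting that URL directly saves you the time of loading the whole page, and you can pull the data straight out of the response.

Good luck!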
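As a sketch of that second approach: the field names below come from the placeholders in the question's template, and the top-level "Results" key is a guess based on {{#each this.Results}}. The real response may be shaped differently, and api_url stands for whatever URL you find in the network tab.

```python
import json
from urllib.request import urlopen

# Field names taken from the placeholders in the question's template.
FIELDS = ("TopicId", "TopicNumber", "ProgramTypeName", "AgencyName", "TopicTitle")


def topics_from_payload(payload):
    """Pick the table's fields out of a decoded JSON response.

    Assumes a top-level "Results" list; adjust this to the actual response.
    """
    return [
        {name: item.get(name) for name in FIELDS}
        for item in payload.get("Results", [])
    ]


def fetch_topics(api_url):
    """Download and decode the JSON endpoint found in the network tab."""
    with urlopen(api_url) as response:
        payload = json.load(response)
    return topics_from_payload(payload)
```

Because this never starts a browser, it is usually much faster than the Selenium route, but it depends on the site exposing such an endpoint.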