Python Beautiful Soup web scraping: handling dynamic values

Date: 2017-08-08 06:52:08

Tags: python web-scraping beautifulsoup

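The HTML below contains a set of attributes that varies from series to series. For example, a single series can be offered in several units, such as millions or thousands.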



    <tr class="series-pager-title">

<td valign="top" colspan="2">
    <div class="col-xs-12 col-sm-10">
            <a href="/series/TOTALSA" style="font-size:1.2em" class="series-title">Total Vehicle Sales</a>
        </div>
    <div class="hidden-xs col-sm-2">
     <span style="padding-left:49px;" class="popularity_bar">&nbsp;</span>     <span class="popularity_bar_background">&nbsp;</span>
    </div>
</td>
</tr>

<tr class="series-pager-attr">

<td colspan="2">
      <div class="series-meta series-group-meta">
  <span class="attributes">Monthly</span>
  <br class="clear">
  </div>
  <div class="series-meta">
    
      <input class="pager-item-checkbox" type="checkbox" name="sids[0]" value="TOTALSA">
    
        <a href="/series/TOTALSA">
            Millions of Units,       
  
        Seasonally Adjusted Annual Rate 
      
  </a>
    <span class="series-meta-dates">
    Jan 1976
   to 
    Jul 2017
    
    (4 days ago)
    </span>
    <br class="clear">
  
            
    
      <input class="pager-item-checkbox" type="checkbox" name="sids[1]" value="TOTALNSA">
    
        <a href="/series/TOTALNSA">
            Thousands of Units,       
  
        Not Seasonally Adjusted 
      
  </a>
    <span class="series-meta-dates">
    Jan 1976
   to 
    Jul 2017
    
    (4 days ago)
    </span>
    
  </div>
</td>
</tr>
<tr><td colspan="2" style="font-size:9px">&nbsp;</td></tr>

            
    <tr class="series-pager-title">

<td valign="top" colspan="2">
    <div class="col-xs-12 col-sm-10">
            <a href="/series/ALTSALES" style="font-size:1.2em" class="series-title">Light Weight Vehicle Sales: Autos and Light Trucks</a>
        </div>
    <div class="hidden-xs col-sm-2">
     <span style="padding-left:46px;" class="popularity_bar">&nbsp;</span>     <span class="popularity_bar_background">&nbsp;</span>
    </div>
</td>
</tr>

<tr class="series-pager-attr">

<td colspan="2">
      <div class="series-meta series-group-single">
    <input class="pager-item-checkbox" type="checkbox" name="sids[2]" value="ALTSALES">
   
  <span class="attributes" style="width:350px;">Millions of Units, Monthly, Seasonally Adjusted Annual Rate</span><span class="series-meta-dates">Jan 1976 to Jul 2017 (4 days ago)</span>
  <br class="clear">
  </div>
    
          <a href="/series/ALTSALES">
      
  
  
  </a>
      
</td>

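The code below gets me somewhat close, but it fails to pick up the second variant for "Total Vehicle Sales"; it only returns the first one, "Millions of Units, Seasonally Adjusted Annual Rate." Beyond that problem, I suspect my current query will also match titles to the wrong attributes. The code I have so far: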

    from bs4 import BeautifulSoup
    from selenium import webdriver

    # Raw string so the backslashes in the Windows path are not read as escapes.
    browser = webdriver.Chrome(executable_path=r'F:\Anaconda\chromedriver\chromedriver_win32\chromedriver.exe')
    browser.get('https://fred.stlouisfed.org/categories/32993')
    soup = BeautifulSoup(browser.page_source, 'lxml')

    for tbody in soup.find_all('tbody'):
        series_data = tbody.find_all('tr', attrs={'class': 'series-pager-title'})
        attrs_data = tbody.find_all('tr', attrs={'class': 'series-pager-attr'})
        series_count = len(series_data)
        print(series_count)
        print(len(attrs_data))
        for m in range(series_count):
            # Prints the series title and the text of the first link in the attribute row at the same position.
            print(series_data[m].find('a', href=True).text + '  |  ' + attrs_data[m].find('a', href=True).text.strip().replace('  ', ' '))

Can anyone help me adapt the query above to produce the desired result?

[Image: desired result]

1 answer:

Answer 0 (score: 0)

If anyone comes across a better solution, I'm all ears. In the meantime, this seems to do the trick:

    import re

    import pandas as pd
    from bs4 import BeautifulSoup

    # `browser` is the Selenium webdriver created in the question.
    browser.get('https://fred.stlouisfed.org/categories/32993')
    soup = BeautifulSoup(browser.page_source, 'lxml')

    # Direct <tr> children of the first <tbody>; this skips whitespace-only text nodes.
    rows = soup.tbody.find_all('tr', recursive=False)

    series_rows = []
    sub_series_rows = []
    series_index = 0
    for child in rows:
        # A title row starts a new series.
        if child.find('a', attrs={'class': 'series-title'}):
            series_index += 1
            series_rows.append({'series_index': series_index,
                                'series_title': child.text.strip(),
                                'series_href': child.find('a', href=True).attrs['href']})

        # An attribute row holds one link per variant of the current series.
        if child.find('div', attrs={'class': 'series-meta'}):
            frequency = child.find('span', attrs={'class': 'attributes'}).text.strip()

            for link in child.find_all('a', href=True):
                sub_series_rows.append({
                    'series_index': series_index,
                    'frequency': frequency,
                    # Collapse newlines and runs of spaces into single spaces.
                    'sub_series_units': re.sub(r'\s+', ' ', link.text).strip(),
                    'sub_series_href': 'https://fred.stlouisfed.org' + link.attrs['href']})

    # DataFrame.append was removed in pandas 2.0, so build each frame from a list of dicts.
    series_data = pd.DataFrame(series_rows, columns=['series_index', 'series_title', 'series_href'])
    sub_series_data = pd.DataFrame(sub_series_rows, columns=['series_index', 'frequency', 'sub_series_units', 'sub_series_href'])

    print(series_data)
    print(sub_series_data)

    combine_series_data = pd.merge(series_data, sub_series_data, how='left', on=['series_index'])
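
The question's code uses `find`, which returns only the first matching link, while the answer collects every link with `find_all`. Below is a minimal sketch of that difference that runs without Selenium, assuming a trimmed-down copy of the markup from the question. `SAMPLE_HTML`, `clean`, and `variants_per_series` are hypothetical names introduced for illustration, and the sketch assumes each title row is followed by its own `series-pager-attr` row.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the markup from the question (one grouped series).
SAMPLE_HTML = """
<table><tbody>
<tr class="series-pager-title"><td>
  <a href="/series/TOTALSA" class="series-title">Total Vehicle Sales</a>
</td></tr>
<tr class="series-pager-attr"><td>
  <div class="series-meta series-group-meta">
    <span class="attributes">Monthly</span>
  </div>
  <div class="series-meta">
    <a href="/series/TOTALSA">Millions of Units,
        Seasonally Adjusted Annual Rate</a>
    <a href="/series/TOTALNSA">Thousands of Units,
        Not Seasonally Adjusted</a>
  </div>
</td></tr>
</tbody></table>
"""


def clean(text):
    """Collapse every run of whitespace into a single space."""
    return " ".join(text.split())


def variants_per_series(html):
    """Return a list of (series title, [variant descriptions]) tuples."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for title_row in soup.find_all("tr", class_="series-pager-title"):
        title = clean(title_row.find("a", class_="series-title").get_text())
        # Assumption: the attribute row is a later sibling of the title row.
        attr_row = title_row.find_next_sibling("tr", class_="series-pager-attr")
        links = attr_row.find_all("a", href=True) if attr_row else []
        # Drop links with no visible text (the single-variant layout has one).
        variants = [clean(a.get_text()) for a in links if clean(a.get_text())]
        results.append((title, variants))
    return results


if __name__ == "__main__":
    for title, variants in variants_per_series(SAMPLE_HTML):
        for variant in variants:
            print(title, "|", variant)
```

Pairing each title row with its own sibling, instead of indexing two parallel lists, keeps a title from being matched to another series' attributes if one row is missing.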