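The following HTML contains attributes that vary from series to series. For example, a single series can be published in several units, such as millions or thousands.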
<tr class="series-pager-title">
<td valign="top" colspan="2">
<div class="col-xs-12 col-sm-10">
<a href="/series/TOTALSA" style="font-size:1.2em" class="series-title">Total Vehicle Sales</a>
</div>
<div class="hidden-xs col-sm-2">
<span style="padding-left:49px;" class="popularity_bar"> </span> <span class="popularity_bar_background"> </span>
</div>
</td>
</tr>
<tr class="series-pager-attr">
<td colspan="2">
<div class="series-meta series-group-meta">
<span class="attributes">Monthly</span>
<br class="clear">
</div>
<div class="series-meta">
<input class="pager-item-checkbox" type="checkbox" name="sids[0]" value="TOTALSA">
<a href="/series/TOTALSA">
Millions of Units,
Seasonally Adjusted Annual Rate
</a>
<span class="series-meta-dates">
Jan 1976
to
Jul 2017
(4 days ago)
</span>
<br class="clear">
<input class="pager-item-checkbox" type="checkbox" name="sids[1]" value="TOTALNSA">
<a href="/series/TOTALNSA">
Thousands of Units,
Not Seasonally Adjusted
</a>
<span class="series-meta-dates">
Jan 1976
to
Jul 2017
(4 days ago)
</span>
</div>
</td>
</tr>
<tr><td colspan="2" style="font-size:9px"> </td></tr>
<tr class="series-pager-title">
<td valign="top" colspan="2">
<div class="col-xs-12 col-sm-10">
<a href="/series/ALTSALES" style="font-size:1.2em" class="series-title">Light Weight Vehicle Sales: Autos and Light Trucks</a>
</div>
<div class="hidden-xs col-sm-2">
<span style="padding-left:46px;" class="popularity_bar"> </span> <span class="popularity_bar_background"> </span>
</div>
</td>
</tr>
<tr class="series-pager-attr">
<td colspan="2">
<div class="series-meta series-group-single">
<input class="pager-item-checkbox" type="checkbox" name="sids[2]" value="ALTSALES">
<span class="attributes" style="width:350px;">Millions of Units, Monthly, Seasonally Adjusted Annual Rate</span><span class="series-meta-dates">Jan 1976 to Jul 2017 (4 days ago)</span>
<br class="clear">
</div>
<a href="/series/ALTSALES">
</a>
</td>
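The code below gets me fairly close, but it fails to pick up the second variant of "Total Vehicle Sales"; it only returns the first one, "Millions of Units, Seasonally Adjusted Annual Rate." Beyond that problem, I suspect my current query will also pair titles with the wrong attributes. The code I have so far: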
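from bs4 import BeautifulSoup
from selenium import webdriver

# Raw string so the backslashes in the Windows path are not read as escapes.
# Note: executable_path is deprecated in Selenium 4.
browser = webdriver.Chrome(executable_path=r'F:\Anaconda\chromedriver\chromedriver_win32\chromedriver.exe')
browser.get('https://fred.stlouisfed.org/categories/32993')
soup = BeautifulSoup(browser.page_source, 'lxml')
for tbody in soup.find_all('tbody'):
    series_data = tbody.find_all('tr', attrs={'class': 'series-pager-title'})
    attrs_data = tbody.find_all('tr', attrs={'class': 'series-pager-attr'})
    series_count = len(series_data)
    print(series_count)
    print(len(attrs_data))
    for m in range(series_count):
        # find() returns only the first matching <a>, so only the first variant is printed
        title = series_data[m].find('a', href=True).text
        first_variant = attrs_data[m].find('a', href=True).text
        print(title + ' | ' + ' '.join(first_variant.split()))  # collapse whitespace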
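Can anyone help me adjust the query above so that it produces the desired result?
Answer 0 (score: 0)
If anyone comes across a better solution, I'm all ears... In the meantime, this seems to do the trick: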
import re
import pandas as pd
from bs4 import BeautifulSoup

# `browser` is the Selenium webdriver instance created in the question
browser.get('https://fred.stlouisfed.org/categories/32993')
soup = BeautifulSoup(browser.page_source, 'lxml')
# Direct <tr> children of the first <tbody>; this also skips whitespace text nodes
rows = soup.tbody.find_all('tr', recursive=False)
series_rows = []      # one dict per series title
sub_series_rows = []  # one dict per link found in an attribute row
series_index = 0
for row in rows:
    if row.find('a', attrs={'class': 'series-title'}):
        # A title row starts a new series
        series_index += 1
        series_rows.append({'series_index': series_index,
                            'series_title': row.text.strip(),
                            'series_href': row.find('a', href=True).attrs['href']})
    if row.find('div', attrs={'class': 'series-meta'}):
        # An attribute row belongs to the most recent title row
        frequency = row.find('span', attrs={'class': 'attributes'}).text.strip()
        for link in row.find_all('a', href=True):
            # Collapse newlines and repeated spaces in the link text
            units = re.sub(' +', ' ', re.sub('\n', ' ', link.text)).strip()
            sub_series_rows.append({'series_index': series_index,
                                    'frequency': frequency,
                                    'sub_series_units': units,
                                    'sub_series_href': 'https://fred.stlouisfed.org' + link.attrs['href']})
# DataFrame.append was removed in pandas 2.0, so build each frame from a list of dicts
series_data = pd.DataFrame(series_rows, columns=['series_index', 'series_title', 'series_href'])
sub_series_data = pd.DataFrame(sub_series_rows, columns=['series_index', 'frequency', 'sub_series_units', 'sub_series_href'])
print(series_data)
print(sub_series_data)
combine_series_data = pd.merge(series_data, sub_series_data, how='left', on=['series_index'])
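For anyone who wants to try the row-pairing idea without opening a browser, here is a minimal offline sketch. It is not the code above, just an illustration under assumptions: `SAMPLE_HTML` is a trimmed copy of the markup from the question, `extract_variants` is a made-up helper name, and the live page is assumed to keep each `series-pager-attr` row after its `series-pager-title` row. It uses `find_all` to collect every link in an attribute row, which is the step the question's `find` call was missing.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup shown in the question (assumed representative of the page)
SAMPLE_HTML = """
<table><tbody>
<tr class="series-pager-title"><td>
  <a href="/series/TOTALSA" class="series-title">Total Vehicle Sales</a>
</td></tr>
<tr class="series-pager-attr"><td>
  <div class="series-meta series-group-meta"><span class="attributes">Monthly</span></div>
  <div class="series-meta">
    <a href="/series/TOTALSA">
      Millions of Units,
      Seasonally Adjusted Annual Rate
    </a>
    <a href="/series/TOTALNSA">
      Thousands of Units,
      Not Seasonally Adjusted
    </a>
  </div>
</td></tr>
<tr class="series-pager-title"><td>
  <a href="/series/ALTSALES" class="series-title">Light Weight Vehicle Sales: Autos and Light Trucks</a>
</td></tr>
<tr class="series-pager-attr"><td>
  <div class="series-meta series-group-single">
    <span class="attributes">Millions of Units, Monthly, Seasonally Adjusted Annual Rate</span>
  </div>
  <a href="/series/ALTSALES">
  </a>
</td></tr>
</tbody></table>
"""


def extract_variants(html):
    """Hypothetical helper: one (title, series_id, attributes, units) tuple per link."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for title_row in soup.find_all("tr", class_="series-pager-title"):
        title = title_row.find("a", class_="series-title").get_text(strip=True)
        # Assumption: the matching attribute row is the next such sibling row
        attr_row = title_row.find_next_sibling("tr", class_="series-pager-attr")
        if attr_row is None:
            continue
        attributes = attr_row.find("span", class_="attributes").get_text(strip=True)
        # find_all (not find) so that every variant of the series is collected
        for link in attr_row.find_all("a", href=True):
            series_id = link["href"].rsplit("/", 1)[-1]
            units = " ".join(link.get_text().split())  # empty for single-variant series
            results.append((title, series_id, attributes, units))
    return results


rows = extract_variants(SAMPLE_HTML)
for row in rows:
    print(row)
```

Note that for a single-variant series such as ALTSALES the link text is empty and the full description sits in the `attributes` span, so the last field comes back as an empty string; whether to copy the span text into it is a design choice left open here.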