How do I remove HTML tags from a list created with BeautifulSoup?

Time: 2019-04-12 19:48:37

Tags: python html beautifulsoup

I am scraping the section headings from a Wikipedia page, for example: https://en.wikipedia.org/wiki/Sun

I collected all of the mw-headline spans:

titles = soup.find_all('span', {"class":"mw-headline"})

Now I want to turn the headings into a list and print it:

print(list(titles))

The result is a list that still contains all of the HTML markup:

[<span class="mw-headline" id="Name_and_etymology">Name and etymology</span>, <span class="mw-headline" id="General_characteristics">General characteristics</span>, <span class="mw-headline" id="Sunlight">Sunlight</span>, <span class="mw-headline" id="Composition">Composition</span>, <span class="mw-headline" id="Singly_ionized_iron-group_elements">Singly ionized iron-group elements</span>, <span class="mw-headline" id="Isotopic_composition">Isotopic composition</span>, <span class="mw-headline" id="Structure_and_fusion">Structure and fusion</span>, <span class="mw-headline" id="Core">Core</span>, <span class="mw-headline" id="Radiative_zone">Radiative zone</span>, <span class="mw-headline" id="Tachocline">Tachocline</span>, <span class="mw-headline" id="Convective_zone">Convective zone</span>, <span class="mw-headline" id="Photosphere">Photosphere</span>, <span class="mw-headline" id="Atmosphere">Atmosphere</span>, <span class="mw-headline" id="Photons_and_neutrinos">Photons and neutrinos</span>, <span class="mw-headline" id="Magnetic_activity">Magnetic activity</span>, <span class="mw-headline" id="Magnetic_field">Magnetic field</span>, <span class="mw-headline" id="Variation_in_activity">Variation in activity</span>, <span class="mw-headline" id="Long-term_change">Long-term change</span>, <span class="mw-headline" id="Life_phases">Life phases</span>, <span class="mw-headline" id="Formation">Formation</span>, <span class="mw-headline" id="Main_sequence">Main sequence</span>, <span class="mw-headline" id="After_core_hydrogen_exhaustion">After core hydrogen exhaustion</span>, <span class="mw-headline" id="Orbit_and_location">Orbit and location</span>, <span class="mw-headline" id="Orbit_in_Milky_Way">Orbit in Milky Way</span>, <span class="mw-headline" id="Theoretical_problems">Theoretical problems</span>, <span class="mw-headline" id="Coronal_heating_problem">Coronal heating problem</span>, <span class="mw-headline" id="Faint_young_Sun_problem">Faint young Sun problem</span>, <span class="mw-headline" id="Observational_history">Observational history</span>, <span class="mw-headline" id="Early_understanding">Early understanding</span>, <span class="mw-headline" id="Development_of_scientific_understanding">Development of scientific understanding</span>, <span class="mw-headline" id="Solar_space_missions">Solar space missions</span>, <span class="mw-headline" id="Observation_and_effects">Observation and effects</span>, <span class="mw-headline" id="Planetary_system">Planetary system</span>, <span class="mw-headline" id="Religious_aspects">Religious aspects</span>, <span class="mw-headline" id="See_also">See also</span>, <span class="mw-headline" id="Notes">Notes</span>, <span class="mw-headline" id="References">References</span>, <span class="mw-headline" id="Further_reading">Further reading</span>, <span class="mw-headline" id="External_links">External links</span>]

How can I strip the tags so that the list contains only the heading text?

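1 answer:

Answer 0: (score: 0)

Converting titles with list() only copies the tag objects, so the markup is still printed. Instead, iterate over the tags and read the text attribute of each one to get its text content: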

titles = [tag.text for tag in soup.find_all('span', {"class":"mw-headline"})]
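
For anyone who wants to try this without fetching the page, here is a minimal, self-contained sketch. The inline `html` string is a hypothetical stand-in for the downloaded Wikipedia page (it is not taken from the real page source), and the whitespace normalisation is an optional extra, not part of the answer above; it only matters if a heading happens to contain a line break.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page.
html = """
<h2><span class="mw-headline" id="Name_and_etymology">Name and etymology</span></h2>
<h2><span class="mw-headline" id="Composition">Composition</span></h2>
<h3><span class="mw-headline" id="Faint_young_Sun_problem">Faint young Sun
problem</span></h3>
"""

# Python's built-in parser, so no extra parser package is required.
soup = BeautifulSoup(html, "html.parser")

# get_text() returns the same string as the .text attribute;
# split() + join collapses any run of whitespace into a single space.
titles = [
    " ".join(tag.get_text().split())
    for tag in soup.find_all("span", class_="mw-headline")
]

print(titles)
```

If the headings are known to be clean, `tag.text` on its own, as in the answer, should be enough.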