How to extract tags from the following HTML with Python

Time: 2018-03-11 07:46:31

Tags: python regex

I wrote a regular expression to search for the tags, as follows:

<a href=\".+\" rel=\"nofollow\"><strong>دانلود</strong></a>

But I only get a single huge match that contains other HTML tags.

My HTML is:

   <div class="download-51803-links">
<h3>لینک های دانلود</h3>
<span class="instruction-expander">راهنمای دانلود</span>
<script type="text/javascript">
  link=('report/' + 'pop-up.php')   
  document.write('<a class="dbox cboxElement" target="_blank" rel="nofollow" href="http://p30download.com/' + link + '?report-id=77722&report-bid=18&report-title=دانلود Machine Learning A Z Hands-On Python & R In Data Science آموزش کامل یادگیری ماشین آشنایی با پایتون و آر در علوم داده" style="padding:0px" ><span class="report-link">گزارش خرابی</span></a>')
</script>
<p dir="rtl"><img alt="اطلاعات" class="image-text-top" src="http://p30download.com/template/icons/set3/exclaim.gif" title="اطلاعات"/> <strong>حجم</strong>: 5.06 گیگابایت<br><img alt="دانلود" class="image-text-top" src="http://p30download.com/template/icons/set3/arrow-down.gif" title="دانلود"/> <a href="http://cdn.p30download.com/?b=p30dl-tutorial&amp;f=Udemy.Machine.Learning.A.Z..Hands.On.Python.and.R.In.Data.Science.Updated.1.2018_p30download.com.part1.rar" rel="nofollow"><strong>دانلود</strong></a> - بخش اول<br><img alt="دانلود" class="image-text-top" src="http://p30download.com/template/icons/set3/arrow-down.gif" title="دانلود"/> <a href="http://cdn.p30download.com/?b=p30dl-tutorial&amp;f=Udemy.Machine.Learning.A.Z..Hands.On.Python.and.R.In.Data.Science.Updated.1.2018_p30download.com.part2.rar" rel="nofollow"><strong>دانلود</strong></a> - بخش دوم<br><img alt="دانلود" class="image-text-top" src="http://p30download.com/template/icons/set3/arrow-down.gif" title="دانلود"/> <a href="http://cdn.p30download.com/?b=p30dl-tutorial&amp;f=Udemy.Machine.Learning.A.Z..Hands.On.Python.and.R.In.Data.Science.Updated.1.2018_p30download.com.part3.rar" rel="nofollow"><strong>دانلود</strong></a> - بخش سوم<br><img alt="دانلود" class="image-text-top" src="http://p30download.com/template/icons/set3/arrow-down.gif" title="دانلود"/> <a href="http://cdn.p30download.com/?b=p30dl-tutorial&amp;f=Udemy.Machine.Learning.A.Z..Hands.On.Python.and.R.In.Data.Science.Updated.1.2018_p30download.com.part4.rar" rel="nofollow"><strong>دانلود</strong></a> - بخش چهارم<br><img alt="دانلود" class="image-text-top" src="http://p30download.com/template/icons/set3/arrow-down.gif" title="دانلود"/> <a href="http://cdn.p30download.com/?b=p30dl-tutorial&amp;f=Udemy.Machine.Learning.A.Z..Hands.On.Python.and.R.In.Data.Science.Updated.1.2018_p30download.com.part5.rar" rel="nofollow"><strong>دانلود</strong></a> - بخش پنجم<br><img alt="دانلود" class="image-text-top" 
src="http://p30download.com/template/icons/set3/arrow-down.gif" title="دانلود"/> <a href="http://cdn.p30download.com/?b=p30dl-tutorial&amp;f=Udemy.Machine.Learning.A.Z..Hands.On.Python.and.R.In.Data.Science.Updated.1.2018_p30download.com.part6.rar" rel="nofollow"><strong>دانلود</strong></a> - بخش ششم</br></br></br></br></br></br></p>
</div>

How can I extract 4 of these items as <a> tags, like this one?

<a href="http://cdn.p30download.com/?b=p30dl-tutorial&amp;f=Udemy.Machine.Learning.A.Z..Hands.On.Python.and.R.In.Data.Science.Updated.1.2018_p30download.com.part1.rar" rel="nofollow"><strong>دانلود</strong></a>
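
A likely cause of the single huge match, sketched under the assumption that the pattern is used with re.findall: .+ is greedy, so it runs to the last " rel="nofollow"> on the line and swallows every link in between. The sample string and example.com URLs below are placeholders, not the real page.

```python
import re

# Shortened sample with two links on one line, as in the question's HTML.
# The URLs are placeholders, not the real download links.
sample = (
    '<a href="http://example.com/part1.rar" rel="nofollow"><strong>دانلود</strong></a> - '
    '<a href="http://example.com/part2.rar" rel="nofollow"><strong>دانلود</strong></a>'
)

# Greedy: .+ extends to the LAST '" rel="nofollow">' on the line, so both
# links end up inside one match.
greedy = re.findall(r'<a href=".+" rel="nofollow"><strong>دانلود</strong></a>', sample)

# [^"]+ cannot cross the closing quote of href, so each link matches on its own.
per_link = re.findall(r'<a href="[^"]+" rel="nofollow"><strong>دانلود</strong></a>', sample)

print(len(greedy))    # 1
print(len(per_link))  # 2
```

This only illustrates the greedy-match symptom; a pattern like this still depends on the exact attribute order and spacing in the markup.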

1 answer:

Answer 0: (score: 0)

Here is a solution using Beautiful Soup:

html =  """  <div class="download-51803-links">
<h3>لینک های دانلود</h3>
<span class="instruction-expander">راهنمای دانلود</span>
<script type="text/javascript">
  link=('report/' + 'pop-up.php')   
  document.write('<a class="dbox cboxElement" target="_blank" rel="nofollow" href="http://p30download.com/' + link + '?report-id=77722&report-bid=18&report-title=دانلود Machine Learning A Z Hands-On Python & R In Data Science آموزش کامل یادگیری ماشین آشنایی با پایتون و آر در علوم داده" style="padding:0px" ><span class="report-link">گزارش خرابی</span></a>')
</script>
<p dir="rtl"><img alt="اطلاعات" class="image-text-top" src="http://p30download.com/template/icons/set3/exclaim.gif" title="اطلاعات"/> <strong>حجم</strong>: 5.06 گیگابایت<br><img alt="دانلود" class="image-text-top" src="http://p30download.com/template/icons/set3/arrow-down.gif" title="دانلود"/> <a href="http://cdn.p30download.com/?b=p30dl-tutorial&amp;f=Udemy.Machine.Learning.A.Z..Hands.On.Python.and.R.In.Data.Science.Updated.1.2018_p30download.com.part1.rar" rel="nofollow"><strong>دانلود</strong></a> - بخش اول<br><img alt="دانلود" class="image-text-top" src="http://p30download.com/template/icons/set3/arrow-down.gif" title="دانلود"/> <a href="http://cdn.p30download.com/?b=p30dl-tutorial&amp;f=Udemy.Machine.Learning.A.Z..Hands.On.Python.and.R.In.Data.Science.Updated.1.2018_p30download.com.part2.rar" rel="nofollow"><strong>دانلود</strong></a> - بخش دوم<br><img alt="دانلود" class="image-text-top" src="http://p30download.com/template/icons/set3/arrow-down.gif" title="دانلود"/> <a href="http://cdn.p30download.com/?b=p30dl-tutorial&amp;f=Udemy.Machine.Learning.A.Z..Hands.On.Python.and.R.In.Data.Science.Updated.1.2018_p30download.com.part3.rar" rel="nofollow"><strong>دانلود</strong></a> - بخش سوم<br><img alt="دانلود" class="image-text-top" src="http://p30download.com/template/icons/set3/arrow-down.gif" title="دانلود"/> <a href="http://cdn.p30download.com/?b=p30dl-tutorial&amp;f=Udemy.Machine.Learning.A.Z..Hands.On.Python.and.R.In.Data.Science.Updated.1.2018_p30download.com.part4.rar" rel="nofollow"><strong>دانلود</strong></a> - بخش چهارم<br><img alt="دانلود" class="image-text-top" src="http://p30download.com/template/icons/set3/arrow-down.gif" title="دانلود"/> <a href="http://cdn.p30download.com/?b=p30dl-tutorial&amp;f=Udemy.Machine.Learning.A.Z..Hands.On.Python.and.R.In.Data.Science.Updated.1.2018_p30download.com.part5.rar" rel="nofollow"><strong>دانلود</strong></a> - بخش پنجم<br><img alt="دانلود" class="image-text-top" 
src="http://p30download.com/template/icons/set3/arrow-down.gif" title="دانلود"/> <a href="http://cdn.p30download.com/?b=p30dl-tutorial&amp;f=Udemy.Machine.Learning.A.Z..Hands.On.Python.and.R.In.Data.Science.Updated.1.2018_p30download.com.part6.rar" rel="nofollow"><strong>دانلود</strong></a> - بخش ششم</br></br></br></br></br></br></p>
</div>"""


from bs4 import BeautifulSoup
import random

soup = BeautifulSoup(html, 'html.parser')

# Collect every tag that has an href attribute (here, the <a> download links).
# The <a> written by document.write sits inside <script>, which the parser
# treats as plain text, so it is not picked up.
list_links = list(soup.find_all(href=True))

def return_links(list_, num):
    """Remove up to num randomly chosen items from list_ and return them."""
    links_list = []

    for _ in range(num):
        if not list_:  # fewer than num items were available
            break
        # randrange excludes the upper bound, so the index is always valid;
        # randint(0, len(list_)) could return len(list_) and raise IndexError.
        links_list.append(list_.pop(random.randrange(len(list_))))

    return links_list

list_of_links = return_links(list_links, 4)

for link in list_of_links:
    print(link)

Output (the 4 links are picked at random, so it varies between runs):

<a href="http://cdn.p30download.com/?b=p30dl-tutorial&amp;f=Udemy.Machine.Learning.A.Z..Hands.On.Python.and.R.In.Data.Science.Updated.1.2018_p30download.com.part3.rar" rel="nofollow"><strong>دانلود</strong></a>
<a href="http://cdn.p30download.com/?b=p30dl-tutorial&amp;f=Udemy.Machine.Learning.A.Z..Hands.On.Python.and.R.In.Data.Science.Updated.1.2018_p30download.com.part5.rar" rel="nofollow"><strong>دانلود</strong></a>
<a href="http://cdn.p30download.com/?b=p30dl-tutorial&amp;f=Udemy.Machine.Learning.A.Z..Hands.On.Python.and.R.In.Data.Science.Updated.1.2018_p30download.com.part6.rar" rel="nofollow"><strong>دانلود</strong></a>
<a href="http://cdn.p30download.com/?b=p30dl-tutorial&amp;f=Udemy.Machine.Learning.A.Z..Hands.On.Python.and.R.In.Data.Science.Updated.1.2018_p30download.com.part1.rar" rel="nofollow"><strong>دانلود</strong></a>
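
If a random pick is not required, a simpler variant is to take the first 4 links in document order. This is a minimal sketch, not the original answer's method: sample_html and the example.com URLs are hypothetical stand-ins for the page above, and it relies on the limit argument of find_all.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the question's HTML: six links separated by <br>.
sample_html = '<p dir="rtl">' + "<br>".join(
    f'<a href="http://example.com/?b=demo&amp;f=file.part{i}.rar" rel="nofollow">'
    f'<strong>دانلود</strong></a>'
    for i in range(1, 7)
) + "</p>"

soup = BeautifulSoup(sample_html, "html.parser")

# limit=4 stops after the first four <a> tags that carry an href attribute.
first_four = soup.find_all("a", href=True, limit=4)

# The parser decodes &amp; in attribute values, so href holds a plain "&".
hrefs = [a["href"] for a in first_four]

for a in first_four:
    print(a)
```

Printing a tag gives the full <a> element, while a["href"] gives just the URL, which is usually what a downloader needs.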