Web scraping issue using BeautifulSoup

Time: 2019-03-04 16:48:09

Tags: python web-scraping beautifulsoup

Web scraping issue (screen shot attached)

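from urllib.request import urlopen

import pandas as pd
from bs4 import BeautifulSoup

def get_text(value):
    # Relies on the module-level `soup` that the loop below assigns
    tdlist = []
    for i in soup.findAll(value): # Reduce data to those with html tag
        if i.text != "":
            text = i.text
            text = text.strip()
            if '\n' not in text: # Remove unnecessary data
                tdlist.append(text)
    return tdlist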

Master_df = pd.DataFrame()
logs = []

hh = 0
for tag in df_F['Value']:  

    print(hh)
    hh =  hh + 1

    try:
        url = 'https://www.ayurveda.com' + tag

        #weblink to scrape
        html = urlopen(url)
        y = html.read()

        # Page title is:  Scraping 
        soup = BeautifulSoup(y, 'html.parser') # Parse resulting source

        c_list = []
        Title = []


        for value in ['p']:
            c_list = get_text(value)

        for tes in soup.findAll('h1'):
            Title = tes.text

        com_list = c_list
        com_list = '. '.join(com_list)
        com_list = com_list.replace('..',". ")

        com_list1 = Title

        df_each = pd.DataFrame(columns = ["URL","Title","Content","Category","Website"],index = range(0,1))

        df_each["URL"] = url
        df_each["Content"] = com_list
        df_each["Title"] = com_list1
        df_each["Category"] = 'Ayurveda'
        df_each["Website"] = 'Ayurveda'

        # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
        Master_df = pd.concat([Master_df, df_each])
    except Exception as e:
        print("Hey!, check this :",str(e))
        logs.append(str(e))

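I am trying to download content from a website. There are two important pieces of information to download from each page.

1) The title, stored in one column (tagged "title"): this is clear, and I get the correct information.
2) The content, stored in another column (tagged "p"): I am having trouble getting this information.

Below is the information from the website:

I am able to scrape the following line (marked in bold and italics):

"by Vasant Lad, BAM&S, MASc"

I am unable to scrape the following (marked in italics):

Ayurveda is considered by many scholars to be the oldest healing science. In Sanskrit, Ayurveda means “The Science of Life.” Ayurvedic knowledge originated in India more than 5,000 years ago and is often called the “Mother of All Healing.” It stems from the ancient Vedic culture and was taught for many thousands of years in an oral tradition from accomplished masters to their disciples. Some of this knowledge was set to print a few thousand years ago, but much of it is inaccessible. The principles of many of the natural healing systems now familiar in the West have their roots in Ayurveda, including Homeopathy and Polarity Therapy.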

1 answer:

Answer 0 (score: 0)

The reason you are not getting that paragraph is this line:

if '\n' not in text:

The paragraph you want:

'Ayurveda is considered by many scholars to be the oldest healing science. In Sanskrit, Ayurveda means “The Science of Life.” Ayurvedic knowledge originated\n    in India more than 5,000 years ago and is often called the “Mother of All Healing.” It stems from the ancient Vedic culture and was taught for many\n    thousands of years in an oral tradition from accomplished masters to their disciples. Some of this knowledge was set to print a few thousand years\n    ago, but much of it is inaccessible. The principles of many of the natural healing systems now familiar in the West have their roots in Ayurveda, including\n    Homeopathy and Polarity Therapy.'

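contains \n, so that text is never appended to your tdlist. .strip() only removes newlines and whitespace at the beginning and end of the string, not the ones inside it. So you need a different condition.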
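A minimal sketch of that .strip() behaviour, using a made-up string rather than the real page content:

```python
# Hypothetical sample text: whitespace at both ends plus one newline in the middle
text = "\n  first line\n    second line  \n"

stripped = text.strip()

# Only the leading and trailing whitespace is gone; the inner newline survives,
# so a check like `'\n' not in stripped` still fails for this text.
print(repr(stripped))
print('\n' in stripped)
```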

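So you can simply add an extra condition that captures the specific content following the tag <p class="bitter">

I am assuming all the links follow that format.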
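To see what such a condition checks, here is a small sketch on a hypothetical HTML fragment (not the real page). The helper name follows_bitter is made up for illustration. It differs slightly from an exact comparison of .attrs: it tests whether "bitter" is among the previous tag's classes, and it guards against find_previous returning None for the first <p> in the document.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics the assumed page layout
html = """
<p class="bitter">by Vasant Lad</p>
<p>Ayurveda is considered by many scholars
    to be the oldest healing science.</p>
<p>Some other block
    that spans two lines.</p>
"""

soup = BeautifulSoup(html, "html.parser")

def follows_bitter(tag):
    """Return True if the closest preceding <p> carries the class 'bitter'."""
    prev = tag.find_previous("p")  # None when there is no earlier <p>
    return prev is not None and "bitter" in prev.get("class", [])

kept = []
for p in soup.find_all("p"):
    text = p.text.strip()
    # Keep single-line paragraphs, plus the paragraph right after <p class="bitter">
    if text and ("\n" not in text or follows_bitter(p)):
        kept.append(text)

print(kept)
```

In this fragment the byline and the paragraph after it are kept, while the last multi-line paragraph is still filtered out.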

Change the function:

def get_text(value):
    tdlist = []
    for i in soup.findAll(value): # Reduce data to those with html tag 
        if i.text != "":
            text = i.text
            text = text.strip()
            if '\n' not in text or i.find_previous(value).attrs == {'class': ['bitter']}: # Remove unnecessary data
                tdlist.append(text)
    return tdlist 

Output:

print (c_list)
['by Vasant Lad, BAM&S, MASc', 'Ayurveda is considered by many scholars to be the oldest healing science. In Sanskrit, Ayurveda means “The Science of Life.” Ayurvedic knowledge originated\n    in India more than 5,000 years ago and is often called the “Mother of All Healing.” It stems from the ancient Vedic culture and was taught for many\n    thousands of years in an oral tradition from accomplished masters to their disciples. Some of this knowledge was set to print a few thousand years\n    ago, but much of it is inaccessible. The principles of many of the natural healing systems now familiar in the West have their roots in Ayurveda, including\n    Homeopathy and Polarity Therapy.', 'Copyright © 2006, Vasant Lad, MASc, and The Ayurvedic Institute. All Rights Reserved.', 'Copyright © 2006, Vasant Lad, MASc, and The Ayurvedic Institute. All Rights Reserved.']
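The captured paragraph still contains the embedded \n and indentation. One possible follow-up, sketched here as an assumption rather than as part of the answer above, is to collapse all runs of whitespace in each extracted string. The function name clean_paragraphs and the sample strings are hypothetical. Note that this only tidies the text: it does not replace the filter that decides which <p> tags to keep.

```python
def clean_paragraphs(raw_texts):
    """Collapse whitespace runs in each string and drop strings that end up empty."""
    cleaned = []
    for raw in raw_texts:
        # str.split() with no arguments splits on any whitespace, including '\n'
        text = " ".join(raw.split())
        if text:
            cleaned.append(text)
    return cleaned

# Hypothetical strings, shaped like the values that i.text returns
raw = [
    "by Vasant Lad, BAM&S, MASc",
    "Ayurvedic knowledge originated\n    in India more than 5,000 years ago",
    "   \n ",
]

result = clean_paragraphs(raw)
print(result)
```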