How can I rectify my script below to get side-by-side output?

Time: 2018-01-05 17:27:18

Tags: python python-3.x web-scraping beautifulsoup

I've written a script in Python to collect some information from a webpage. I wrote it very compactly using a CSS selector, and the script does fetch the data. The problem is that the results are not displayed side by side, because I used a comma-separated selector to grab both types of values at once. If that isn't clear, see the example below.

The script I tried with:

import requests
from bs4 import BeautifulSoup

res = requests.get("https://www.drugbank.ca/drugs/DB04789")
soup = BeautifulSoup(res.text ,"lxml")
items = '\n'.join([item.text for item in soup.select("dl > dt , dl > dd")])
print(items)

Output I'm expecting:

Name
Accession Number
Type

5-methyltetrahydrofolic acid
DB04789
Small Molecule 

Is it possible to apply some minor change to the selector to get the expected output while keeping it a one-liner, as above? Thanks for taking a look.

3 Answers:

Answer 0 (score: 1)

Get the <dt> elements (all_dt) and <dd> elements (all_dd) separately, and use zip(all_dt, all_dd) to create pairs:

import requests
from bs4 import BeautifulSoup

res = requests.get("https://www.drugbank.ca/drugs/DB04789")
soup = BeautifulSoup(res.text ,"lxml")

all_dt = soup.select("dl > dt")
all_dd = soup.select("dl > dd")

for dt, dd in zip(all_dt, all_dd):
    print(dt.text, ":", dd.text)

You can also use nextSibling to get the element that comes after each dt:
all_dt = soup.select("dl > dt")

for dt in all_dt:
    dd = dt.nextSibling
    print(dt.text, ":", dd.text) 
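One caveat with nextSibling: when the HTML contains newlines between the tags, it returns a whitespace text node rather than the <dd> tag. A minimal sketch that sidesteps this with find_next_sibling, using an inline HTML snippet (with hypothetical values from the question) and the built-in html.parser, so no network request is needed:

```python
from bs4 import BeautifulSoup

# Inline sample mimicking the drugbank.ca dt/dd structure
html = """
<dl>
  <dt>Name</dt>
  <dd>5-methyltetrahydrofolic acid</dd>
  <dt>Accession Number</dt>
  <dd>DB04789</dd>
</dl>
"""

soup = BeautifulSoup(html, "html.parser")

pairs = []
for dt in soup.select("dl > dt"):
    # find_next_sibling("dd") skips the "\n" text nodes that
    # nextSibling would return between <dt> and <dd>
    dd = dt.find_next_sibling("dd")
    pairs.append((dt.text, dd.text))

for name, value in pairs:
    print(name, ":", value)
```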

The full code from my answer to the question Deep parse with beautifulsoup (asked two hours ago):

import requests
from bs4 import BeautifulSoup

def get_details(url):
    print('details:', url)

    # get subpage
    r = requests.get(url)
    soup = BeautifulSoup(r.text ,"lxml")

    # get data on subpage
    dts = soup.findAll('dt')
    dds = soup.findAll('dd')

    # display details
    for dt, dd in zip(dts, dds):
        print(dt.text)
        print(dd.text)
        print('---')

    print('---------------------------')

def drug_data():
    url = 'https://www.drugbank.ca/drugs/'

    while url:
        print(url)
        r = requests.get(url)
        soup = BeautifulSoup(r.text ,"lxml")

        # get links to subpages
        links = soup.select('strong a')
        for link in links:
            # execute function to get subpage
            get_details('https://www.drugbank.ca' + link['href'])

        # next page url
        url = soup.findAll('a', {'class': 'page-link', 'rel': 'next'})
        print(url)
        if url:
            url = 'https://www.drugbank.ca' + url[0].get('href')
        else:
            break

drug_data()

Answer 1 (score: 1)

If you fetch the two types of data separately, you can zip them together and then print them:

import requests
from bs4 import BeautifulSoup

res = requests.get("https://www.drugbank.ca/drugs/DB04789")
soup = BeautifulSoup(res.text ,"lxml")
categories = soup.select("dl > dt")
entries = soup.select("dl > dd")
items = zip(categories, entries)

for item in items:
    print(item[0].text + ": " + item[1].text)
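If you need to look values up by label afterwards, the same zipped pairs can also feed a dict comprehension. A small sketch on an inline snippet (sample values taken from the question; this assumes the labels are unique):

```python
from bs4 import BeautifulSoup

# Inline sample standing in for the fetched drugbank.ca page
html = ("<dl><dt>Name</dt><dd>5-methyltetrahydrofolic acid</dd>"
        "<dt>Type</dt><dd>Small Molecule</dd></dl>")
soup = BeautifulSoup(html, "html.parser")

categories = soup.select("dl > dt")
entries = soup.select("dl > dd")

# zip pairs each label with its value; the dict gives label -> value lookup
data = {dt.text: dd.text for dt, dd in zip(categories, entries)}
print(data["Type"])
```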

Answer 2 (score: 0)

This is more or less the version of the answer I was expecting:

import requests
from bs4 import BeautifulSoup

res = requests.get("https://www.drugbank.ca/drugs/DB04789")
soup = BeautifulSoup(res.text ,"lxml")
items = [': '.join([item.text,item.find_next().text]) for item in soup.select("dl > dt")]
print(items)
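Note that this prints a Python list. To get one "label: value" pair per line instead, the same find_next approach can be joined with newlines, as in the original script. A sketch on an inline snippet (hypothetical values from the question) so it runs without a network request:

```python
from bs4 import BeautifulSoup

# Inline sample standing in for the drugbank.ca page
html = ("<dl><dt>Name</dt><dd>5-methyltetrahydrofolic acid</dd>"
        "<dt>Type</dt><dd>Small Molecule</dd></dl>")
soup = BeautifulSoup(html, "html.parser")

# find_next() returns the tag immediately following each <dt>,
# i.e. its <dd>; joining with '\n' keeps the one-liner style
items = '\n'.join(
    ': '.join([dt.text, dt.find_next().text])
    for dt in soup.select("dl > dt")
)
print(items)
```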