Selecting HTML by id with BeautifulSoup

Date: 2017-11-28 20:58:29

Tags: html python-3.x beautifulsoup

I am trying to select a section of HTML from a web page.

I use this code:



import urllib.request
from bs4 import BeautifulSoup

html = urllib.request.urlopen("http://archiveofourown.org/works/1758087?view_full_work=true").read()
soup = BeautifulSoup(html, 'lxml')

# look up every tag with id="children" and print its href attribute
for tag in soup.find_all(id="children"):
    print(tag.get('href'))

I then extract the number (2219289) and the name (eiirene).

When I run the same code with url = "http://archiveofourown.org/works/225563?view_full_work=true", trying to select the same kind of element:

<div id="children" class="children module">
  <h3 class="heading">Works inspired by this one:</h3>
  <ul>
    <li>
        <a href="/works/2219289">not an origin story</a> by <a rel="author" href="/users/eiirene/pseuds/eiirene">eiirene</a>
    </li>
  </ul>
</div>

it returns 'None'. If I use find_all instead, it returns an empty list [].
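For what it's worth, find() returning None and find_all() returning [] both simply mean that no element with that id exists in the HTML that was actually parsed; one possible explanation (an assumption, not confirmed by the question) is that the server returns different HTML for that URL, so the div never reaches the parser. A minimal sketch of the two return values, using hypothetical snippets:

```python
from bs4 import BeautifulSoup

# Hypothetical page fragments: one with the target div, one without it.
page_with_div = '<div id="children"><a href="/works/2219289">x</a></div>'
page_without = '<p>no children section here</p>'

print(BeautifulSoup(page_with_div, 'html.parser').find(id="children") is not None)  # True
print(BeautifulSoup(page_without, 'html.parser').find(id="children"))               # None
print(BeautifulSoup(page_without, 'html.parser').find_all(id="children"))           # []
```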

Any ideas?

1 answer:

Answer (score: 1):

I put both URLs in a list and looped over them. I also used the requests library to simplify the fetching.

Code:

import requests, re
from bs4 import BeautifulSoup

urls = ["http://archiveofourown.org/works/1758087?view_full_work=true",
        "http://archiveofourown.org/works/225563?view_full_work=true"]
for url in urls:
    print("***********************")
    html = requests.get(url).text
    soup = BeautifulSoup(html, 'html.parser')
    part = soup.find("div", id="children")
    if part is None:
        continue  # this page has no "children" section
    # search only inside the matched div; no need to re-parse it
    nums = part.find_all('a', href=re.compile('/works/'))
    names = part.find_all('a', href=re.compile('/users/'))
    for num, name in zip(nums, names):
        print(num['href'].split('/works/')[1] + " : " + name.text)

Output:

***********************
2219289 : eiirene
***********************
240745 : blanketforyourshock
341326 : RedOrchid
787850 : ladymac111
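The answer pairs the /works/ and /users/ lists by index, which assumes both lists stay the same length. As an alternative sketch (not from the original answer), each <li> can be processed individually so that a work and its author always come from the same list item; the snippet variable below is a hypothetical excerpt of a page's children div:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical excerpt of the "children" div from one of the pages.
snippet = """
<div id="children">
  <ul>
    <li><a href="/works/2219289">not an origin story</a> by
        <a rel="author" href="/users/eiirene/pseuds/eiirene">eiirene</a></li>
  </ul>
</div>
"""
part = BeautifulSoup(snippet, 'html.parser').find("div", id="children")
for li in part.find_all('li'):
    # both links live in the same <li>, so the pairing cannot drift
    work = li.find('a', href=re.compile('/works/'))
    author = li.find('a', href=re.compile('/users/'))
    print(work['href'].split('/works/')[1] + " : " + author.text)
# prints: 2219289 : eiirene
```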