I am trying to select part of the HTML from a web page.
I use this code:
import urllib.request, urllib.parse, urllib.error
from lxml import html
from bs4 import BeautifulSoup
html = urllib.request.urlopen("http://archiveofourown.org/works/1758087?view_full_work=true").read()
soup = BeautifulSoup(html, 'lxml')
for tag in soup.find_all(id="children"):
    print(tag.get('href'))
Then I extract the number (2219289) and the name (eiirene).
When I run the same code on url = "http://archiveofourown.org/works/225563?view_full_work=true", trying to select this:
<div id="children" class="children module">
  <h3 class="heading">Works inspired by this one:</h3>
  <ul>
    <li>
      <a href="/works/2219289">not an origin story</a> by <a rel="author" href="/users/eiirene/pseuds/eiirene">eiirene</a>
    </li>
  </ul>
</div>
it returns None. If I use find_all, it returns an empty list [].
Any ideas?
Answer 0 (score: 1)
I put both URLs in a list and loop over them. I also used the requests library to keep things simple.
Code:
import re

import requests
from bs4 import BeautifulSoup

urls = [
    "http://archiveofourown.org/works/1758087?view_full_work=true",
    "http://archiveofourown.org/works/225563?view_full_work=true",
]
for url in urls:
    print("***********************")
    html = requests.get(url).text
    soup = BeautifulSoup(html, 'html.parser')
    # The "Works inspired by this one" block is the <div id="children">.
    part = soup.find("div", id="children")
    if part is None:
        continue  # no such block on this page
    # Work links and author links appear in matching order inside the block.
    nums = part.find_all('a', href=re.compile('/works/'))
    names = part.find_all('a', href=re.compile('/users/'))
    for num, name in zip(nums, names):
        print(num['href'].split('/works/')[1] + " : " + name.text)
Output:
***********************
2219289 : eiirene
***********************
240745 : blanketforyourshock
341326 : RedOrchid
787850 : ladymac111
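The pairing logic can also be checked offline against the HTML fragment from the question. The following is only a sketch: it uses the standard library's html.parser in place of BeautifulSoup so that it runs without network access or extra packages, and the names FRAGMENT, ChildrenLinks and extract_pairs are made up for this illustration. It assumes, as the code above does, that work links and author links appear in matching order inside the div with id "children".

```python
from html.parser import HTMLParser

# HTML fragment copied from the question.
FRAGMENT = """
<div id="children" class="children module">
  <h3 class="heading">Works inspired by this one:</h3>
  <ul>
    <li>
      <a href="/works/2219289">not an origin story</a> by <a rel="author" href="/users/eiirene/pseuds/eiirene">eiirene</a>
    </li>
  </ul>
</div>
"""


class ChildrenLinks(HTMLParser):
    """Collect work ids and author names found inside <div id="children">."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # <div> nesting depth inside the target block
        self.in_author = False  # True while inside an author <a> element
        self.work_ids = []
        self.authors = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            # Start counting at the target div, keep counting nested divs.
            if self.depth or attrs.get("id") == "children":
                self.depth += 1
        elif tag == "a" and self.depth:
            href = attrs.get("href") or ""
            if href.startswith("/works/"):
                self.work_ids.append(href.split("/works/")[1])
            elif href.startswith("/users/"):
                self.in_author = True
                self.authors.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
        elif tag == "a":
            self.in_author = False

    def handle_data(self, data):
        # Accumulate the link text of the current author <a>.
        if self.in_author:
            self.authors[-1] += data


def extract_pairs(html_text):
    """Return a list of (work id, author name) tuples."""
    parser = ChildrenLinks()
    parser.feed(html_text)
    parser.close()
    return list(zip(parser.work_ids, parser.authors))


if __name__ == "__main__":
    for work_id, author in extract_pairs(FRAGMENT):
        print(work_id + " : " + author)
```

Note that the href attributes live on the nested a elements, not on the div itself, so the extraction has to look at the links inside the block rather than at the block's own attributes.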