I am learning web scraping. I wrote the following code:
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url= "DON'T WANT TO SHARE"
uClient= uReq(my_url)
page_html= uClient.read()
uClient.close()
page_soup= soup(page_html, "html.parser")
contents=page_soup.findAll("data")
print (contents)
After printing the contents, I get something like this:
<data>
------------------------------------
SIM: B01N2W56MD
(P)UBLISHER NAME: Monster
------------------------------------
(I)[ 0] Publisher: Monster
(I)[ 1] Title: Monster
(I)[12] Subject Keyword: nos
------------------------------------
(S)[ 0] Marketplace ID: 1
(S)[ 1] Replenishment Category: Non Replenishable
(S)[ 5] Title type: Main title 1
(S)[ 9] Product Group: No operation Product Handling Group
(S)[19] Product Subcategory: A
(S)[32] Are batteries required?: N
------------------------------------
(K)[ 0] IDC: 030347493342
(K)[ 1] ORC: 6800532606463
------------------------------------
</data>
How can I extract these values and print or store them, i.e. the values of SIM, Title, IDC and ORC?
Answer 0 (score: 0)
You can use regular expressions to extract these values:
import re
data = """
<data>
------------------------------------
SIM: B01N2W56MD
(P)UBLISHER NAME: Monster
------------------------------------
(I)[ 0] Publisher: Monster
(I)[ 1] Title: Monster
(I)[12] Subject Keyword: nos
------------------------------------
(S)[ 0] Marketplace ID: 1
(S)[ 1] Replenishment Category: Non Replenishable
(S)[ 5] Title type: Main title 1
(S)[ 9] Product Group: No operation Product Handling Group
(S)[19] Product Subcategory: A
(S)[32] Are batteries required?: N
------------------------------------
(K)[ 0] IDC: 030347493342
(K)[ 1] ORC: 6800532606463
------------------------------------
</data>"""
sim= re.search(r'SIM:\s(.*?)\n', data).group(1)
idc= re.search(r'IDC:\s(.*?)\n', data).group(1)
title = re.search(r'Title:\s(.*?)\n', data).group(1)
print(sim)
print(idc)
print(title)
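The code above simply looks for the text between "SIM:" and the next "\n" (newline) and stores it in a variable. Exactly the same logic is used to find the values of "IDC" and "Title", and it would work for "ORC" as well.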
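If you need more than a handful of fields, writing one regular expression per key gets repetitive. Below is a minimal sketch of a more general approach: it walks the block line by line and collects every "key: value" pair into a dictionary. The function name parse_fields, the pattern LINE_RE and the shortened sample text are my own illustrations, not part of the question, and the pattern assumes every field line looks like the ones shown above (an optional "(X)[nn]" prefix, then a key, a colon and a value).

```python
import re

# Optional "(X)[nn] " prefix, then the key (everything up to the first colon),
# then the value (the rest of the line). This layout is an assumption based
# on the sample output in the question.
LINE_RE = re.compile(r'^(?:\([A-Z]\)\[\s*\d+\]\s*)?([^:]+):\s*(.*)$')


def parse_fields(text):
    """Return a dict of all 'key: value' lines found in text."""
    fields = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:  # separator lines and the <data> tags have no colon
            key, value = match.groups()
            # If a key occurs more than once, the last occurrence wins.
            fields[key.strip()] = value.strip()
    return fields


# Shortened copy of the sample output, for illustration only.
sample = """
<data>
------------------------------------
SIM: B01N2W56MD
(P)UBLISHER NAME: Monster
------------------------------------
(I)[ 1] Title: Monster
(S)[32] Are batteries required?: N
------------------------------------
(K)[ 0] IDC: 030347493342
(K)[ 1] ORC: 6800532606463
------------------------------------
</data>"""

fields = parse_fields(sample)
print(fields["SIM"])
print(fields["Title"])
print(fields["IDC"])
print(fields["ORC"])
# .get() returns None instead of raising if a field is missing.
print(fields.get("Price"))
```

With the scraping code from the question, you could probably pass the text of the first matched tag instead of a hard-coded string, e.g. parse_fields(contents[0].get_text()), assuming findAll("data") returned at least one element.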