How do I extract the phone names from the following list using Python?

Asked: 2019-05-11 08:06:21

Tags: python web-scraping data-science

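I scraped data from a website and ended up with a list whose items are span tags wrapping the data I need. I have tried a few adjustments but could not find a suitable approach. I want to remove the span tags and keep only the name and details of each phone.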

[<span class="a-size-medium a-color-base a-text-normal">Huawei Mate SE Factory Unlocked 5.93” - 4GB/64GB Octa-core Processor| 16MP + 2MP Dual Camera| GSM Only |Grey (US Warranty)</span>, <span class="a-size-medium a-color-base a-text-normal">Huawei Mate SE Factory Unlocked 5.93” - 4GB/64GB Octa-core Processor| 16MP + 2MP Dual Camera| GSM Only |Grey (US Warranty)</span>, <span class="a-size-medium a-color-base a-text-normal">Huawei Mate SE Factory Unlocked 5.93” - 4GB/64GB Octa-core Processor| 16MP + 2MP Dual Camera| GSM Only |Grey (US Warranty)</span>, <span class="a-size-medium a-color-base a-text-normal">Huawei Honor 8X (64GB + 4GB RAM) 6.5" HD 4G LTE GSM Factory Unlocked Smartphone - International Version No Warranty JSN-L23 (Black)</span>, <span class="a-size-medium a-color-base a-text-normal">Huawei Honor 8X (64GB + 4GB RAM) 6.5" HD 4G LTE GSM Factory Unlocked Smartphone - International Version No Warranty JSN-L23 (Black)</span>]

I want output like this:

[ Huawei Mate SE Factory Unlocked 5.93” - 4GB/64GB Octa-core Processor| 16MP + 2MP Dual Camera| GSM Only |Grey,Huawei Mate SE Factory Unlocked 5.93” - 4GB/64GB Octa-core Processor| 16MP + 2MP Dual Camera| GSM Only |Grey (US Warranty),Huawei Honor 8X (64GB + 4GB RAM) 6.5" HD 4G LTE GSM Factory Unlocked Smartphone - International Version No Warranty JSN-L23 (Black)]

The list above contains only a few elements of the full list. I will remove the duplicate entries later.

3 answers:

Answer 0 (score: 1)

Do you mean something like this?

txt = "<span class=\"a-size-medium a-color-base a-text-normal\">Huawei Mate SE Factory Unlocked 5.93” - 4GB/64GB Octa-core Processor| 16MP + 2MP Dual Camera| GSM Only |Grey (US Warranty)</span>, <span class=\"a-size-medium a-color-base a-text-normal\">Huawei Mate SE Factory Unlocked 5.93” - 4GB/64GB Octa-core Processor| 16MP + 2MP Dual Camera| GSM Only |Grey (US Warranty)</span>, <span class=\"a-size-medium a-color-base a-text-normal\">Huawei Mate SE Factory Unlocked 5.93” - 4GB/64GB Octa-core Processor| 16MP + 2MP Dual Camera| GSM Only |Grey (US Warranty)</span>, <span class=\"a-size-medium a-color-base a-text-normal\">Huawei Honor 8X (64GB + 4GB RAM) 6.5\" HD 4G LTE GSM Factory Unlocked Smartphone - International Version No Warranty JSN-L23 (Black)</span>, <span class=\"a-size-medium a-color-base a-text-normal\">Huawei Honor 8X (64GB + 4GB RAM) 6.5\" HD 4G LTE GSM Factory Unlocked Smartphone - International Version No Warranty JSN-L23 (Black)</span>"
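# Split the string into one entry per <span> element.
x = txt.split(", ")
# dict.fromkeys drops duplicates while keeping the original order.
mylist = list(dict.fromkeys(x))
open_tag = "<span class=\"a-size-medium a-color-base a-text-normal\">"
close_tag = "</span>"
result = []  # renamed from "list" so the built-in list() is not shadowed
for val in mylist:
    # Cut off the closing tag, then the opening tag, leaving only the text.
    if close_tag in val:
        val = val[:val.rfind(close_tag)]
    if val.startswith(open_tag):
        val = val[len(open_tag):]
    result.append(val)

print(result)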
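The slicing above depends on the exact class string of the opening tag. As an alternative, here is a minimal sketch that uses a regular expression from the standard library instead. The helper name `span_texts` and the shortened sample data are made up for illustration, and the pattern assumes the spans are flat (not nested).

```python
import re

def span_texts(html_fragment):
    """Return the text inside each <span>...</span>, duplicates removed, order kept."""
    # Non-greedy match of everything between an opening <span ...> and the next </span>.
    # Assumption: spans are not nested inside each other.
    found = re.findall(r"<span[^>]*>(.*?)</span>", html_fragment, flags=re.DOTALL)
    # dict.fromkeys keeps the first occurrence of each title (Python 3.7+).
    return list(dict.fromkeys(t.strip() for t in found))

# Hypothetical, shortened sample in the same shape as the question's data.
sample = (
    '<span class="a-text-normal">Phone A (Black)</span>, '
    '<span class="a-text-normal">Phone A (Black)</span>, '
    '<span class="a-text-normal">Phone B, 64GB</span>'
)
print(span_texts(sample))  # ['Phone A (Black)', 'Phone B, 64GB']
```

Because the pattern captures what sits between the tags rather than splitting on ", ", a title that contains a comma stays in one piece.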

Answer 1 (score: 0)

s = 'Huawei Mate SE Factory Unlocked 5.93” - 4GB/64GB Octa-core Processor| 16MP + 2MP Dual Camera| GSM Only |Grey (US Warranty),Huawei Mate SE Factory Unlocked 5.93” - 4GB/64GB Octa-core Processor| 16MP + 2MP Dual Camera| GSM Only |Grey (US Warranty),Huawei Mate SE Factory Unlocked 5.93” - 4GB/64GB Octa-core Processor| 16MP + 2MP Dual Camera| GSM Only |Grey (US Warranty), Huawei Honor 8X (64GB + 4GB RAM) 6.5" HD 4G LTE GSM Factory Unlocked Smartphone - International Version No Warranty JSN-L23 (Black), Huawei Honor 8X (64GB + 4GB RAM) 6.5" HD 4G LTE GSM Factory Unlocked Smartphone - International Version No Warranty JSN-L23 (Black)'

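from collections import OrderedDict

# Split on commas and strip the stray spaces left after some of them.
sp = [sk.strip() for sk in s.split(",")]

# OrderedDict.fromkeys removes duplicates and keeps first-seen order.
res = list(OrderedDict.fromkeys(sp))

print(res)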

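(Since each of your items is followed by a comma, I used the comma as the separator. Make sure the entries are still split correctly afterwards.) Hope this helps.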
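The caution about checking the split matters once a product title contains a comma of its own. A small sketch of that failure, using hypothetical titles that are not taken from the question:

```python
# Hypothetical titles; the second one contains a comma of its own.
titles = ["Phone A (Black)", "Phone B, 64GB"]

# Joining with "," and splitting again breaks the second title in two.
by_comma = ",".join(titles).split(",")
print(by_comma)    # ['Phone A (Black)', 'Phone B', ' 64GB']

# A separator that cannot occur inside a title (here a newline) round-trips cleanly.
by_newline = "\n".join(titles).split("\n")
print(by_newline)  # ['Phone A (Black)', 'Phone B, 64GB']
```

If the titles are already available as a list, keeping them as a list avoids the problem entirely.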

Answer 2 (score: 0)

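Take each item in the list, load it into BeautifulSoup, and select all the span tags. If you had an actual list of strings, I would expect '' around each string. Add the results to a set to remove duplicates.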

from bs4 import BeautifulSoup as bs

aList = ['<span class="a-size-medium a-color-base a-text-normal">Huawei Mate SE Factory Unlocked 5.93” - 4GB/64GB Octa-core Processor| 16MP + 2MP Dual Camera| GSM Only |Grey (US Warranty)</span>, <span class="a-size-medium a-color-base a-text-normal">Huawei Mate SE Factory Unlocked 5.93” - 4GB/64GB Octa-core Processor| 16MP + 2MP Dual Camera| GSM Only |Grey (US Warranty)</span>, <span class="a-size-medium a-color-base a-text-normal">Huawei Mate SE Factory Unlocked 5.93” - 4GB/64GB Octa-core Processor| 16MP + 2MP Dual Camera| GSM Only |Grey (US Warranty)</span>, <span class="a-size-medium a-color-base a-text-normal">Huawei Honor 8X (64GB + 4GB RAM) 6.5" HD 4G LTE GSM Factory Unlocked Smartphone - International Version No Warranty JSN-L23 (Black)</span>, <span class="a-size-medium a-color-base a-text-normal">Huawei Honor 8X (64GB + 4GB RAM) 6.5" HD 4G LTE GSM Factory Unlocked Smartphone - International Version No Warranty JSN-L23 (Black)</span>']
for i in aList:
    soup = bs(i, 'lxml')  # the 'lxml' parser requires the lxml package
    titles = [item.text for item in soup.select('span')]  # list of span texts
    print(titles)
    text = ','.join(titles)  # the same texts as one comma-separated string
    print(text)
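The answer mentions adding the texts to a set to remove duplicates, but the snippet does not show that step. A plain `set` does not preserve order, so one possible sketch is to use the set only to track what has already been seen. The function name `unique_in_order` and the sample titles are hypothetical stand-ins for the span texts collected above.

```python
def unique_in_order(titles):
    """Drop repeated titles, keeping the first occurrence of each."""
    seen = set()   # titles encountered so far
    result = []
    for title in titles:
        if title not in seen:
            seen.add(title)
            result.append(title)
    return result

# Hypothetical stand-in for the list of span texts.
texts = ["Phone A (Black)", "Phone A (Black)", "Phone B (Grey)", "Phone A (Black)"]
print(unique_in_order(texts))  # ['Phone A (Black)', 'Phone B (Grey)']
```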