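BeautifulSoup find - exclude nested tag from the block of interest

Time: 2018-10-17 12:40:06

Tags: python beautifulsoup nested

I have a scraper that looks for the price on particular product pages. I'm only interested in the current price, whether or not the product is on sale.

I store the identifying tags in a JSON file like this: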

{
    "some_ecommerce_site" : {
        "product_name" : ["span", "data-test", "product-name"],
        "breadcrumb" : ["div", "class", "breadcrumbs"],
        "sale_price" : ["span", "data-test", "sale-price"],
        "regular_price" : ["span", "data-test", "product-price"]
    }
}

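and have the following functions to select the current price and clean up the price text:

def get_pricing(self, rpi, spi):
    # Each identifier is a [tag name, attribute, value] triple from the JSON file.
    sale_price = self.soup_object.find(spi[0], {spi[1]: spi[2]})
    regular_price = self.soup_object.find(rpi[0], {rpi[1]: rpi[2]})

    # Prefer the sale price when the page has one.
    return sale_price if sale_price else regular_price

def get_text(self, obj):
    # Strip the ends, then drop any run of two or more whitespace characters.
    return re.sub(r'\s\s+', '', obj.text.strip()).encode('utf-8')

which are called by: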

def get_ids(name_of_ecommerce_site):
    with open('site_identifiers.json') as j:
        return json.load(j)[name_of_ecommerce_site]

def get_data(self):
    rpi = self.site_ids['regular_price']
    spi = self.site_ids['sale_price']

    product_price = self.get_text(self.get_pricing(rpi, spi))

So far this has worked for every website except one, because its pricing is formatted like this:

<div class="product-price">
    <h3>
    £15.00
        <span class="price-standard">
            £35.00
        </span>
    </h3>
</div>

So product_price returns "£15£35" instead of the desired "£15".
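To make the cause concrete, here is a minimal reproduction, assuming Python 3 and the built-in "html.parser" (the question's own code looks like Python 2, so details such as `.encode` differ). It applies the same cleanup as get_text to the matched `<div>` and shows that `.text` gathers every descendant string, including the one inside the nested `<span>`.

```python
import re
from bs4 import BeautifulSoup

html = """
<div class="product-price">
    <h3>
    £15.00
        <span class="price-standard">
            £35.00
        </span>
    </h3>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
price = soup.find("div", {"class": "product-price"})

# .text concatenates all descendant strings, nested <span> included.
raw = price.text

# Same cleanup as get_text in the question, minus the Python 2 encode step.
cleaned = re.sub(r'\s\s+', '', raw.strip())
print(repr(cleaned))
```

Both prices end up back to back in the cleaned string because the only thing separating them in the markup is whitespace.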

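Is there a simple way to exclude the nested <span> that won't break on the working sites?

I thought one solution would be to get a list and select index 0, but checking the tag's contents shows that won't work, since the whole <h3> is a single item in the list:

>>> print(type(regular_price))
<class 'bs4.element.Tag'>
>>> print(regular_price.contents)
[u'\n', <h3>\n\n\xa325.00\n\n<span class="price-standard">\n\n\xa341.00\n</span>\n</h3>, u'\n']

I've tried creating a list from the result's NavigableString elements and then filtering out the empty strings:

filter(None, [self.get_text(unicode(x)) for x in sale_price.find_all(text=True)])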

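This solves one problem but breaks other cases (because those sites commonly use a different tag for the currency than for the amount): there I would get back "£".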
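The trade-off can be sketched as follows, assuming Python 3 and Beautiful Soup 4.4 or later (for the `string=True` argument). The second snippet is hypothetical markup for a site that puts the currency symbol and the amount in separate tags; it is not taken from any real page. The helper name `first_text_node` is also made up for illustration.

```python
from bs4 import BeautifulSoup

def first_text_node(tag):
    # Collect every descendant string, drop whitespace-only ones, keep the first.
    strings = [s.strip() for s in tag.find_all(string=True)]
    return [s for s in strings if s][0]

# Markup shaped like the problem site: old price nested inside the current one.
nested = BeautifulSoup(
    '<div class="product-price"><h3>£15.00'
    '<span class="price-standard">£35.00</span></h3></div>',
    "html.parser",
)

# Hypothetical markup for a site with separate currency and amount tags.
split = BeautifulSoup(
    '<span data-test="product-price">'
    '<span class="currency">£</span><span class="amount">15.00</span></span>',
    "html.parser",
)

print(first_text_node(nested.div))   # current price only
print(first_text_node(split.span))   # currency symbol only
```

Taking the first text node drops the nested old price on the first page, but on the second page the first text node is just the currency symbol.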

1 answer:

Answer 0 (score: 1)

If you want to get the text without any of the child elements, you can do it like this:

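from bs4 import BeautifulSoup, NavigableString


html = """
<div class="product-price">
    <h3>
    £15.00
        <span class="price-standard">
            £35.00
        </span>
    </h3>
</div>
"""
# Parse the fragment with the built-in HTML parser.
bs = BeautifulSoup(html, "html.parser")
result = bs.find("div", {"class": "product-price"})
# Keep only the strings that are direct children of <h3>, skipping nested tags.
fr = [element for element in result.h3 if isinstance(element, NavigableString)]
# Strip the surrounding whitespace before printing.
print(fr[0].strip())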
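Picking direct text nodes depends on knowing that the price sits in `<h3>`. A different approach, sketched here under assumptions, is to remove the unwanted nested tag from a copy of the match and then read the text as usual, which leaves sites with separate currency and amount tags unaffected. The function name `clean_text` and the optional `exclude` triple (same [tag, attribute, value] shape as the JSON identifiers) are hypothetical additions, not part of the question's code or of Beautiful Soup. The sketch assumes Python 3 and Beautiful Soup 4.4 or later, where `copy.copy` on a tag also copies its contents.

```python
import copy
import re
from bs4 import BeautifulSoup

def clean_text(tag, exclude=None):
    """Return the tag's text with whitespace runs removed.

    exclude is a hypothetical [name, attribute, value] triple naming a
    nested tag whose text should be left out.
    """
    tag = copy.copy(tag)  # work on a copy so the parsed page is not modified
    if exclude:
        for nested in tag.find_all(exclude[0], {exclude[1]: exclude[2]}):
            nested.decompose()  # drop the nested tag and its text
    return re.sub(r'\s\s+', '', tag.get_text().strip())

html = """
<div class="product-price">
    <h3>
    £15.00
        <span class="price-standard">
            £35.00
        </span>
    </h3>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
price = soup.find("div", {"class": "product-price"})

current = clean_text(price, exclude=["span", "class", "price-standard"])
print(current)
```

Sites that need no exclusion would simply omit the extra entry, so `clean_text(tag)` behaves like the original cleanup for them.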
  

This question may be a duplicate of this