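I have a scraper that looks for prices on specific product pages. I am only interested in the current price, whether or not the product is on sale.
I store the identifying tags in a JSON file like this: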
{
    "some_ecommerce_site": {
        "product_name": ["span", "data-test", "product-name"],
        "breadcrumb": ["div", "class", "breadcrumbs"],
        "sale_price": ["span", "data-test", "sale-price"],
        "regular_price": ["span", "data-test", "product-price"]
    }
}
and I have the following functions to select the current price and clean up the price text:
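def get_pricing(self, rpi, spi):
    # Each identifier is [tag name, attribute, attribute value].
    sale_price = self.soup_object.find(spi[0], {spi[1]: spi[2]})
    regular_price = self.soup_object.find(rpi[0], {rpi[1]: rpi[2]})
    # Prefer the sale price when the page has one.
    return sale_price if sale_price else regular_price

def get_text(self, obj):
    # Remove runs of two or more whitespace characters from the tag's text.
    return re.sub(r'\s\s+', '', obj.text.strip()).encode('utf-8')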
which are called by:
def get_ids(name_of_ecommerce_site):
    # Load the identifier lists for one site from the JSON file.
    with open('site_identifiers.json') as j:
        return json.load(j)[name_of_ecommerce_site]

def get_data(self):
    rpi = self.site_ids['regular_price']
    spi = self.site_ids['sale_price']
    product_price = self.get_text(self.get_pricing(rpi, spi))
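For reference, the lookup and fallback described above can be sketched as a standalone Python 3 script. This is only an illustration: the inline `SITE_IDS` string stands in for `site_identifiers.json`, the sample HTML is made up, and `find_by_identifier` is a hypothetical helper name that is not part of my real code.

```python
import json

from bs4 import BeautifulSoup

# Inline stand-in for site_identifiers.json (assumption for this sketch).
SITE_IDS = json.loads("""
{
    "some_ecommerce_site": {
        "sale_price": ["span", "data-test", "sale-price"],
        "regular_price": ["span", "data-test", "product-price"]
    }
}
""")


def find_by_identifier(soup, identifier):
    """Hypothetical helper: look up [tag name, attribute, value] in the soup."""
    name, attr, value = identifier
    return soup.find(name, {attr: value})


def get_pricing(soup, ids):
    """Return the sale-price tag if present, otherwise the regular-price tag."""
    sale = find_by_identifier(soup, ids["sale_price"])
    if sale is not None:
        return sale
    return find_by_identifier(soup, ids["regular_price"])


ids = SITE_IDS["some_ecommerce_site"]

# Made-up page without a sale price: falls back to the regular price.
regular_only = BeautifulSoup(
    '<span data-test="product-price">£20.00</span>', "html.parser"
)
# Made-up page with both prices: the sale price wins.
on_sale = BeautifulSoup(
    '<span data-test="product-price">£20.00</span>'
    '<span data-test="sale-price">£12.00</span>',
    "html.parser",
)

regular_text = get_pricing(regular_only, ids).get_text(strip=True)
sale_text = get_pricing(on_sale, ids).get_text(strip=True)
print(regular_text, sale_text)
```

The explicit `is not None` check is a stylistic choice in this sketch; it behaves the same as the truthiness test in my real function when `find` returns either a tag or `None`.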
So far this approach works for every site except one, because its pricing markup looks like this:
<div class="product-price">
    <h3>
        £15.00
        <span class="price-standard">
            £35.00
        </span>
    </h3>
</div>
So product_price returns "£15£35" instead of the desired "£15".
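A minimal Python 3 reproduction of the problem, assuming the markup above is parsed with the built-in "html.parser" and cleaned with the same regular expression as get_text. Because `.text` concatenates the strings of every descendant, the old price inside the nested span ends up in the result:

```python
import re

from bs4 import BeautifulSoup

# Same structure as the problematic markup, written without indentation.
html = (
    '<div class="product-price">\n<h3>\n£15.00\n'
    '<span class="price-standard">\n£35.00\n</span>\n</h3>\n</div>'
)

soup = BeautifulSoup(html, "html.parser")
tag = soup.find("div", {"class": "product-price"})

# .text joins the text of all descendants, including the nested <span>.
cleaned = re.sub(r"\s\s+", "", tag.text.strip())
print(cleaned)
```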
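Is there a simple way to exclude the nested <span> without breaking the sites that already work?
I thought one solution would be to take the contents list and select index 0, but inspecting the tag's contents shows that this will not work, because the list effectively holds a single item:
>> print(type(regular_price))
>> <class 'bs4.element.Tag'>
>> print(regular_price.contents)
>> [u'\n', <h3>\n\n\xa325.00\n\n<span class="price-standard">\n\n\xa341.00\n</span>\n</h3>, u'\n']
I also tried building a list from the resulting NavigableString elements and then filtering out the empty strings:
filter(None, [self.get_text(unicode(x)) for x in sale_price.find_all(text=True)])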
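This fixes one problem but breaks other cases, because those sites usually put the currency symbol in a different tag from the amount, so I get back only "£".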
Answer 0 (score: 1)
If you want the text of an element without the text of any of its child elements, you can do this:
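from bs4 import BeautifulSoup, NavigableString

html = """
<div class="product-price">
<h3>
£15.00
<span class="price-standard">
£35.00
</span>
</h3>
</div>
"""

# "html.parser" ships with Python; the "xml" parser would require lxml.
bs = BeautifulSoup(html, "html.parser")
result = bs.find("div", {"class": "product-price"})

# Keep only the strings that are direct children of <h3>, skipping the nested <span>.
fr = [element for element in result.h3 if isinstance(element, NavigableString)]
print(fr[0].strip())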
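Taking only the first direct string would return just "£" on the sites the question mentions, where the currency symbol sits in its own tag. One possible variation, sketched here under assumptions, is to keep all of the text except what lies inside a known "old price" element. The function name `text_without`, the `OLD_PRICE` identifier, and the second sample page are hypothetical and are not part of the question's code.

```python
from bs4 import BeautifulSoup


def text_without(tag, exclude):
    """Join the text of `tag`, skipping strings inside elements matching `exclude`.

    `exclude` uses the question's [tag name, attribute, value] convention.
    """
    name, attr, value = exclude
    # Compare by identity so only the matched elements themselves are skipped.
    excluded_ids = {id(el) for el in tag.find_all(name, {attr: value})}
    parts = []
    for s in tag.find_all(string=True):
        if any(id(parent) in excluded_ids for parent in s.parents):
            continue
        parts.append(s.strip())
    return "".join(parts)


# Hypothetical identifier for the crossed-out price.
OLD_PRICE = ["span", "class", "price-standard"]

nested = BeautifulSoup(
    '<div class="product-price"><h3>\n£15.00\n'
    '<span class="price-standard">\n£35.00\n</span></h3></div>',
    "html.parser",
)
# Made-up markup where the currency symbol and amount are separate tags.
split = BeautifulSoup(
    '<span data-test="product-price"><span class="currency">£</span>'
    '<span class="amount">25.00</span></span>',
    "html.parser",
)

nested_price = text_without(nested.div, OLD_PRICE)
split_price = text_without(split.span, OLD_PRICE)
print(nested_price, split_price)
```

The trade-off is that each site with a nested old price needs one extra identifier in the JSON file, whereas the direct-children approach above needs no configuration but assumes the price is a bare string.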
This question may be a duplicate of this one.