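URL of a background image embedded in the style of a div (BS4 element)

Date: 2019-02-28 20:40:54

Tags: python web-scraping beautifulsoup

I'm no web-scraping expert, but I've managed to get most of what I want. However, I'm struggling to parse one last piece: the background image.

This is what I have: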

htmlSource.find('div', class_='flex-embed-content flex-embed-cover-image ')

Which returns:

<div class="flex-embed-content flex-embed-cover-image " style="background-image: url('//site.org/photos/0/kp/cr/QOKPCRqjkbbldlo-400x225-noPad.jpg?1528717310')"></div>

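I'm after the URL //site.org/photos/0/kp/cr/QOKPCRqjkbbldlo-400x225-noPad.jpg?1528717310

How can I parse it out of htmlSource?

Thanks

5 answers:

Answer 0: (score: 1)

Get the style attribute and use string manipulation. One example approach is below (there are obviously other ways).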

from bs4 import BeautifulSoup as bs

html = '''<div class="flex-embed-content flex-embed-cover-image " style="background-image: url('//site.org/photos/0/kp/cr/QOKPCRqjkbbldlo-400x225-noPad.jpg?1528717310')"></div>'''

soup = bs(html, 'lxml')

item = soup.select_one('div.flex-embed-content.flex-embed-cover-image')
# Keep only the text between url(' and ')
url = item['style'].split("url('")[1].split("')")[0]
print(url)

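Note that I use select_one because, based on the html, there is a single match. You could instead use select with a selector that includes the style attribute, div.flex-embed-content.flex-embed-cover-image[style], and loop over the results. I would also check whether the number of classes used in the selector can be reduced.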
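The select-and-loop variant mentioned above might look like the sketch below. The HTML fragment with two cover images and one div without a style attribute is made up for illustration, and the selector is reduced to a single class on the assumption that it is specific enough on the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two cover divs with a style, one without.
html = """
<div class="flex-embed-content flex-embed-cover-image " style="background-image: url('//site.org/photos/a.jpg?1')"></div>
<div class="flex-embed-content flex-embed-cover-image " style="background-image: url('//site.org/photos/b.jpg?2')"></div>
<div class="flex-embed-content flex-embed-cover-image "></div>
"""

soup = BeautifulSoup(html, "html.parser")

urls = []
# [style] restricts the match to divs that actually carry a style attribute.
for div in soup.select("div.flex-embed-cover-image[style]"):
    style = div["style"]
    if "url('" in style:
        # Keep only the text between url(' and ')
        urls.append(style.split("url('")[1].split("')")[0])

print(urls)
```

The [style] filter avoids a KeyError on divs that have the class but no inline style.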

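Answer 1: (score: 1)

First, you should get the div element. There are many ways to do this, but since you have a very specific class, the following is enough (here I assume your html code is stored in the htmlSource variable):

soup = BeautifulSoup(htmlSource, "html.parser")
divElement = soup.select_one('div.flex-embed-content.flex-embed-cover-image')

Now you should take the style attribute and filter it down to the url. I suggest using a regex, so you won't run into problems if unexpected elements are added to the style over time:

pattern = r"(?<=url\(').*?(?='\))"
url = re.search(pattern, divElement["style"]).group(0) # The group(0) is used to recover the whole match

The (?<=TEXT_BEFORE) part of the regex requires that the match be preceded by TEXT_BEFORE without including it in the match (a lookbehind assertion), and (?=TEXT_AFTER) does the opposite: it matches only if the match is followed by TEXT_AFTER (a lookahead assertion). Here TEXT_BEFORE is url(' including the opening quote, so the quote stays out of the result.

So the complete code should be:

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(htmlSource, "html.parser")
divElement = soup.select_one('div.flex-embed-content.flex-embed-cover-image')
pattern = r"(?<=url\(').*?(?='\))"
url = re.search(pattern, divElement["style"]).group(0)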
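The lookbehind and lookahead assertions can be tried in isolation on the style string from the question. The sketch below assumes the URL is always wrapped in single quotes; the helper name extract_url is made up for illustration, and it returns None instead of raising when the style contains no url('...').

```python
import re

# The lookbehind requires the literal text url(' right before the match;
# the lookahead requires ') right after it. Neither is part of the match.
URL_PATTERN = re.compile(r"(?<=url\(').*?(?='\))")


def extract_url(style):
    """Return the text inside url('...'), or None if the style has none."""
    match = URL_PATTERN.search(style)
    return match.group(0) if match else None


style = "background-image: url('//site.org/photos/0/kp/cr/QOKPCRqjkbbldlo-400x225-noPad.jpg?1528717310')"
print(extract_url(style))
print(extract_url("color: red"))
```

Checking the result of search before calling group(0) matters in a scraper, because re.search returns None for any div whose style has no background image.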

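Answer 2: (score: 0)

The style attribute contains CSS, which BeautifulSoup does not know how to parse.

So the first step is to get the content of the style attribute. Then you need to parse the CSS. You can parse it yourself (look for url(...)), which will work fine as long as the site doesn't change much.

Another option is to use a dedicated CSS parser, such as tinycss. I would use a CSS parser, since it makes your code more resilient to site changes.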
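As a rough sketch of the "parse it yourself" option, using only the standard library: split the inline style into declarations, pick background-image, and pull out whatever sits inside url(...), quoted or not. The function name background_image_url and the https://site.org/ base URL are assumptions for illustration; splitting on ";" is a simplification that would break on values that themselves contain a semicolon, such as some data URIs.

```python
import re
from urllib.parse import urljoin


def background_image_url(style, base_url):
    """Find the background-image declaration and return its URL, made absolute."""
    for declaration in style.split(";"):
        prop, sep, value = declaration.partition(":")
        if not sep or prop.strip().lower() != "background-image":
            continue
        # Accept url(x), url('x') and url("x").
        match = re.search(r"url\(\s*['\"]?(.*?)['\"]?\s*\)", value)
        if match:
            # urljoin resolves protocol-relative (//host/path) and relative URLs.
            return urljoin(base_url, match.group(1))
    return None


style = "background-image: url('//site.org/photos/0/kp/cr/QOKPCRqjkbbldlo-400x225-noPad.jpg?1528717310')"
print(background_image_url(style, "https://site.org/"))
```

Passing the page URL as base_url turns the protocol-relative //site.org/... form into a URL that can be requested directly.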

Answer 3: (score: 0)

I'm also a beginner at web scraping; here is one way to solve your problem.

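first = htmlSource.find('div', class_='flex-embed-content flex-embed-cover-image ')
get_style = first['style']
# Split on the first colon only, so a URL containing "https:" stays intact
break_url = get_style.split(':', 1)
break_url = break_url[1]
# The URL sits between the two single quotes
break_url = break_url.split("'")
final_url = break_url[1]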

Answer 4: (score: 0)

One solution is to use urlextract. Its URLExtract class helps find URLs in a string.

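Install:

pip install urlextract

Usage:

from urlextract import URLExtract

Code:

from bs4 import BeautifulSoup

extractor = URLExtract()

soup = BeautifulSoup(html, "lxml")
finddiv = soup.find('div', class_='flex-embed-content flex-embed-cover-image')

style = finddiv['style']

# gen_urls yields each URL found in the string
for url in extractor.gen_urls(style):
    print(url)
    print('----')
    # Prepend // to get the protocol-relative form used on the page
    print('//' + url)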