I'm not a web-scraping expert, but I've managed to get most of what I wanted. However, I'm having trouble parsing the last part, the background image.
This is what I have:
htmlSource.find('div', class_='flex-embed-content flex-embed-cover-image ')
which returns:
<div class="flex-embed-content flex-embed-cover-image " style="background-image: url('//site.org/photos/0/kp/cr/QOKPCRqjkbbldlo-400x225-noPad.jpg?1528717310')"></div>
I'm stuck on the URL //site.org/photos/0/kp/cr/QOKPCRqjkbbldlo-400x225-noPad.jpg?1528717310
How do I parse it out of htmlSource?
Thanks
Answer 0 (score: 1)
Grab the style attribute and use string manipulation. An example approach below (there are obviously other ways):
from bs4 import BeautifulSoup as bs
html = '''<div class="flex-embed-content flex-embed-cover-image " style="background-image: url('//site.org/photos/0/kp/cr/QOKPCRqjkbbldlo-400x225-noPad.jpg?1528717310')"></div>'''
soup = bs(html, 'lxml')
item = soup.select_one('div.flex-embed-content.flex-embed-cover-image')
item['style'].split("url('")[1][:-2]  # strip the trailing ')
Note that I use select_one because there is a single match in the html. You could use select with a selector that also requires the style attribute, div.flex-embed-content.flex-embed-cover-image[style], and loop over the results. I would also check whether the number of classes used in the selector can be reduced.
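As a quick, runnable sketch of that select-and-loop variant (the sample markup is copied from the question; html.parser is used here so nothing beyond bs4 itself is required):

```python
from bs4 import BeautifulSoup as bs

# Sample markup from the question
html = '''<div class="flex-embed-content flex-embed-cover-image " style="background-image: url('//site.org/photos/0/kp/cr/QOKPCRqjkbbldlo-400x225-noPad.jpg?1528717310')"></div>'''
soup = bs(html, 'html.parser')

# [style] restricts matches to divs that actually carry a style attribute,
# so item['style'] can never raise a KeyError inside the loop
urls = [item['style'].split("url('")[1][:-2]
        for item in soup.select('div.flex-embed-content.flex-embed-cover-image[style]')]
print(urls)
```

With more than one matching div on a real page, the same loop collects every background-image URL at once.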
Answer 1 (score: 1)
First, you should get the div element. There are many ways to do that, but since you have a very specific class, this is enough (assuming your html code is stored in the htmlSource variable):
soup = BeautifulSoup(htmlSource, "html.parser")
divElement = soup.select_one('div.flex-embed-content.flex-embed-cover-image')
Now you should take the style attribute and filter it down to the url. I suggest using a regex, so that unexpected elements added to the style over time won't cause problems:
pattern = r"(?<=url\(').*(?='\))"
url = re.search(pattern, divElement["style"]).group(0)  # group(0) recovers the whole match
In the regex, (?<=TEXT_BEFORE) asserts that the match must be preceded by TEXT_BEFORE without including it in the match (a lookbehind assertion), and (?=TEXT_AFTER) is the opposite: the match only succeeds when it is followed by TEXT_AFTER (a lookahead assertion).
So the full code should be:
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(htmlSource, "html.parser")
divElement = soup.select_one('div.flex-embed-content.flex-embed-cover-image')
pattern = r"(?<=url\(').*(?='\))"
url = re.search(pattern, divElement["style"]).group(0)
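To see the lookbehind and lookahead in action, here is the pattern run directly against the style value from the question (no BeautifulSoup needed for this part):

```python
import re

# Style value taken from the question's div
style = "background-image: url('//site.org/photos/0/kp/cr/QOKPCRqjkbbldlo-400x225-noPad.jpg?1528717310')"

# The lookbehind (?<=url\(') and lookahead (?='\)) anchor the match
# to the delimiters without including them in the result
pattern = r"(?<=url\(').*(?='\))"
url = re.search(pattern, style).group(0)
print(url)
```

Because the delimiters sit in zero-width assertions, group(0) is the bare URL with no quote or parenthesis to strip afterwards.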
Answer 2 (score: 0)
The style attribute contains CSS, which beautifulsoup doesn't know how to parse.
So the first step is to get the style attribute's content. Then you need to parse the CSS. You can parse it yourself (look for url(...)), and that will work fine as long as the site doesn't change much.
The other option is to use a dedicated CSS parser, such as tinycss. I would use a CSS parser; your code will be more resilient to site changes.
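A minimal sketch of the "parse it yourself" route using only the standard library (it tolerates single quotes, double quotes, or no quotes around the url(...) value, which covers the markup in the question):

```python
import re

# Style value from the question's div
style = "background-image: url('//site.org/photos/0/kp/cr/QOKPCRqjkbbldlo-400x225-noPad.jpg?1528717310')"

# Hand-rolled url(...) scan: optional opening quote, lazy capture of the
# address, optional closing quote, then the closing parenthesis
match = re.search(r"url\(['\"]?(.*?)['\"]?\)", style)
url = match.group(1) if match else None
print(url)
```

As the answer notes, this is fine for a stable site; a real CSS parser handles edge cases (whitespace, escaped characters, multiple backgrounds) that a one-off regex will not.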
Answer 3 (score: 0)
I'm also a beginner at web scraping; here is how you can solve your problem.
first=htmlSource.find('div', class_='flex-embed-content flex-embed-cover-image ')
get_style=first['style']
break_url=get_style.split(':')
break_url=break_url[1]
break_url=break_url.split("'")
final_url=break_url[1]
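Stepping through those splits on the sample style value shows why it works here (note this relies on the URL itself containing no ':', which holds for the scheme-relative URL in the question):

```python
# Style value from the question's div
style = "background-image: url('//site.org/photos/0/kp/cr/QOKPCRqjkbbldlo-400x225-noPad.jpg?1528717310')"

# Split on ':' -> ["background-image", " url('//site.org/...')"]
break_url = style.split(':')[1]
# Split on "'" -> [" url(", "//site.org/...", ")"]; index 1 is the URL
final_url = break_url.split("'")[1]
print(final_url)
```

A URL with an explicit scheme (e.g. https://...) would contain a ':' and break the first split, so the regex approaches above are safer in general.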
Answer 4 (score: 0)
One of the solutions is to use urlextract.
This class helps to find the urls in a string.
Implementation: the urlextract package.
Usage:
from urlextract import URLExtract
extractor = URLExtract()
Code:
from bs4 import BeautifulSoup
from urlextract import URLExtract

extractor = URLExtract()
soup = BeautifulSoup(html, "lxml")
finddiv = soup.find('div', class_='flex-embed-content flex-embed-cover-image')
style = finddiv['style']
for url in extractor.gen_urls(style):
    print(url)
    print('----')
    print('//' + url)
Output:
site.org/photos/0/kp/cr/QOKPCRqjkbbldlo-400x225-noPad.jpg?1528717310
----
//site.org/photos/0/kp/cr/QOKPCRqjkbbldlo-400x225-noPad.jpg?1528717310