I'm not a web-scraping expert, but I've managed to get most of what I wanted. However, I'm having trouble parsing the last part, the background image.
This is what I have:
htmlSource.find('div', class_='flex-embed-content flex-embed-cover-image ')
which returns:
<div class="flex-embed-content flex-embed-cover-image " style="background-image: url('//site.org/photos/0/kp/cr/QOKPCRqjkbbldlo-400x225-noPad.jpg?1528717310')"></div>
I'm stuck on the URL //site.org/photos/0/kp/cr/QOKPCRqjkbbldlo-400x225-noPad.jpg?1528717310
How do I parse it out of htmlSource?
Thanks
Answer 0 (score: 1)
Grab the style attribute and use string manipulation. An example approach below (there are obviously other ways):
from bs4 import BeautifulSoup as bs
html = '''<div class="flex-embed-content flex-embed-cover-image " style="background-image: url('//site.org/photos/0/kp/cr/QOKPCRqjkbbldlo-400x225-noPad.jpg?1528717310')"></div>'''
soup = bs(html, 'lxml')
item = soup.select_one('div.flex-embed-content.flex-embed-cover-image')
item['style'].split("url('")[1][:-2]  # strip the trailing ')
Note that I use select_one because there is a single match in the html. You could use select with a selector that also requires the style attribute, div.flex-embed-content.flex-embed-cover-image[style], and loop over the results. I would also check whether the number of classes used in the selector can be reduced.
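As a quick, runnable sketch of that select-and-loop variant (the sample markup is copied from the question; html.parser is used here so nothing beyond bs4 itself is required):

```python
from bs4 import BeautifulSoup as bs

# Sample markup from the question
html = '''<div class="flex-embed-content flex-embed-cover-image " style="background-image: url('//site.org/photos/0/kp/cr/QOKPCRqjkbbldlo-400x225-noPad.jpg?1528717310')"></div>'''
soup = bs(html, 'html.parser')

# [style] restricts matches to divs that actually carry a style attribute,
# so item['style'] can never raise a KeyError inside the loop
urls = [item['style'].split("url('")[1][:-2]
        for item in soup.select('div.flex-embed-content.flex-embed-cover-image[style]')]
print(urls)
```

With more than one matching div on a real page, the same loop collects every background-image URL at once.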
Answer 1 (score: 1)
First, you should get the div element. There are many ways to do that, but since you have a very specific class, this is enough (assuming your html code is stored in the htmlSource variable):
soup = BeautifulSoup(htmlSource, "html.parser")
divElement = soup.select_one('div.flex-embed-content.flex-embed-cover-image')
Now you should take the style attribute and filter it down to the url. I suggest using a regex, so that unexpected elements added to the style over time won't cause problems:
pattern = r"(?<=url\(').*(?='\))"
url = re.search(pattern, divElement["style"]).group(0)  # group(0) recovers the whole match
In the regex, (?<=TEXT_BEFORE) asserts that the match must be preceded by TEXT_BEFORE without including it in the match (a lookbehind assertion), and (?=TEXT_AFTER) is the opposite: the match only succeeds when it is followed by TEXT_AFTER (a lookahead assertion).
So the full code should be:
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(htmlSource, "html.parser")
divElement = soup.select_one('div.flex-embed-content.flex-embed-cover-image')
pattern = r"(?<=url\(').*(?='\))"
url = re.search(pattern, divElement["style"]).group(0)
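To see the lookbehind and lookahead in action, here is the pattern run directly against the style value from the question (no BeautifulSoup needed for this part):

```python
import re

# Style value taken from the question's div
style = "background-image: url('//site.org/photos/0/kp/cr/QOKPCRqjkbbldlo-400x225-noPad.jpg?1528717310')"

# The lookbehind (?<=url\(') and lookahead (?='\)) anchor the match
# to the delimiters without including them in the result
pattern = r"(?<=url\(').*(?='\))"
url = re.search(pattern, style).group(0)
print(url)
```

Because the delimiters sit in zero-width assertions, group(0) is the bare URL with no quote or parenthesis to strip afterwards.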
Answer 2 (score: 0)
The style attribute contains CSS, which beautifulsoup doesn't know how to parse.
So the first step is to get the style attribute's content. Then you need to parse the CSS. You can parse it yourself (look for url(...)), and that will work fine as long as the site doesn't change much.
The other option is to use a dedicated CSS parser, such as tinycss. I would use a CSS parser; your code will be more resilient to site changes.
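A minimal sketch of the "parse it yourself" route using only the standard library (it tolerates single quotes, double quotes, or no quotes around the url(...) value, which covers the markup in the question):

```python
import re

# Style value from the question's div
style = "background-image: url('//site.org/photos/0/kp/cr/QOKPCRqjkbbldlo-400x225-noPad.jpg?1528717310')"

# Hand-rolled url(...) scan: optional opening quote, lazy capture of the
# address, optional closing quote, then the closing parenthesis
match = re.search(r"url\(['\"]?(.*?)['\"]?\)", style)
url = match.group(1) if match else None
print(url)
```

As the answer notes, this is fine for a stable site; a real CSS parser handles edge cases (whitespace, escaped characters, multiple backgrounds) that a one-off regex will not.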
Answer 3 (score: 0)
I'm also a beginner at web scraping; here is how you can solve your problem.
first=htmlSource.find('div', class_='flex-embed-content flex-embed-cover-image ')
get_style=first['style']
break_url=get_style.split(':')
break_url=break_url[1]
break_url=break_url.split("'")
final_url=break_url[1]
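Stepping through those splits on the sample style value shows why it works here (note this relies on the URL itself containing no ':', which holds for the scheme-relative URL in the question):

```python
# Style value from the question's div
style = "background-image: url('//site.org/photos/0/kp/cr/QOKPCRqjkbbldlo-400x225-noPad.jpg?1528717310')"

# Split on ':' -> ["background-image", " url('//site.org/...')"]
break_url = style.split(':')[1]
# Split on "'" -> [" url(", "//site.org/...", ")"]; index 1 is the URL
final_url = break_url.split("'")[1]
print(final_url)
```

A URL with an explicit scheme (e.g. https://...) would contain a ':' and break the first split, so the regex approaches above are safer in general.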
Answer 4 (score: 0)
One of the solutions is to use urlextract.
This class helps to find the urls in a string.
Implementation: the urlextract package.
Usage:
from urlextract import URLExtract
extractor = URLExtract()
Code:
from bs4 import BeautifulSoup
from urlextract import URLExtract

extractor = URLExtract()
soup = BeautifulSoup(html, "lxml")
finddiv = soup.find('div', class_='flex-embed-content flex-embed-cover-image')
style = finddiv['style']
for url in extractor.gen_urls(style):
    print(url)
    print('----')
    print('//' + url)
Output:
site.org/photos/0/kp/cr/QOKPCRqjkbbldlo-400x225-noPad.jpg?1528717310
----
//site.org/photos/0/kp/cr/QOKPCRqjkbbldlo-400x225-noPad.jpg?1528717310