I need to scrape small-business information from a public site.
This is the HTML:
<div class="listings">
<ul>
<li>
<h3>Machine Machine Company Inc</h3>
</li>
<li><a href="#government_funding" style="font-size:.8em;">View funding actions</a></li>
<li>Alexandria, AL 36250</li>
<li><span style="font-weight: bold;">Phone</span>: 256-830-3440</li>
<li><span style="font-weight: bold;">Estimated Number of Employees</span>: 64</li>
<li><span style="font-weight: bold;">Estimated Annual Receipts</span>: $9,691,383</li>
<li><span style="font-weight: bold;">Business Start Date</span>: 1971</li>
</ul>
<ul>
<li><span style="font-weight: bold;">Contact Person</span>: James HOland</li>
<li><span style="font-weight: bold;">Contact Phone</span>: 256-820-3440</li>
<li><span style="font-weight: bold;">Contact Email</span>: hhx@cableone.net</li>
</ul>
<ul>
<li><span style="font-weight: bold;">Business Structure</span>:</li>
<li>Corporate Entity (Not Tax Exempt)</li>
</ul>
<ul>
<li><span style="font-weight: bold;">Business Type</span>:</li>
<li>For Profit Organization</li>
<li>Manufacturer of Goods</li>
</ul>
<ul>
<li><span style="font-weight: bold;">Industries Served</span>: All Other Miscellaneous Fabricated Metal Product Manufacturing, All Other Miscellaneous General Purpose Machinery Manufacturing</li>
</ul>
<div style="padding-top: 10px;" id="government_funding">
<h2>Sampling of Recent Funding Actions/Set Asides</h2>
<p style="font-style: italic; font-size: .8em;">In order by amount of set aside monies.</p>
<ul>
<li><span style="color: green;">$500,000</span> - Tuesday the 29th of November 2016<br><span style="font-weight: bold; font-size: 1.2em;">Department Of Army</span> <br> W0LX ANNISTON DEPOT PROP DIV<br>IGF::CT::IGF. INCREASE FUNDING FOR THE ABRASIVE CLEAN OF VARIOUS PARTS
<hr>
</li>
</ul>
</div>
</div>
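My plan for extracting the data is to put all the "ul" tags into a container and then iterate over the ul elements in that container by index to find the text I need (for example, the email). So I have this Python script that tries to retrieve the email address: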
companydriver.get(weburl)
businessesoup = BeautifulSoup(companydriver.page_source,"html5lib");
#GET BUSINESS DATA
businesscontainer = businessesoup.find_all("ul")
dataresult = [c for c in businesscontainer]
print(colorama.Fore.BLUE + str(dataresult))
for idx, datacell in enumerate(dataresult, start=0):
    # arraylenght = dataresult.lenght
    # print("this is dataresult", dataresult)
    print("Index ", str(idx))
    print(colorama.Fore.RED + 'This is data cell', str(datacell))
    print(" ")
    if (idx == 1):
        emailaddress = dataresult.find("span").text
        print(colorama.Fore.GREEN + str(emailaddress))
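The problem is that I can't seem to get the email address.
I need to extract the following items:
How can I easily extract the email address and the rest of the fields?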
Answer 0 (score: 1)
What you are doing now will not work because you are extracting the text from the <span> element, whereas the information you want is in the <li> element that contains the <span>. I suggest you do the following:
For each <li> element:
1. Check whether it contains a <span> element and, if so, what the text of that element is.
2. If the <span> element has, for example, the text "Contact Email", then you know that the <li> element contains the information you need.
3. Once you have the <li> element, you can extract its text content. This will also contain the label itself (e.g. the "Contact Email" text), so you will need some post-processing, but that is not the hardest part of the whole task.
Edit: code
Based on your code, you could do something like the following to extract the email address (note: not guaranteed to work, but that is not the point).
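A minimal sketch of the steps listed in this answer, not the answerer's own snippet. It assumes the markup looks like the contact block in the question; the helper name `extract_field`, the `SAMPLE_HTML` constant, and the use of the built-in `html.parser` backend are choices made for illustration only.

```python
from bs4 import BeautifulSoup

# Shortened copy of the contact block from the question (illustrative only).
SAMPLE_HTML = """
<ul>
<li><span style="font-weight: bold;">Contact Person</span>: James HOland</li>
<li><span style="font-weight: bold;">Contact Phone</span>: 256-820-3440</li>
<li><span style="font-weight: bold;">Contact Email</span>: hhx@cableone.net</li>
</ul>
"""


def extract_field(soup, label):
    """Return the text that follows `label` inside its <li>, or None if absent."""
    for li in soup.find_all("li"):
        span = li.find("span")
        if span is None:
            continue  # this <li> carries no label
        if span.get_text(strip=True) == label:
            # The <li> text is "<label>: <value>"; drop the label and the colon.
            full_text = li.get_text(strip=True)
            return full_text[len(label):].lstrip(":").strip()
    return None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(extract_field(soup, "Contact Email"))
```

Looking a field up by its label rather than by the index of its <ul> keeps the lookup working if the page adds or reorders blocks.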
Answer 1 (score: 1)
You can try passing the text directly as the find_all argument. https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Example:
strings_to_search_for = ["Phone", "Estimated Number of Employees"]
businesscontainer = businessesoup.find_all(string=strings_to_search_for)
for element in businesscontainer:
    # element is the text node inside the <span>, so climb to the enclosing <li>
    value = element.find_parent("li").text  # get <li> value
    # do something ...
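Hope that helps.
Answer 2 (score: 0)
You can use a regular expression to find the strings you need and then take the parent of each matching object.
Edit
Explanation: by passing text=re.compile(...), we apply a regular expression to the text value of our Beautiful Soup objects. In this case we are interested in the span tags. Since we know the label text in the HTML, we can combine several alternatives in one regular expression. The ^ operator in a regex matches the start of the string, and () creates a subexpression, or capture group. So I wrote each of your conditions as a capture group, with the | (bar) symbol acting as a logical OR between them.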
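A small, self-contained sketch of this string-based lookup, under the assumption that every label sits in a <span> inside the <li> holding the value, as in the question's HTML. `SAMPLE_HTML`, the `found` dictionary and the `html.parser` backend are illustrative choices, not part of the answer.

```python
from bs4 import BeautifulSoup

# Shortened copy of the first block from the question (illustrative only).
SAMPLE_HTML = """
<ul>
<li><span style="font-weight: bold;">Phone</span>: 256-830-3440</li>
<li><span style="font-weight: bold;">Estimated Number of Employees</span>: 64</li>
<li><span style="font-weight: bold;">Business Start Date</span>: 1971</li>
</ul>
"""

labels = ["Phone", "Estimated Number of Employees"]
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

found = {}
# With string=<list>, find_all returns the text nodes equal to any list item.
for label_node in soup.find_all(string=labels):
    li = label_node.find_parent("li")  # climb from the text node to its <li>
    full_text = li.get_text(strip=True)  # e.g. "Phone: 256-830-3440"
    found[str(label_node)] = full_text.split(":", 1)[1].strip()

print(found)
```

Note that the matched objects are text nodes, so their direct parent is the <span>; `find_parent("li")` is what reaches the element that holds the value.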
from bs4 import BeautifulSoup
import re
html = """
<div class="listings">
<ul>
<li>
<h3>Machine Machine Company Inc</h3>
</li>
<li><a href="#government_funding" style="font-size:.8em;">View funding actions</a></li>
<li>Alexandria, AL 36250</li>
<li><span style="font-weight: bold;">Phone</span>: 256-830-3440</li>
<li><span style="font-weight: bold;">Estimated Number of Employees</span>: 64</li>
<li><span style="font-weight: bold;">Estimated Annual Receipts</span>: $9,691,383</li>
<li><span style="font-weight: bold;">Business Start Date</span>: 1971</li>
</ul>
<ul>
<li><span style="font-weight: bold;">Contact Person</span>: James HOland</li>
<li><span style="font-weight: bold;">Contact Phone</span>: 256-820-3440</li>
<li><span style="font-weight: bold;">Contact Email</span>: hhx@cableone.net</li>
</ul>
<ul>
<li><span style="font-weight: bold;">Business Structure</span>:</li>
<li>Corporate Entity (Not Tax Exempt)</li>
</ul>
<ul>
<li><span style="font-weight: bold;">Business Type</span>:</li>
<li>For Profit Organization</li>
<li>Manufacturer of Goods</li>
</ul>
<ul>
<li><span style="font-weight: bold;">Industries Served</span>: All Other Miscellaneous Fabricated Metal Product Manufacturing, All Other Miscellaneous General Purpose Machinery Manufacturing</li>
</ul>
<div style="padding-top: 10px;" id="government_funding">
<h2>Sampling of Recent Funding Actions/Set Asides</h2>
<p style="font-style: italic; font-size: .8em;">In order by amount of set aside monies.</p>
<ul>
<li><span style="color: green;">$500,000</span> - Tuesday the 29th of November 2016<br><span style="font-weight: bold; font-size: 1.2em;">Department Of Army</span> <br> W0LX ANNISTON DEPOT PROP DIV<br>IGF::CT::IGF. INCREASE FUNDING FOR THE ABRASIVE CLEAN OF VARIOUS PARTS
<hr>
</li>
</ul>
</div>
</div>
"""
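bs = BeautifulSoup(html, 'lxml')
# text= is the older name of this argument; newer Beautiful Soup versions call it string=
for span in bs.find_all('span', text=re.compile('^(Contact Email)|^(Business Type)|^(Phone)|^(Estimated Number of Employees)|^(Estimated Annual Receipts)|^(Contact Person)|^(Industries Served)|^(Department Of Army)')):
    # the value sits in the <li> that contains the matched <span>
    print(span.parent.text)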
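The loop above prints the whole <li> text, label included, and for "Business Type" the values sit in the following <li> siblings rather than in the labelled <li> itself. The sketch below is one possible way to turn the matches into a dictionary under those assumptions. `collect_fields`, `WANTED` and `SAMPLE_HTML` are hypothetical names, and the `string=` keyword assumes Beautiful Soup 4.4.0 or later.

```python
import re
from bs4 import BeautifulSoup

# Two blocks from the question, shortened (illustrative only).
SAMPLE_HTML = """
<ul>
<li><span style="font-weight: bold;">Contact Email</span>: hhx@cableone.net</li>
</ul>
<ul>
<li><span style="font-weight: bold;">Business Type</span>:</li>
<li>For Profit Organization</li>
<li>Manufacturer of Goods</li>
</ul>
"""

# One non-capturing group holding all wanted labels, anchored at both ends.
WANTED = re.compile(r"^(?:Contact Email|Business Type)$")


def collect_fields(soup):
    """Map each wanted label to its value (a str, or a list for multi-line fields)."""
    fields = {}
    for span in soup.find_all("span", string=WANTED):
        label = span.get_text(strip=True)
        li = span.find_parent("li")
        # Text left in the labelled <li> after removing the label and colon.
        inline = li.get_text(strip=True)[len(label):].lstrip(":").strip()
        if inline:
            fields[label] = inline
        else:
            # Value lines live in the <li> elements that follow the label.
            fields[label] = [
                sibling.get_text(strip=True)
                for sibling in li.find_next_siblings("li")
            ]
    return fields


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
fields = collect_fields(soup)
print(fields)
```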