Question

I need to scrape small-business information from a public site.

This is the HTML:

<div class="listings">
    <ul>
        <li>
            <h3>Machine Machine Company Inc</h3>
        </li>
        <li><a href="#government_funding" style="font-size:.8em;">View funding actions</a></li>
        <li>Alexandria, AL 36250</li>
        <li><span style="font-weight: bold;">Phone</span>: 256-830-3440</li>
        <li><span style="font-weight: bold;">Estimated Number of Employees</span>: 64</li>
        <li><span style="font-weight: bold;">Estimated Annual Receipts</span>: $9,691,383</li>
        <li><span style="font-weight: bold;">Business Start Date</span>: 1971</li>
    </ul>
    <ul>
        <li><span style="font-weight: bold;">Contact Person</span>: James HOland</li>
        <li><span style="font-weight: bold;">Contact Phone</span>: 256-820-3440</li>
        <li><span style="font-weight: bold;">Contact Email</span>: hhx@cableone.net</li>
    </ul>
    <ul>
        <li><span style="font-weight: bold;">Business Structure</span>:</li>
        <li>Corporate Entity (Not Tax Exempt)</li>
    </ul>
    <ul>
        <li><span style="font-weight: bold;">Business Type</span>:</li>
        <li>For Profit Organization</li>
        <li>Manufacturer of Goods</li>
    </ul>
    <ul>
        <li><span style="font-weight: bold;">Industries Served</span>: All Other Miscellaneous Fabricated Metal Product Manufacturing, All Other Miscellaneous General Purpose Machinery Manufacturing</li>
    </ul>
    <div style="padding-top: 10px;" id="government_funding">
        <h2>Sampling of Recent  Funding Actions/Set Asides</h2>
        <p style="font-style: italic; font-size: .8em;">In order by amount of set aside monies.</p>
        <ul>
            <li><span style="color: green;">$500,000</span> - Tuesday the 29th of November 2016<br><span style="font-weight: bold; font-size: 1.2em;">Department Of Army</span> <br> W0LX ANNISTON DEPOT PROP DIV<br>IGF::CT::IGF. INCREASE FUNDING FOR THE ABRASIVE CLEAN OF VARIOUS PARTS
                <hr>
            </li>
        </ul>
    </div>
</div>

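My plan for extracting the data is to put all the "ul" tags into a container, then loop over every ul in the container by index to find the text I need (for example, the email). So I have this Python script that tries to retrieve the email address: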

companydriver.get(weburl)

businessesoup = BeautifulSoup(companydriver.page_source,"html5lib");

#GET BUSINESS DATA
businesscontainer = businessesoup.find_all("ul")

dataresult = [c for c in businesscontainer]

print(colorama.Fore.BLUE +  str(dataresult))

for idx, datacell in enumerate(dataresult, start=0):
    # arraylenght = dataresult.lenght
    # print("this is dataresult", dataresult)
    print("Index ", str(idx))
    print(colorama.Fore.RED +'This is data cell',str(datacell))
    print(" ")

    if (idx == 1)  :
        emailaddress = dataresult.find("span").text
        print(colorama.Fore.GREEN + str(emailaddress))

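The problem is that I can't seem to get the email address.

I need to extract the following items:

Phone
Estimated Number of Employees
Estimated Annual Receipts
Contact Person
Contact Email
Industries Served
Department Of Army

How can I easily extract the email address and the rest of the items?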

Answer 1

What you are doing now will not work because you are extracting the text from the <span> element, whereas the information you want is in the <li> element that contains the <span>. I suggest doing the following:

For each <li> element:

Check whether it contains a <span> element and, if so, what the text of that element is.
If there is indeed a <span> element with, for example, the text "Contact Email", then you know the <li> element contains the information you want.
If you have found an <li> element containing the information you want, you can extract its text content. This will probably also include (for example) the "Contact Email" text, so you will need some post-processing, but that is not the hardest part of the whole task.

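Edit: code

Based on your code, you could do something like the following to extract the email address (note: not guaranteed to work, but that is not the point).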
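
A minimal sketch of the steps above, assuming a trimmed copy of the HTML from the question and the built-in "html.parser" backend. The helper name find_field is hypothetical and only used for illustration:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the HTML from the question.
html = """
<div class="listings">
    <ul>
        <li><span style="font-weight: bold;">Phone</span>: 256-830-3440</li>
    </ul>
    <ul>
        <li><span style="font-weight: bold;">Contact Person</span>: James HOland</li>
        <li><span style="font-weight: bold;">Contact Email</span>: hhx@cableone.net</li>
    </ul>
</div>
"""


def find_field(soup, label):
    """Return the text that follows `label` inside its <li>, or None."""
    for li in soup.find_all("li"):
        span = li.find("span")
        # Only an <li> whose <span> holds the wanted label qualifies.
        if span is not None and span.get_text(strip=True) == label:
            text = li.get_text(strip=True)
            # Post-processing: drop the label and the ":" separator.
            return text[len(label):].lstrip(":").strip()
    return None


soup = BeautifulSoup(html, "html.parser")
email = find_field(soup, "Contact Email")
print(email)
```

The same helper can be called once per label (for example "Phone" or "Contact Person"); it returns None when no <li> carries that label.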


Answer 2

You could try passing the text directly as an argument to find_all. https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Example:

strings_to_search_for = ["Phone", "Estimated Number of Employees"]

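businesscontainer = businessesoup.find_all(string=strings_to_search_for)
for element in businesscontainer:
    # element is the matched text node; its direct parent is the <span>,
    # so walk up to the enclosing <li> to get the label plus the value
    value = element.find_parent("li").text
    # do something ...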
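
To show what the loop can produce, here is a small self-contained sketch that collects the matches into a dictionary. It assumes a trimmed copy of the question's HTML and the "html.parser" backend; the names labels and results are illustrative only:

```python
from bs4 import BeautifulSoup

html = """
<ul>
    <li><span style="font-weight: bold;">Phone</span>: 256-830-3440</li>
    <li><span style="font-weight: bold;">Estimated Number of Employees</span>: 64</li>
    <li><span style="font-weight: bold;">Business Start Date</span>: 1971</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
labels = ["Phone", "Estimated Number of Employees"]

results = {}
# string=[...] matches text nodes equal to any entry in the list.
for node in soup.find_all(string=labels):
    li = node.find_parent("li")
    # partition(":") splits "Phone: 256-830-3440" at the first colon.
    _, _, value = li.get_text().partition(":")
    results[str(node)] = value.strip()

print(results)
```

Labels that are not listed, such as "Business Start Date" here, are simply skipped.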

Hope this helps.

Answer 3

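You can use a regular expression to find the strings you need and then get the parent of each matching object:

Edit

Explanation: with the text=re.compile argument (spelled string= in newer Beautiful Soup releases), we can apply a regular expression to the text values of our Beautiful Soup object. In this case we are interested in the span tags. Since we know the text in the HTML, we can combine several patterns in one regular expression. The ^ operator anchors the match at the start of the string, and () creates a subexpression, or match group. So I wrote each of your conditions as a match group, with the | (bar) symbol acting as a logical OR between them.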
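
The effect of ^ and | can be checked with the re module alone. The sketch below uses a shortened pattern and a made-up list of candidate strings; it assumes the pattern is applied with search, which is my understanding of how Beautiful Soup applies compiled patterns:

```python
import re

# Each alternative is anchored with ^, so a label must start the string.
anchored = re.compile(r"^(Contact Email)|^(Phone)|^(Contact Person)")
# Same alternatives without the anchors, for comparison.
unanchored = re.compile(r"(Contact Email)|(Phone)|(Contact Person)")

candidates = ["Contact Email", "Phone", "Contact Phone", "View funding actions"]

matched = [text for text in candidates if anchored.search(text)]
loose = [text for text in candidates if unanchored.search(text)]

print(matched)
print(loose)
```

With the anchors, "Contact Phone" is rejected because "Phone" does not start the string; without them it would be picked up as well.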

http://rextester.com/KBB57950

from bs4 import BeautifulSoup
import re

html = """
<div class="listings">
    <ul>
        <li>
            <h3>Machine Machine Company Inc</h3>
        </li>
        <li><a href="#government_funding" style="font-size:.8em;">View funding actions</a></li>
        <li>Alexandria, AL 36250</li>
        <li><span style="font-weight: bold;">Phone</span>: 256-830-3440</li>
        <li><span style="font-weight: bold;">Estimated Number of Employees</span>: 64</li>
        <li><span style="font-weight: bold;">Estimated Annual Receipts</span>: $9,691,383</li>
        <li><span style="font-weight: bold;">Business Start Date</span>: 1971</li>
    </ul>
    <ul>
        <li><span style="font-weight: bold;">Contact Person</span>: James HOland</li>
        <li><span style="font-weight: bold;">Contact Phone</span>: 256-820-3440</li>
        <li><span style="font-weight: bold;">Contact Email</span>: hhx@cableone.net</li>
    </ul>
    <ul>
        <li><span style="font-weight: bold;">Business Structure</span>:</li>
        <li>Corporate Entity (Not Tax Exempt)</li>
    </ul>
    <ul>
        <li><span style="font-weight: bold;">Business Type</span>:</li>
        <li>For Profit Organization</li>
        <li>Manufacturer of Goods</li>
    </ul>
    <ul>
        <li><span style="font-weight: bold;">Industries Served</span>: All Other Miscellaneous Fabricated Metal Product Manufacturing, All Other Miscellaneous General Purpose Machinery Manufacturing</li>
    </ul>
    <div style="padding-top: 10px;" id="government_funding">
        <h2>Sampling of Recent  Funding Actions/Set Asides</h2>
        <p style="font-style: italic; font-size: .8em;">In order by amount of set aside monies.</p>
        <ul>
            <li><span style="color: green;">$500,000</span> - Tuesday the 29th of November 2016<br><span style="font-weight: bold; font-size: 1.2em;">Department Of Army</span> <br> W0LX ANNISTON DEPOT PROP DIV<br>IGF::CT::IGF. INCREASE FUNDING FOR THE ABRASIVE CLEAN OF VARIOUS PARTS
                <hr>
            </li>
        </ul>
    </div>
</div>
"""

bs = BeautifulSoup(html,'lxml')
for li in bs.find_all('span',text=re.compile('^(Contact Email)|^(Business Type)|^(Phone)|^(Estimated Number of Employees)|^(Estimated Annual Receipts)|^(Contact Person)|^(Industries Served)|^(Department Of Army)')):
    print(li.parent.text)
