Unable to extract text from a website's HTML

Date: 2018-09-25 17:37:00

Tags: python html beautifulsoup

I need to scrape small-business information from a public site.

Here is the HTML:

<div class="listings">
    <ul>
        <li>
            <h3>Machine Machine Company Inc</h3>
        </li>
        <li><a href="#government_funding" style="font-size:.8em;">View funding actions</a></li>
        <li>Alexandria, AL 36250</li>
        <li><span style="font-weight: bold;">Phone</span>: 256-830-3440</li>
        <li><span style="font-weight: bold;">Estimated Number of Employees</span>: 64</li>
        <li><span style="font-weight: bold;">Estimated Annual Receipts</span>: $9,691,383</li>
        <li><span style="font-weight: bold;">Business Start Date</span>: 1971</li>
    </ul>
    <ul>
        <li><span style="font-weight: bold;">Contact Person</span>: James HOland</li>
        <li><span style="font-weight: bold;">Contact Phone</span>: 256-820-3440</li>
        <li><span style="font-weight: bold;">Contact Email</span>: hhx@cableone.net</li>
    </ul>
    <ul>
        <li><span style="font-weight: bold;">Business Structure</span>:</li>
        <li>Corporate Entity (Not Tax Exempt)</li>
    </ul>
    <ul>
        <li><span style="font-weight: bold;">Business Type</span>:</li>
        <li>For Profit Organization</li>
        <li>Manufacturer of Goods</li>
    </ul>
    <ul>
        <li><span style="font-weight: bold;">Industries Served</span>: All Other Miscellaneous Fabricated Metal Product Manufacturing, All Other Miscellaneous General Purpose Machinery Manufacturing</li>
    </ul>
    <div style="padding-top: 10px;" id="government_funding">
        <h2>Sampling of Recent  Funding Actions/Set Asides</h2>
        <p style="font-style: italic; font-size: .8em;">In order by amount of set aside monies.</p>
        <ul>
            <li><span style="color: green;">$500,000</span> - Tuesday the 29th of November 2016<br><span style="font-weight: bold; font-size: 1.2em;">Department Of Army</span> <br> W0LX ANNISTON DEPOT PROP DIV<br>IGF::CT::IGF. INCREASE FUNDING FOR THE ABRASIVE CLEAN OF VARIOUS PARTS
                <hr>
            </li>
        </ul>
    </div>
</div>

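My plan for extracting the data is to put all the "ul" tags into a container and then iterate over every ul in the container by index to find the text I need (for example, the email). So I have this Python script that tries to retrieve the email address: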

companydriver.get(weburl)

businessesoup = BeautifulSoup(companydriver.page_source,"html5lib");

#GET BUSINESS DATA
businesscontainer = businessesoup.find_all("ul")

dataresult = [c for c in businesscontainer]

print(colorama.Fore.BLUE +  str(dataresult))

for idx, datacell in enumerate(dataresult, start=0):
    # arraylenght = dataresult.lenght
    # print("this is dataresult", dataresult)
    print("Index ", str(idx))
    print(colorama.Fore.RED +'This is data cell',str(datacell))
    print(" ")

    if (idx == 1)  :
        emailaddress = dataresult.find("span").text
        print(colorama.Fore.GREEN + str(emailaddress))

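The problem is that I can't seem to get the email address.

I need to extract the following items:

  • Phone
  • Estimated Number of Employees
  • Estimated Annual Receipts
  • Contact Person
  • Contact Email
  • Industries Served
  • Department Of Army

How can I easily extract the email address and the rest of the items?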

3 Answers:

Answer 0: (score: 1)

What you are doing now will not work because you are extracting the text from the <span> element, while the information you want is contained in the <li> element that the <span> element sits in. I suggest you do the following:

For each <li> element:

  • Check whether it contains a <span> element and, if so, what that element's text is.
  • If there is indeed a <span> element with, for example, the text "Contact Email", then you know the enclosing <li> element contains the information you want.
  • Once you have found the <li> element that contains the information you want, you can extract its text content. This will probably also include the label text (for example "Contact Email"), so some post-processing is needed, but that is not the hardest part of the whole task.

Edit: code

Based on your code, you could do something like the following to extract the email address (note: not guaranteed to work, but that is not the point)
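A minimal sketch of that approach, written for illustration rather than taken from the answer. It assumes a trimmed copy of the question's markup and uses the built-in "html.parser"; the names SAMPLE_HTML and find_value are hypothetical.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup, used here only for illustration.
SAMPLE_HTML = """
<div class="listings">
    <ul>
        <li><span style="font-weight: bold;">Contact Person</span>: James HOland</li>
        <li><span style="font-weight: bold;">Contact Phone</span>: 256-820-3440</li>
        <li><span style="font-weight: bold;">Contact Email</span>: hhx@cableone.net</li>
    </ul>
</div>
"""


def find_value(soup, label):
    """Return the text that follows the <span> labelled `label`, or None."""
    for li in soup.find_all("li"):
        span = li.find("span")
        # Skip <li> elements without a <span> or with a different label.
        if span is None or span.get_text(strip=True) != label:
            continue
        # The <li> text looks like "Contact Email: hhx@cableone.net";
        # drop the label, then the colon and spaces that follow it.
        full_text = li.get_text().strip()
        return full_text[len(label):].lstrip(": ").strip()
    return None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
emailaddress = find_value(soup, "Contact Email")
print(emailaddress)
```

The same helper can be called with any of the other labels, such as "Contact Phone". It returns None when no <li> carries the requested label.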


Answer 1: (score: 1)

You can try passing the text directly as an argument to find_all. https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Example:

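strings_to_search_for = ["Phone", "Estimated Number of Employees"]

# Each match is the label's text node; its direct parent is the <span>.
businesscontainer = businessesoup.find_all(string=strings_to_search_for)
for element in businesscontainer:
    value = element.find_parent("li").text  # text of the enclosing <li>, e.g. "Phone: 256-830-3440"
    # do something ...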
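As a self-contained illustration of the same idea, the sketch below collects the matched labels and their values into a dictionary. It assumes a trimmed copy of the question's markup and the built-in "html.parser"; the names SAMPLE_HTML and details are hypothetical.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup, used here only for illustration.
SAMPLE_HTML = """
<ul>
    <li><span style="font-weight: bold;">Phone</span>: 256-830-3440</li>
    <li><span style="font-weight: bold;">Estimated Number of Employees</span>: 64</li>
    <li><span style="font-weight: bold;">Business Start Date</span>: 1971</li>
</ul>
"""

strings_to_search_for = ["Phone", "Estimated Number of Employees"]

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
details = {}
# A list passed as `string` matches text nodes equal to any of its items.
for label in soup.find_all(string=strings_to_search_for):
    li = label.find_parent("li")
    if li is None:
        continue
    # "Phone: 256-830-3440" -> keep what follows the first colon.
    _, _, value = li.get_text().partition(":")
    details[str(label)] = value.strip()

print(details)
```

Labels that are not in the list, such as "Business Start Date", are left out of the result.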

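Hope that helps.

Answer 2: (score: 0)

You can use a regular expression (RE) to find the strings you need and then get each match's parent object:

Edit

Explanation: by passing text=re.compile(...), we apply a regular expression to the text value of our Beautiful Soup objects. In this case we are interested in the span tags. Since we know the label text in the HTML, we can combine several alternatives in one regular expression. The ^ operator anchors the match to the start of the string, and () creates a subexpression, or match group. So I wrote each of your conditions as a match group and joined them with the | (pipe) symbol, which acts as a logical OR.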
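To see the anchoring and alternation in isolation, here is a small sketch that uses only the re module. The shortened pattern and the labels list are made up for illustration. It calls search(), which is how Beautiful Soup applies a compiled pattern when filtering.

```python
import re

# Each alternative is anchored with ^ and wrapped in its own group.
pattern = re.compile(r"^(Contact Email)|^(Phone)|^(Industries Served)")

labels = ["Phone", "Contact Phone", "Contact Email", "Business Start Date"]

# Keep only the labels that start with one of the alternatives.
matched = [label for label in labels if pattern.search(label)]
print(matched)
```

"Contact Phone" is not matched: it contains "Phone", but the ^ anchor requires the match to begin at the start of the string.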

http://rextester.com/KBB57950

from bs4 import BeautifulSoup
import re

html = """
<div class="listings">
    <ul>
        <li>
            <h3>Machine Machine Company Inc</h3>
        </li>
        <li><a href="#government_funding" style="font-size:.8em;">View funding actions</a></li>
        <li>Alexandria, AL 36250</li>
        <li><span style="font-weight: bold;">Phone</span>: 256-830-3440</li>
        <li><span style="font-weight: bold;">Estimated Number of Employees</span>: 64</li>
        <li><span style="font-weight: bold;">Estimated Annual Receipts</span>: $9,691,383</li>
        <li><span style="font-weight: bold;">Business Start Date</span>: 1971</li>
    </ul>
    <ul>
        <li><span style="font-weight: bold;">Contact Person</span>: James HOland</li>
        <li><span style="font-weight: bold;">Contact Phone</span>: 256-820-3440</li>
        <li><span style="font-weight: bold;">Contact Email</span>: hhx@cableone.net</li>
    </ul>
    <ul>
        <li><span style="font-weight: bold;">Business Structure</span>:</li>
        <li>Corporate Entity (Not Tax Exempt)</li>
    </ul>
    <ul>
        <li><span style="font-weight: bold;">Business Type</span>:</li>
        <li>For Profit Organization</li>
        <li>Manufacturer of Goods</li>
    </ul>
    <ul>
        <li><span style="font-weight: bold;">Industries Served</span>: All Other Miscellaneous Fabricated Metal Product Manufacturing, All Other Miscellaneous General Purpose Machinery Manufacturing</li>
    </ul>
    <div style="padding-top: 10px;" id="government_funding">
        <h2>Sampling of Recent  Funding Actions/Set Asides</h2>
        <p style="font-style: italic; font-size: .8em;">In order by amount of set aside monies.</p>
        <ul>
            <li><span style="color: green;">$500,000</span> - Tuesday the 29th of November 2016<br><span style="font-weight: bold; font-size: 1.2em;">Department Of Army</span> <br> W0LX ANNISTON DEPOT PROP DIV<br>IGF::CT::IGF. INCREASE FUNDING FOR THE ABRASIVE CLEAN OF VARIOUS PARTS
                <hr>
            </li>
        </ul>
    </div>
</div>
"""

bs = BeautifulSoup(html,'lxml')
# Each match is a label <span>; print the text of the <li> that contains it.
for span in bs.find_all('span',text=re.compile('^(Contact Email)|^(Business Type)|^(Phone)|^(Estimated Number of Employees)|^(Estimated Annual Receipts)|^(Contact Person)|^(Industries Served)|^(Department Of Army)')):
    print(span.parent.text)