Retrieving text from HTML does not work in Python

Time: 2018-09-24 10:05:23

Tags: python html

I am trying to scrape a company's contact information. I can already get everything except the phone number. Here is the HTML:

<ul>
    <li>
        <h3>Harrrrrell   INC</h3>
    </li>
    <li>43 Airpark Ct</li>
    <li>Alabaster, MD 35107</li>
    <li><span style="font-weight: bold;">Phone</span>: 888-232-8358</li>
    <li><span style="font-weight: bold;">Corporate URL</span>: <a href="http://www.hhsales.com" rel="nofollow" target="new">www.h23hsales.com</a></li>
    <li><span style="font-weight: bold;">More Detail</span>:<br> <a href="https://www.collierreporting.com/company/harrell-and-hall-enterprises-inc-alabaster-al">Click for Full Harrell &amp; Hall Enterprises INC Dossier</a></li>
</ul>

This Python script works for everything in this HTML except the phone number.

for companyLIST in result[0:]:
            try:

                companyname = companyLIST.find('h3').contents[0]
                print("Company Name ",str(companyname) )
            except Exception as e:
                print("errror",str(e))

            try:
                companySt = companyLIST.find_all('li')[1].contents[0]
                print("Company St ",str(companySt) )
            except Exception as e:
                print("errror",str(e))

            try:
                companyCity = companyLIST.find_all('li')[2].contents[0]
                print("Company City ",str(companyCity) )
            except Exception as e:
                print("errror",str(e))

            try:
                companyPhone= companyLIST.find('li')[3].contents[0]
                print("Company Phone ",companyPhone )

            except Exception as e:
                print("errror",str(e))

            try:
                companyWeb = companyLIST.find('a')['href'] 

                print("Company Web ",str(companyWeb) )
                print("  " )

            except Exception as e:
                print("errror",str(e))

Here is the sample output:

Company Name Harrrrrell INC

Company St 43 Airpark Ct

Company City Alabaster, MD 35107

errror 3

Company Web https://www.collierreporting.com/company/harrell-and-hall-enterprises-inc-alabaster-al

  

Traceback (most recent call last):

  File "sample.py", line 26, in <module>
    companyPhone = soup.find('li')[3].contents[0]
  File "...dist-packages/bs4/element.py", line 1011, in __getitem__
    return self.attrs[key]
KeyError: 3

How can I rewrite the code below to get the phone number?

companyPhone= companyLIST.find('li')[3].contents[0]
print("Company Phone ",companyPhone )

2 Answers:

Answer 0: (score: 1)

I assume you are using the BeautifulSoup4 library to parse the HTML. If so, you can get the phone number from the HTML like this:

import re

# The fourth <li> holds the <span> label followed by the text ": 888-232-8358"
text = soup.find_all('li')[3].contents[1]
phone_number = re.sub(": ", "", text)

print(phone_number)
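
For reference, here is a self-contained sketch of why the original line fails and how find_all avoids the error. It assumes BeautifulSoup 4 is installed and uses the built-in html.parser backend. The HTML is a trimmed copy of the snippet from the question, and the variable names (find_error, phone_item, raw_text) are illustrative only.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the HTML from the question.
html = """
<ul>
    <li><h3>Harrrrrell   INC</h3></li>
    <li>43 Airpark Ct</li>
    <li>Alabaster, MD 35107</li>
    <li><span style="font-weight: bold;">Phone</span>: 888-232-8358</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns a single Tag. Indexing a Tag looks up an HTML attribute,
# so [3] raises KeyError: 3 instead of selecting the fourth <li>.
try:
    soup.find("li")[3]
    find_error = None
except KeyError as exc:
    find_error = exc

# find_all() returns a list-like ResultSet, so [3] is the fourth <li>.
phone_item = soup.find_all("li")[3]

# That <li> has two children: the <span> label and the text ": 888-232-8358".
raw_text = phone_item.contents[1]

# Drop the leading colon and the surrounding whitespace.
phone_number = str(raw_text).lstrip(":").strip()

print(repr(find_error))
print(phone_number)
```

Using lstrip and strip here instead of re.sub is only a stylistic choice; both remove the ": " separator from this particular string.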

Answer 1: (score: 0)

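Replace

companyPhone= companyLIST.find('li')[3].contents[0]
print("Company Phone ",companyPhone )

with code that calls find_all('li') rather than find('li'), so that index 3 selects the fourth list item instead of looking up an attribute named 3 on the first one.

The replacement splits the item's text on the ':' character, takes the last element, and removes the leftover characters, so only the phone number remains as a single string. You can do the same for the other lines: choose the split character or string carefully, then use the replace function to clean up the resulting list elements.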
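
The split-and-clean approach described in this answer can be sketched as follows. This is an illustration under assumptions, not the answerer's original code: the helper name value_after_colon is hypothetical, and it operates on the plain text of a list item (for example, what get_text() would return for the Phone item).

```python
def value_after_colon(item_text):
    """Return the part of 'Label: value' that follows the last colon."""
    # Split on ':' and keep the last element, as the answer describes.
    last_part = item_text.split(":")[-1]
    # Clean up: turn stray newlines into spaces, then trim the ends.
    return last_part.replace("\n", " ").strip()


print(value_after_colon("Phone: 888-232-8358"))
```

Splitting on every ':' is fine for the phone line, but it would cut a value that itself contains a colon, such as a URL. For those lines a different separator, or item_text.split(": ", 1), is the safer choice.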

Hope this helps.