Question

I'm new to Python, and I have an HTML text file that I want to search through with Python 2.7.

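The code below is just a sample for one company. In the full HTML text file, the code structure is the same for all the other companies, and the blocks are positioned one below another (in case that information helps).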

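Basically, I want to extract certain information (such as company name, location, phone number, and website) in order, so that the data is assigned to the right organization, like this: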

Liberty Associates LLC | New York    | +1 973-344-8300 | www.liberty.edu
Company B              | Los Angeles | +1 213-802-1770 | perchla.com

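Sorry if I'm not being concise enough, but any suggestions on how to start the script and what it should look like would be very helpful!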

Code:


<body><div class="tab_content-wrapper noPrint"><div class="tab_content_card">
            <div class="card-header">
                <strong title="" d.="" kon.="" nl="">"Liberty Associates LLC"</strong>
                <span class="tel" title="Phone contacts">Phone contacts</span>
			
            </div>
            <div class="card-content">
                
				
                <table>
                    <tbody>
                        <tr>
                            <td colspan="4">
                                
                                <label class="downdrill-sbi" title="Industry: Immigration">Industry: Immigration</label>
                            </td>
                        </tr>
                        <tr>
                            <td width="20">&nbsp;</td>
                            <td width="245">&nbsp;</td>
                            <td width="50">&nbsp;</td>
                            <td width="80">&nbsp;</td>
                        </tr>
                        <tr>
                            <td colspan="2">
59 Wall St</td>
                            <td></td>
                            <td></td>
                        </tr>
                        <tr>
                            <td colspan="2">NJ 07105&nbsp;&nbsp;
                                
                                <label class="downdrill-sbi" title="New York">New York</label>
                            </td>
                            <td></td>
                            <td></td>
                        </tr>
                        <tr>
                            <td>&nbsp;</td>
                            <td>&nbsp;</td>
                            <td>&nbsp;</td>
                            <td>&nbsp;</td>
                        </tr>
                        <tr><td>Phone:</td><td>+1 973-344-8300</td><td>Firm Nr:</td><td>KL4568TL</td></tr>
                        <tr><td>Fax:</td><td>+1 973-344-8300</td><td colspan="2"></td></tr>
                        <tr>
                            <td colspan="2"> <a href="http://www.liberty.edu/" target="_blank">www.liberty.edu</a> </td>
                            <td>Active:</td>
                            <td>Yes</td>
                        </tr>
                    </tbody>
                </table>
            </div>
            

        </div></div></body>


How it looks on the web page:

Edit

So with ajputnam's help, I now have this:

from lxml import html

# read the file into a string (avoid shadowing the built-in name str)
with open('test_html.txt', 'r') as f:
    html_text = f.read()
tree = html.fromstring(html_text)

name = tree.xpath("/html/body/div/div/div[1]/strong/text()")
place = tree.xpath("/html/body/div/div/div[2]/table/tbody/tr[4]/td[1]/label/text()")
phone = tree.xpath("/html/body/div/div/div[2]/table/tbody/tr[6]/td[2]/text()")
url = tree.xpath("/html/body/div/div/div[2]/table/tbody/tr[8]/td[1]/a/text()")

print(name, place, phone, url)

This prints:

(['"Liberty Associates LLC"'], ['New York'], ['+1 973-344-8300'], ['www.liberty.edu'])

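However, when I run this code on the whole HTML file (which holds data for multiple companies), I get all the matches for each variable one after another. How do I correctly use [0] to get the data in a structure like this?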

Liberty Associates LLC | New York    | +1 973-344-8300 | www.liberty.edu
Company B              | Los Angeles | +1 213-802-1770 | perchla.com

Answer 1

First, you need to get the HTML from the page. You can use a library like requests to do this.

from lxml import html
import requests

page = requests.get('url')
tree = html.fromstring(page.content)

Then you can use selectors to access the content in the tree.

prices = tree.xpath('//span[@class="item-price"]/text()')
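
To illustrate what such a selector returns, here is a minimal sketch that runs the same XPath against an inline string instead of a downloaded page. The markup and the price values are made up for the example; only the selector comes from the line above.

```python
from lxml import html

# Hypothetical markup, used only to show the shape of the result
page_source = '''
<html><body>
  <span class="item-price">$29.95</span>
  <span class="item-price">$8.37</span>
</body></html>
'''

tree = html.fromstring(page_source)

# xpath() always returns a list, one entry per matching text node
prices = tree.xpath('//span[@class="item-price"]/text()')
print(prices)
```

Because the result is always a list, a single value has to be taken with an index such as prices[0].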

Or you can parse the string the normal way.
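
As a rough sketch of plain string parsing, the standard-library re module can pull out the company name. This assumes the name always sits inside a strong element, as in the question's sample; regular expressions are fragile on real HTML, so treat it as an illustration rather than a recommendation.

```python
import re

# One line taken from the question's sample (attributes shortened)
snippet = '<div class="card-header"><strong title="">"Liberty Associates LLC"</strong></div>'

# Capture whatever sits between <strong ...> and </strong>
names = re.findall(r'<strong[^>]*>(.*?)</strong>', snippet, flags=re.S)

# Drop surrounding whitespace and the literal double quotes around the name
names = [n.strip().strip('"') for n in names]
print(names)
```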

See: HTML scraping

Reading from a file

from lxml import html

# read the HTML from the file as a string
with open('file.html', 'r') as f:
    html_text = f.read()
tree = html.fromstring(html_text)

# every company name in the file, as a list
company = tree.xpath('//div[@class="card-header"]/strong/text()')
print(company)
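
To keep each company's fields together instead of getting one long list per field, one option is to select every card first and then run relative XPath queries (starting with a dot) inside each card. The following is a minimal sketch under several assumptions: every company sits in its own div with class tab_content_card, the phone number is the cell right after the "Phone:" cell, and the city is the label whose title does not start with "Industry". The helper names first and extract_companies, the shortened sample markup, and the second card's industry are illustrative and not from the original post.

```python
from lxml import html

# Shortened sample with two cards; the second card's markup is assumed
# to mirror the structure of the first one shown in the question.
SAMPLE = """
<body><div class="tab_content-wrapper noPrint">
<div class="tab_content_card">
  <div class="card-header"><strong>"Liberty Associates LLC"</strong></div>
  <div class="card-content"><table><tbody>
    <tr><td><label class="downdrill-sbi" title="Industry: Immigration">Industry: Immigration</label></td></tr>
    <tr><td>NJ 07105 <label class="downdrill-sbi" title="New York">New York</label></td></tr>
    <tr><td>Phone:</td><td>+1 973-344-8300</td></tr>
    <tr><td><a href="http://www.liberty.edu/">www.liberty.edu</a></td></tr>
  </tbody></table></div>
</div>
<div class="tab_content_card">
  <div class="card-header"><strong>"Company B"</strong></div>
  <div class="card-content"><table><tbody>
    <tr><td><label class="downdrill-sbi" title="Industry: Example">Industry: Example</label></td></tr>
    <tr><td>CA 90001 <label class="downdrill-sbi" title="Los Angeles">Los Angeles</label></td></tr>
    <tr><td>Phone:</td><td>+1 213-802-1770</td></tr>
    <tr><td><a href="http://perchla.com/">perchla.com</a></td></tr>
  </tbody></table></div>
</div>
</div></body>
"""


def first(node, path):
    """Return the first XPath match under node, stripped, or '' if none."""
    matches = node.xpath(path)
    return matches[0].strip() if matches else ''


def extract_companies(html_text):
    """Return one (name, place, phone, url) tuple per company card."""
    tree = html.fromstring(html_text)
    rows = []
    for card in tree.xpath('//div[@class="tab_content_card"]'):
        # The leading dot keeps each query inside the current card
        name = first(card, './/div[@class="card-header"]/strong/text()').strip('"')
        place = first(card, './/label[@class="downdrill-sbi"]'
                            '[not(starts-with(@title, "Industry"))]/text()')
        phone = first(card, './/td[normalize-space()="Phone:"]'
                            '/following-sibling::td[1]/text()')
        url = first(card, './/a/text()')
        rows.append((name, place, phone, url))
    return rows


rows = extract_companies(SAMPLE)
for row in rows:
    print(' | '.join(row))
```

The helper first plays the role of [0] from the question, but returns an empty string when a card lacks a field, so one incomplete company does not raise an IndexError or shift the other rows.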

How do I extract specific data from an HTML page using Python?
