Python: extracting data from HTML using split

Time: 2013-02-23 05:12:09

Tags: python html-parsing

A page retrieved from a URL has the following markup:

<p>
    <strong>Name:</strong> Pasan <br/>
    <strong>Surname: </strong> Wijesingher <br/>                    
    <strong>Former/AKA Name:</strong> No Former/AKA Name <br/>                    
    <strong>Gender:</strong> Male <br/>
    <strong>Language Fluency:</strong> ENGLISH <br/>                    
</p>

I want to extract the data for Name, Surname, and so on (I have to repeat this task for many pages).

To do that, I tried the following code:

import urllib2

url = 'http://www.my.lk/details.aspx?view=1&id=%2031'
source = urllib2.urlopen(url)

start = '<p><strong>Given Name:</strong>'
end = '<strong>Surname'
givenName=(source.read().split(start))[1].split(end)[0]

start = 'Surname: </strong>'
end = 'Former/AKA Name'
surname=(source.read().split(start))[1].split(end)[0]

print(givenName)
print(surname)

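When I call source.read().split once, it works. But when I use it twice, it gives a "list index out of range" error.

Can someone suggest a solution?

4 answers:

Answer 0 (score: 5)

You can use BeautifulSoup to parse the HTML string.

Here is some code you might try.
It uses BeautifulSoup to get the text produced by the HTML, and then parses that string to extract the data.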

from bs4 import BeautifulSoup as bs

dic = {}
data = \
"""
    <p>
        <strong>Name:</strong> Pasan <br/>
        <strong>Surname: </strong> Wijesingher <br/>                    
        <strong>Former/AKA Name:</strong> No Former/AKA Name <br/>                    
        <strong>Gender:</strong> Male <br/>
        <strong>Language Fluency:</strong> ENGLISH <br/>                    
    </p>
"""

soup = bs(data, 'html.parser')  # name the parser explicitly
# Get the text on the html through BeautifulSoup
text = soup.get_text()

# parsing the text
lines = text.splitlines()
for line in lines:
    # check if line has ':', if it doesn't, move to the next line
    if ':' not in line:
        continue
    # split the string at the first ':' only, so values may contain ':'
    parts = line.split(':', 1)

    # stripping whitespace
    parts = [part.strip() for part in parts]
    # adding the values to a dictionary
    dic[parts[0]] = parts[1]
    # printing the data after processing
    print('%16s %20s' % (parts[0], parts[1]))

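Tip: If you are going to use BeautifulSoup to parse HTML, it helps if the markup carries attributes such as class=input or id=10; that is, all tags of the same type share the same id or class.

Update
As per your comment, see the code below. It applies the tip above, making life (and coding) easier.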

from bs4 import BeautifulSoup as bs

c_addr = []
id_addr = []
data = \
"""
<h2>Primary Location</h2>
<div class="address" id="10">
    <p>
       No. 4<br>
       Private Drive,<br>
       Sri Lanka&nbsp;ON&nbsp;&nbsp;K7L LK <br>
"""
soup = bs(data, 'html.parser')  # name the parser explicitly

for i in soup.find_all('div'):
    # get data using "class" attribute
    addr = ""
    if u'address' in (i.get("class") or []):  # "class" is a list, or None if absent
        text = i.get_text()
        for line in text.splitlines(): # line-wise
            line = line.strip() # remove whitespace
            addr += line # add to address string
        c_addr.append(addr)

    # get data using "id" attribute
    addr = ""
    if i.get("id") == u'10':  # attribute values are strings; None if absent
        text = i.get_text()
        # same processing as above
        for line in text.splitlines():
            line = line.strip()
            addr += line
        id_addr.append(addr)

print("id_addr")
print(id_addr)
print("c_addr")
print(c_addr)

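Answer 1 (score: 4)

You are calling read() twice. That is the problem. Instead, call read() once, store the data in a variable, and use that variable everywhere you were calling read(). Like this: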

fetched_data = source.read()

Then...

givenName=(fetched_data.split(start))[1].split(end)[0]

And...

surname=(fetched_data.split(start))[1].split(end)[0]

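That should work. Your code fails because the first call to read() consumes the entire response and leaves the read position at the end of the content. The next call to read() has nothing left to return, so it gives back an empty string; splitting that produces a one-element list, and indexing it with [1] raises the "list index out of range" error.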

See the urllib2 documentation and the documentation on the methods of file objects.
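
The behaviour described in this answer can be reproduced without a network connection. The following is a minimal sketch, assuming Python 3 and using io.BytesIO as a stand-in for the object returned by urlopen(); the helper name between is hypothetical and is not part of the original code.

```python
import io

# Stand-in for the response object (an assumption for illustration);
# like the real response, it can only be read forward.
source = io.BytesIO(
    b"<strong>Name:</strong> Pasan <br/>\n"
    b"<strong>Surname: </strong> Wijesingher <br/>\n"
)

first = source.read()    # consumes the whole stream
second = source.read()   # nothing left, so this is empty

# Splitting empty data yields a one-element list, so [1] would raise IndexError.
parts = second.split(b"Surname: </strong>")


def between(text, start, end):
    """Return the stripped text between start and end, or None if start is absent."""
    pieces = text.split(start, 1)
    if len(pieces) < 2:
        return None
    return pieces[1].split(end, 1)[0].strip()


# Read once, then reuse the stored text for every field.
html = first.decode("utf-8")
name = between(html, "Name:</strong>", "<br/>")
surname = between(html, "Surname: </strong>", "<br/>")
print(name, surname)
```

Returning None when the start marker is missing avoids the IndexError on pages where a field is absent, which matters when the same extraction is repeated over many pages.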

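Answer 2 (score: 1)

If you want something quick, regular expressions are more useful for this kind of task. The learning curve can be steep at first, but regular expressions will save your butt one day.

Try this code: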

import re

# read the whole document into memory (call read() only once)
full_source = source.read()

NAME_RE = re.compile(r'Name:.+?>(.*?)<')
SURNAME_RE = re.compile(r'Surname:.+?>(.*?)<')

# Flags belong in re.compile(); the second argument of a compiled
# pattern's search() is a start position, not a flag.
name = NAME_RE.search(full_source).group(1).strip()
surname = SURNAME_RE.search(full_source).group(1).strip()

For details on how to use regular expressions in Python, see the documentation for the re module.
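
As an illustration of the regular-expression approach, the sketch below pulls every label/value pair out of the sample markup in one pass. It assumes Python 3 and that every field follows the <strong>Label:</strong> value <br/> layout shown in the question; the names FIELD_RE and fields are hypothetical.

```python
import re

page = """
<p>
    <strong>Name:</strong> Pasan <br/>
    <strong>Surname: </strong> Wijesingher <br/>
    <strong>Former/AKA Name:</strong> No Former/AKA Name <br/>
    <strong>Gender:</strong> Male <br/>
    <strong>Language Fluency:</strong> ENGLISH <br/>
</p>
"""

# Group 1: the label inside <strong>, without the trailing colon.
# Group 2: the text between </strong> and the next <br ...>.
FIELD_RE = re.compile(
    r'<strong>\s*([^<:]+?)\s*:\s*</strong>\s*([^<]*?)\s*<br'
)

# findall returns (label, value) tuples, which dict() accepts directly.
fields = dict(FIELD_RE.findall(page))

for label, value in fields.items():
    print('%16s %20s' % (label, value))
```

This only holds while the markup stays this regular; if the page nests other tags inside a value, the pattern will skip that field.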

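A more comprehensive solution would involve parsing the HTML (using a library such as BeautifulSoup), but depending on your particular application that may be overkill.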
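
For comparison, here is a minimal sketch of the parsing approach that uses only the standard library's html.parser module in place of BeautifulSoup (a deliberate substitution, so the example has no third-party dependency). It assumes Python 3 and the <strong>Label:</strong> value <br/> layout from the question; the class name FieldParser is hypothetical.

```python
from html.parser import HTMLParser


class FieldParser(HTMLParser):
    """Collect pairs laid out as <strong>label:</strong> value <br/>."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._in_strong = False
        self._label = None  # label waiting for its value, if any

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self._in_strong = True
            self._label = ""

    def handle_endtag(self, tag):
        if tag == "strong":
            self._in_strong = False

    def handle_data(self, data):
        if self._in_strong:
            self._label += data
        elif self._label is not None and data.strip():
            # The first non-blank text after </strong> is taken as the value.
            key = self._label.strip().rstrip(":").strip()
            self.fields[key] = data.strip()
            self._label = None


page = """
<p>
    <strong>Name:</strong> Pasan <br/>
    <strong>Surname: </strong> Wijesingher <br/>
    <strong>Gender:</strong> Male <br/>
</p>
"""

parser = FieldParser()
parser.feed(page)
parser.close()
print(parser.fields)
```

Unlike the split and regex versions, this does not depend on the exact whitespace or attribute order in the markup, at the cost of a little more code.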

Answer 3 (score: 0)

You can use HTQL:

page="""
<p>
    <strong>Name:</strong> Pasan <br/>
    <strong>Surname: </strong> Wijesingher <br/>                    
    <strong>Former/AKA Name:</strong> No Former/AKA Name <br/>                    
    <strong>Gender:</strong> Male <br/>
    <strong>Language Fluency:</strong> ENGLISH <br/>                    
</p>
"""

import htql
print(htql.query(page, "<p>.<strong> {a=:tx; b=:xx} "))

# [('Name:', ' Pasan '), 
#  ('Surname: ', ' Wijesingher '), 
#  ('Former/AKA Name:', ' No Former/AKA Name '), 
#  ('Gender:', ' Male '), 
#  ('Language Fluency:', ' ENGLISH ')
# ]