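How to extract elements from an HTML script using BeautifulSoup

Asked: 2018-10-09 22:37:38

Tags: python html web-scraping beautifulsoup

I'm new to Python programming, and I'm using BeautifulSoup to do some web scraping of Chile's county-level electoral data. My problem is that I need to extract specific strings from a script. After some cleaning, I ended up with something like this: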

<script type="text/javascript">
    document.writeln("<p align='left' class='cleleccion2008'>");
    document.writeln("&nbsp;&nbsp;&nbsp;&nbsp;<a href='geografico.htm'>&laquo;&nbsp;&nbsp;VOLVER MEN&Uacute;<\/a><br>");
    document.writeln("<\/p>");
    document.writeln("<div class='mapTitle'>REGI&Oacute;N<\/div>");
    document.writeln("<p align='left' class='cleleccion2008'>");
    document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'101'+")' >Regi&oacute;n I<\/a><br>");
    document.writeln("<\/p>");
    document.writeln("<br>");
    document.writeln("<div class='mapTitle'>COMUNAS<\/div>");
    document.writeln("<p align='left' class='cleleccion2008'>"); 
    if ( parent.DIR_ANO >= "2004"){
        document.writeln("  &nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2307'+")' >Alto Hospicio<\/a> <br>");
    }
    document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2101'+")' >Arica<\/a><br>");
    document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2102'+")' >Camarones<\/a><br>");
    document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2303'+")' >Cami&ntilde;a<\/a><br>");
    document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2304'+")' >Colchane<\/a><br>");
    document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2202'+")' >General Lagos<\/a><br>");
    document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2302'+")' >Huara<\/a><br>");  
    document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2301'+")' >Iquique<\/a><br>");  
    document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2305'+")' >Pica<\/a><br>");
    document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2306'+")' >Pozo Almonte<\/a><br>");   
    document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2201'+")' >Putre<\/a><br>");                
    document.writeln("<\/p>"); 
    document.close();                                                               
}
</script>

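I want to extract the county names and codes from the last 12 lines of this script, to create something like the following:

Code, County
2101, Arica
2102, Camarones
...
2201, Putre

Any help would be greatly appreciated. Thanks for reading and for any answers.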

2 Answers:

Answer 0 (score: 0)

BeautifulSoup has no dedicated JS parser, but this can be handled easily with regex.

import re

text = r'''
<script type="text/javascript">
    document.writeln("<p align='left' class='cleleccion2008'>");
    document.writeln("&nbsp;&nbsp;&nbsp;&nbsp;<a href='geografico.htm'>&laquo;&nbsp;&nbsp;VOLVER MEN&Uacute;<\/a><br>");
document.writeln("<\/p>");
document.writeln("<div class='mapTitle'>REGI&Oacute;N<\/div>");
document.writeln("<p align='left' class='cleleccion2008'>");
document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'101'+")' >Regi&oacute;n I<\/a><br>");
document.writeln("<\/p>");
document.writeln("<br>");
document.writeln("<div class='mapTitle'>COMUNAS<\/div>");
document.writeln("<p align='left' class='cleleccion2008'>"); 
if ( parent.DIR_ANO >= "2004"){
    document.writeln("  &nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2307'+")' >Alto Hospicio<\/a> <br>");
}
document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2101'+")' >Arica<\/a><br>");
document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2102'+")' >Camarones<\/a><br>");
document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2303'+")' >Cami&ntilde;a<\/a><br>");
document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2304'+")' >Colchane<\/a><br>");
document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2202'+")' >General Lagos<\/a><br>");
document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2302'+")' >Huara<\/a><br>");  
document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2301'+")' >Iquique<\/a><br>");  
document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2305'+")' >Pica<\/a><br>");
document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2306'+")' >Pozo Almonte<\/a><br>");   
document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2201'+")' >Putre<\/a><br>");                
document.writeln("<\/p>"); 
document.close();                                                               
}
</script>
'''

# Codes sit between "+' and '+" inside each Consulta(...) call.
result_num = re.findall(r'"[+]\'(.*?)\'[+]"', text)
# Names sit between "' >" and the next "<"; drop empty matches.
result_county = [name for name in re.findall(r'\'[ ]>(.*?)<', text) if name != '']

# Skip the first two matches (Regi&oacute;n I and Alto Hospicio).
result_county = result_county[2:]
result_num = result_num[2:]

# Pair names and codes by position.
result = [county + num for county, num in zip(result_county, result_num)]

print(result)

Output

['Arica2101', 'Camarones2102', 'Cami&ntilde;a2303', 'Colchane2304', 'General Lagos2202', 'Huara2302', 'Iquique2301', 'Pica2305', 'Pozo Almonte2306', 'Putre2201']
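
The approach above pairs names and codes by list position, which only works while both lists stay the same length and order. As a hedged alternative, the sketch below captures the code and the name with a single pattern, so each pair comes from the same match. `SAMPLE`, `PAIR_RE`, and `extract_pairs` are names made up for this illustration, and the pattern assumes every link has the `Consulta("+'NNNN'+")' >Name<` shape shown in the question.

```python
import re

# Shortened sample of the script from the question (raw string keeps "\/" as-is).
SAMPLE = r'''
document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'101'+")' >Regi&oacute;n I<\/a><br>");
document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2101'+")' >Arica<\/a><br>");
document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2102'+")' >Camarones<\/a><br>");
'''

# One pattern, two groups: the digits inside Consulta("+'...'+") and the link text.
PAIR_RE = re.compile(r"""Consulta\("\+'(\d+)'\+"\)'\s*>([^<]+)<""")


def extract_pairs(script_text):
    """Return (code, name) tuples in document order."""
    return PAIR_RE.findall(script_text)


pairs = extract_pairs(SAMPLE)
print(pairs)
```

Rows that are not wanted (for example the region entry with code 101) can then be filtered on the code itself rather than by slicing off a fixed number of items.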

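Answer 1 (score: 0)

Jihan is partly right, in that BeautifulSoup has no explicit JavaScript parser. You may still need bs4 for the initial parsing. Regex can help with the string parsing, but I would apply compiled regexes line by line rather than run re.findall() over the whole text. Running re.findall() on everything at once can produce many false positives that then need cleaning up. If you apply the regex line by line, you can be more confident that you are grabbing the right data, and you can validate as you iterate. It also ends up making the code cleaner and the output more manageable.

Instead, you can explicitly extract the <script> tags from the page content and call str.splitlines() on the script tag you want. This splits the whole tag into a list of strings. You may want to split on the ; character, which marks the end of a JavaScript statement, so that this still works on annoyingly smashed-together "optimized" (obfuscated) JavaScript code.

At that point, you can use a compiled regex (or a simple re.search()) on each line. That way you know you are matching line by line. Here is the code.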
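
A rough sketch of the statement-splitting idea follows. A plain `split(';')` would also cut inside HTML entities such as `&nbsp;`, so this version splits only on a semicolon that directly follows a closing parenthesis. That boundary rule is an assumption that fits the `document.writeln(...)` calls in the question and is not a general JavaScript tokenizer. `MINIFIED` and `split_statements` are hypothetical names used only for illustration.

```python
import re

# Hypothetical minified input: several statements squeezed onto one line.
MINIFIED = r"""document.writeln("<p>");document.writeln("&nbsp;<a href='javascript:Consulta("+'2101'+")' >Arica<\/a><br>");document.writeln("&nbsp;<a href='javascript:Consulta("+'2102'+")' >Camarones<\/a><br>");document.close();"""


def split_statements(script_text):
    """Split on ';' only when it directly follows ')', dropping empty pieces."""
    parts = re.split(r'(?<=\));', script_text)
    return [part.strip() for part in parts if part.strip()]


statements = split_statements(MINIFIED)
for statement in statements:
    print(statement)
```

Each resulting statement can then be fed to the per-line regexes in the same way as the output of str.splitlines().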

import argparse
import bs4
import re
import requests


def parse_county_codes(soup_object):
    # Compile the patterns once, outside the loops.
    code_regex = re.compile(r'"[+]\'(.*?)\'[+]"')
    county_regex = re.compile(r'\'[ ]>(.*?)<')

    for tag in soup_object:
        # Work on the tag's text form, one source line at a time.
        lines = str(tag).splitlines()

        for line in lines:
            county = county_regex.search(line)
            code = code_regex.search(line)
            # Only report lines that contain both a name and a code.
            if county and code:
                print(county.group(1), ':', code.group(1))

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('-i', '--input-file', dest='in_file', help='Input html')
    parser.add_argument('-u', '--url', dest='url', help='Some url\'s content you want to parse')
    args = parser.parse_args()

    if args.in_file:
        with open(args.in_file) as f:
            html_string = f.read()
            soup = bs4.BeautifulSoup(html_string, 'html.parser')
    elif args.url:
        try:
            # Remember to handle any possible url handling exceptions
            response = requests.get(args.url)
        except Exception as e:
            print("The following exception occurred while requesting the url\n{0}".format(args.url))
            print(e)
            return

        soup = bs4.BeautifulSoup(response.content, 'html.parser')
    else:
        print("Input missing. Please provide -i or -u")
        return

    script_tags = soup.find_all('script')
    parse_county_codes(script_tags)

if __name__ == '__main__':
    main()

The output of this code is as follows:

Regi&oacute;n I : 101
Alto Hospicio : 2307
Arica : 2101
Camarones : 2102
Cami&ntilde;a : 2303
Colchane : 2304
General Lagos : 2202
Huara : 2302
Iquique : 2301
Pica : 2305
Pozo Almonte : 2306
Putre : 2201

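Note that some characters and escape sequences for special characters look out of place in the strings, but the regexes Jihan provided work in their current form. If you want to clean up the output, you will know best how to do that, so I'll leave it up to you. Also note that with regex your mileage may vary, and depending on the rest of the page content you may run into other issues.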
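
One possible cleanup, offered as a sketch rather than the only way to do it: the names in the output still carry HTML entities such as `&ntilde;`, which the standard library's `html.unescape()` can decode, and the `csv` module can produce the "code, county" layout the question asks for. The `rows` list repeats only a few pairs from the output above, and `clean_rows` and `to_csv` are hypothetical helper names.

```python
import csv
import html
import io

# A few (code, name) pairs taken from the output above; not the full list.
rows = [
    ("2101", "Arica"),
    ("2303", "Cami&ntilde;a"),
    ("2201", "Putre"),
]


def clean_rows(pairs):
    """Decode HTML entities in each name and trim surrounding whitespace."""
    return [(code, html.unescape(name).strip()) for code, name in pairs]


def to_csv(pairs):
    """Render the pairs as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["code", "county"])
    writer.writerows(pairs)
    return buffer.getvalue()


print(to_csv(clean_rows(rows)))
```

To write a file instead of a string, the same `csv.writer` calls work on a file opened with `newline=""` and an explicit encoding such as UTF-8.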