BeautifulSoup: find content inside "([{}]}"

Time: 2016-03-20 05:22:50

Tags: python html beautifulsoup web-crawler

This is my HTML file.

Using code I learned from this question:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<td id="cur_o3" class="tdcur" style="font-weight:bold;font-size:11px;" align="center">2</td>
</script><script type="text/javascript">
try { 
if (isMapOpened == "open") {
mapInitWithData([{"aqi":"294","city":"D\u014dngru\u01cen, Shenyang","x":1249,"g":["41.7089","123.439"]},{"aqi":"263","city":"Liaoyang","extra":1,"x":4347,"g":["41.267244","123.236944"]},{"aqi":"263","city":"Ch\u00e9nli\u00e1ox\u012b l\u00f9, Shenyang","x":8755,"g":["41.7347","123.2444"]},{"aqi":"255","city":"Tieling","extra":1,"x":4346,"g":["42.22297","123.726163"]},{"aqi":"249","city":"h\u00fan n\u00e1n d\u014dng l\u00f9, Shenyang , Shenyang","x":5218,"g":["41.7561","123.535"]},{"aqi":"238","city":"Shenyang US Consulate","lvl":1,"x":496,"g":["41.7832349","123.4267266"]},{"aqi":"238","city":"Xiaoheyan, Shenyang","x":1254,"g":["41.7775","123.478"]},{"aqi":"219","city":"Liaoning University, Shenyang","x":1257,"g":["41.9228","123.3783"]},{"aqi":"193","city":"wenhua street, Shenyang , Shenyang","x":5215,"g":["41.765","123.41"]},{"aqi":"191","city":"Shenyang","x":1473,"g":["41.805698","123.431475"]},{"aqi":"191","city":"Taiyuan Street, Shenyang","x":1255,"g":["41.7972","123.3997"]},{"aqi":"189","city":"Shenfu new town, Fushun","x":4355,"g":["41.8417","123.7117"]},{"aqi":"188","city":"Wanghua district, Fushun , Fushun","extra":1,"x":5240,"g":["41.8469","123.8100"]},{"aqi":"188","city":"Fushun","extra":1,"x":1476,"g":["41.880872","123.957208"]},{"aqi":"188","city":"j\u012bnsh\u0101 ji\u0101ng l\u00f9 b\u011bi, Tieling , Tieling","extra":1,"x":5203,"g":["42.2217","123.7153"]},{"aqi":"182","city":"Tanglin Road , Shenyang , Shenyang","x":5216,"g":["41.8336","123.542"]},{"aqi":"179","city":"Caitun, Benxi","extra":1,"x":4364,"g":["41.3047","123.7308"]},{"aqi":"176","city":"Xihu, Benxi","extra":1,"x":4365,"g":["41.3369","123.7528"]},{"aqi":"172","city":"Xinfu district, Fushun , Fushun","extra":1,"x":5237,"g":["41.8594","123.9000"]},{"aqi":"170","city":"Weining, Benxi","extra":1,"x":4361,"g":["41.3472","123.8142"]},{"aqi":"162","city":"Shuncheng district, Fushun , Fushun","extra":1,"x":5239,"g":["41.883375","123.94504"]},{"aqi":"161","city":"y\u00f9n\u00f3ng l\u00f9, 
Shenyang","x":8756,"g":["41.9086","123.5953"]},{"aqi":"151","city":"Dongzhou district, Fushun , Fushun","extra":1,"x":5238,"g":["41.8625","124.0383"]},{"aqi":"122","city":"Dahuofang reservoir, Fushun , Fushun","extra":1,"x":5236,"g":["41.8864","124.0878"]}]/* 24 points -> 24 points */); 

"""  

I can get the value inside the <td> with soup = BeautifulSoup(html_doc) and then soup.find("td", id="cur_o3", class_="tdcur").get_text().

More importantly, I want to get every "city" and "g" from the <script> part (soup.script):

  • city: the name of the area
  • g: ["41.7089","123.439"], the coordinates (latitude and longitude)

How can I achieve this? Any help would be appreciated!

2 answers:

Answer 0: (score: 1)

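Unfortunately, you have to do it the hard way, including the manual parsing that BeautifulSoup normally keeps away from you. In your case, though, it is fairly simple:

  • Use BeautifulSoup to get the inner text of the <script> tag.
  • Find the position of mapInitWithData( in that string.
  • Find the position of ]}].
  • Cut out everything after the first marker, up to and including the second marker.
  • Parse the JSON with json.loads().
  • Take whatever you need from the resulting list of dictionaries.

Sounds ugly? Not really. Web scraping is always heuristic: whether you rely on the structure of the HTML document or on the structure of the JavaScript code makes little difference. When the site owner decides to change the site, you have to redo it anyway.

Code, for the lulz: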

from bs4 import BeautifulSoup
import json

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<td id="cur_o3" class="tdcur" style="font-weight:bold;font-size:11px;" align="center">2</td>
<script type="text/javascript">
try {
if (isMapOpened == "open") {
mapInitWithData([...]}]/* 24 points -> 24 points */);
}}
</script>
"""

soup = BeautifulSoup(html_doc, "html.parser")
# you would usually wrap this in `try`, but for the moment we let it raise
js = soup.find("script").get_text()
assert len(js) > 0
# markers for the start and the end of the JSON
from_ = "mapInitWithData("
to_ = "]}]"
index_from = js.find(from_)
assert index_from != -1  # str.find() returns -1 when the marker is missing
index_to = js.find(to_, index_from)  # only look for the end marker after the start
assert index_to != -1
j = js[index_from + len(from_):index_to + len(to_)]
data = json.loads(j)
for row in data:
    print(row["city"], ":", [float(c) for c in row["g"]])  # <g>
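
A possible variation on the slicing above, shown only as a sketch: json.JSONDecoder.raw_decode() parses a single JSON value starting at a given index and ignores whatever follows, so no end marker such as ]}] is needed. The script text below is a shortened, hypothetical stand-in for the real <script> content, the helper name extract_call_argument is made up for illustration, and the sketch assumes the JSON starts immediately after mapInitWithData( with no whitespace in between.

```python
import json

# Shortened, hypothetical stand-in for the text of the <script> tag.
js = '''
try {
if (isMapOpened == "open") {
mapInitWithData([{"aqi":"294","city":"Liaoyang","x":4347,"g":["41.267244","123.236944"]},{"aqi":"255","city":"Tieling","extra":1,"x":4346,"g":["42.22297","123.726163"]}]/* 2 points -> 2 points */);
}}
'''


def extract_call_argument(script_text, func_name="mapInitWithData"):
    """Return the JSON value passed as the first argument of func_name(...)."""
    marker = func_name + "("
    start = script_text.find(marker)
    if start == -1:
        raise ValueError("call to %s not found" % func_name)
    # raw_decode parses one JSON value beginning at the given index and
    # returns it together with the index where parsing stopped; the JS
    # comment and ");" that follow are simply never looked at.
    value, _end = json.JSONDecoder().raw_decode(script_text, start + len(marker))
    return value


data = extract_call_argument(js)
for row in data:
    print(row["city"], ":", [float(c) for c in row["g"]])
```

This is still a heuristic: if the site renames the function or stops passing a JSON literal, the lookup raises and has to be adjusted.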

Answer 1: (score: 1)

You can use a regular expression to extract the data:

from bs4 import BeautifulSoup
import re
import json

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<td id="cur_o3" class="tdcur" style="font-weight:bold;font-size:11px;" align="center">2</td>
</script><script type="text/javascript">
try { 
if (isMapOpened == "open") {
mapInitWithData([{"aqi":"294","city":"D\u014dngru\u01cen, Shenyang","x":1249,"g":["41.7089","123.439"]},{"aqi":"263","city":"Liaoyang","extra":1,"x":4347,"g":["41.267244","123.236944"]},{"aqi":"263","city":"Ch\u00e9nli\u00e1ox\u012b l\u00f9, Shenyang","x":8755,"g":["41.7347","123.2444"]},{"aqi":"255","city":"Tieling","extra":1,"x":4346,"g":["42.22297","123.726163"]},{"aqi":"249","city":"h\u00fan n\u00e1n d\u014dng l\u00f9, Shenyang , Shenyang","x":5218,"g":["41.7561","123.535"]},{"aqi":"238","city":"Shenyang US Consulate","lvl":1,"x":496,"g":["41.7832349","123.4267266"]},{"aqi":"238","city":"Xiaoheyan, Shenyang","x":1254,"g":["41.7775","123.478"]},{"aqi":"219","city":"Liaoning University, Shenyang","x":1257,"g":["41.9228","123.3783"]},{"aqi":"193","city":"wenhua street, Shenyang , Shenyang","x":5215,"g":["41.765","123.41"]},{"aqi":"191","city":"Shenyang","x":1473,"g":["41.805698","123.431475"]},{"aqi":"191","city":"Taiyuan Street, Shenyang","x":1255,"g":["41.7972","123.3997"]},{"aqi":"189","city":"Shenfu new town, Fushun","x":4355,"g":["41.8417","123.7117"]},{"aqi":"188","city":"Wanghua district, Fushun , Fushun","extra":1,"x":5240,"g":["41.8469","123.8100"]},{"aqi":"188","city":"Fushun","extra":1,"x":1476,"g":["41.880872","123.957208"]},{"aqi":"188","city":"j\u012bnsh\u0101 ji\u0101ng l\u00f9 b\u011bi, Tieling , Tieling","extra":1,"x":5203,"g":["42.2217","123.7153"]},{"aqi":"182","city":"Tanglin Road , Shenyang , Shenyang","x":5216,"g":["41.8336","123.542"]},{"aqi":"179","city":"Caitun, Benxi","extra":1,"x":4364,"g":["41.3047","123.7308"]},{"aqi":"176","city":"Xihu, Benxi","extra":1,"x":4365,"g":["41.3369","123.7528"]},{"aqi":"172","city":"Xinfu district, Fushun , Fushun","extra":1,"x":5237,"g":["41.8594","123.9000"]},{"aqi":"170","city":"Weining, Benxi","extra":1,"x":4361,"g":["41.3472","123.8142"]},{"aqi":"162","city":"Shuncheng district, Fushun , Fushun","extra":1,"x":5239,"g":["41.883375","123.94504"]},{"aqi":"161","city":"y\u00f9n\u00f3ng l\u00f9, 
Shenyang","x":8756,"g":["41.9086","123.5953"]},{"aqi":"151","city":"Dongzhou district, Fushun , Fushun","extra":1,"x":5238,"g":["41.8625","124.0383"]},{"aqi":"122","city":"Dahuofang reservoir, Fushun , Fushun","extra":1,"x":5236,"g":["41.8864","124.0878"]}]/* 24 points -> 24 points */); 

"""  

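soup = BeautifulSoup(html_doc, 'lxml')
script = soup.script.get_text()
# raw string for the pattern; DOTALL lets `.` cross line breaks in the script
map_search = re.search(r'mapInitWithData\((.*?)/\*', script, re.DOTALL)
mapData = map_search.group(1)
# strict=False tolerates the raw line break inside one city name in the sample above
mapDataObj = json.loads(mapData, strict=False)
for row in mapDataObj:
    print(row["city"])
    print(row["g"])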
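
Since the question asks for every city with its coordinates, the parsed list can be turned into a lookup table. The following is a minimal sketch under stated assumptions: the script text is a shortened, hypothetical stand-in for soup.script.get_text(), the names stations and coords are illustrative, and the two entries in "g" are taken to be latitude and longitude in that order.

```python
import json
import re

# Shortened, hypothetical stand-in for soup.script.get_text().
script = '''
try {
if (isMapOpened == "open") {
mapInitWithData([{"aqi":"263","city":"Liaoyang","extra":1,"x":4347,"g":["41.267244","123.236944"]},{"aqi":"191","city":"Shenyang","x":1473,"g":["41.805698","123.431475"]}]/* 2 points -> 2 points */);
'''

# Non-greedy capture of everything between "mapInitWithData(" and the
# first "/*"; DOTALL allows the match to span several lines.
match = re.search(r"mapInitWithData\((.*?)/\*", script, re.DOTALL)
if match is None:
    raise ValueError("mapInitWithData(...) call not found")

stations = json.loads(match.group(1))

# Map each city name to a (latitude, longitude) tuple of floats.
coords = {row["city"]: tuple(float(c) for c in row["g"]) for row in stations}

for city, (lat, lon) in coords.items():
    print(city, lat, lon)
```

A dictionary keeps only one entry per city name, so if two stations ever shared a name, a list of (city, lat, lon) tuples would be the safer container.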