How to extract data from a JavaScript array using Beautiful Soup?

Asked: 2015-05-28 08:12:16

Tags: javascript python json beautifulsoup

The JavaScript file looks like this:

states_arr['Chittoor']= new Array(  "Kurnool (Abbas Nagar)# 9247001529             # H. No. 80-11/111, ; Beside ICICI Bank ATM, ;  Near Krishna Nagar Railway Gate, ; Abbas Nagar,  Kurnool.","Kurnool # 9247001530 # H. No. 46/694, Near Annapurna Hotel, Opp. Govt Hospital, Budawarpet, Kurnool. "  );

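I want to extract the address from every array in the js file, i.e. the text that starts after the second '#' symbol: "H. No. 80-11/111, ; Beside ICICI Bank ATM, ;  Near Krishna Nagar Railway Gate, ; Abbas Nagar,  Kurnool." and "H. No. 46/694, Near Annapurna Hotel, Opp. Govt Hospital, Budawarpet, Kurnool."

The complete JavaScript file is at: http://www.heteropharmacy.com/jScript/myScript.js

I am using BeautifulSoup, and this is my non-working code: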

soup = BeautifulSoup(html_doc)
script = soup.find_all("script")
pattern = re.compile(r" (?<=[0-9]\s#\s).+")
while pattern.search(script):
    line1 = pattern.search(script)
    print line1

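The extracted data then needs to be converted to JSON format.

2 answers:

Answer 0 (score: 0)

You can do this in plain Python.

Cleaning the file

Assume actual_file is the text of the js file you have just read in.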

# Split on the newline character and drop every line of 2 characters or
# fewer, since our addresses are much longer
lines_of_js = [i.strip() for i in actual_file.split("\n") if len(i) > 2]

# Now drop lines that contain JavaScript syntax characters and keep only the
# lines that have a `#` in them. You may want to revisit this part for further
# fine tuning: it also drops entries whose own text contains one of these
# characters, e.g. "Kurnool (Abbas Nagar)".
lines_with_address = [line for line in lines_of_js if
                            all([i not in line for i in '(<>={}'])
                            and
                            ('#' in line)
                            ]

lines_with_address is now a list of the lines that contain such addresses.

Split each line in this list on # and take the last item, which should be your address:

 In [94]: [line.split('#')[-1] for line in lines_with_address]
Out[94]: 
[' D.No. 5-9-24/66/1/a, Hill fort, ; Beside MLA quarters, ; Adarsh nagar, Hyderabad",',
 ' Plot No:23/A, Addagutta, ; Co-Opp.Housing Society Ltd. Opp. JNTU, ; HMT Hills Road, Kukatpally, Hyd",',
 ' Shop No.5, Plot No.86, ; Road No. 6, Vishnavy Recidency, ; Near AXIS Bank ATM, R.K.Puram,; Alkapuri Colony, Hyderabad.",',
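The same split can be taken one step further toward the JSON output the question asks for. What follows is only a minimal sketch, assuming every entry has the shape `name # phone # address`; the two sample entries are the ones quoted in the question, while the helper `parse_entry` and the field names `name`, `phone` and `address` are made up for illustration.

```python
import json

# The two entries quoted in the question, one per list item.
sample_lines = [
    '"Kurnool (Abbas Nagar)# 9247001529             # H. No. 80-11/111, ; '
    'Beside ICICI Bank ATM, ;  Near Krishna Nagar Railway Gate, ; '
    'Abbas Nagar,  Kurnool.",',
    '"Kurnool # 9247001530 # H. No. 46/694, Near Annapurna Hotel, '
    'Opp. Govt Hospital, Budawarpet, Kurnool. "',
]


def parse_entry(line):
    """Split one 'name # phone # address' entry into a dict (hypothetical helper)."""
    # Remove surrounding whitespace, a trailing comma and the enclosing quotes.
    cleaned = line.strip().rstrip(',').strip('"')
    # Split on the first two '#' only, so a '#' inside the address would survive.
    name, phone, address = [part.strip() for part in cleaned.split('#', 2)]
    return {"name": name, "phone": phone, "address": address}


records = [parse_entry(line) for line in sample_lines]
as_json = json.dumps(records, indent=2)
print(as_json)
```

Limiting the split to two separators keeps the whole address in one piece; an entry with fewer than two `#` characters would raise a ValueError here, which may or may not be what you want for the real file.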

Answer 1 (score: 0)

Tested in Python 2.7.

You don't need bs4. Just use urllib2 (or its Py3k equivalent) to read the source.

import re
import urllib2

dta = urllib2.urlopen('http://www.heteropharmacy.com/jScript/myScript.js').read()
final = [i[2:].replace('",', '').strip() for i in re.findall(r'# (?:[a-zA-Z]).+', dta)]

Sample output (list):

'H.No : 3-5-60/C-12 ; Opp. Andhra bank, ; Vivekananda Nagar Colony, ; Kukatpally, Hyderabad', "Flat No.G-6, Bhavya's Srinivasam, ; Opp. Sanghamitra School, ; Nizampet Road, Hyderabad", 'H.No. 8-2-603/B/28, ; Opp. Hyderabad kababs, ; Road No-10, Banjara hills, Hyderabad', 'H. No.1-55/C/9&10, Shop No. 9 & 10, ; Raghava Towers, Main Road, ; Madinaguda', 'beside Cyberabad Police Commisionarate Office, ; Telecom Nagar, Gachibowli, ; Hyderabad"'
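For Python 3, a rough equivalent might look like the sketch below. It reuses the regular expression from this answer unchanged; the function names `extract_addresses` and `fetch_js`, the UTF-8 decoding and the sample text are assumptions for illustration. Because `.+` runs to the end of the line, the pattern only works when each entry sits on its own line.

```python
import re
from urllib.request import urlopen

JS_URL = 'http://www.heteropharmacy.com/jScript/myScript.js'


def extract_addresses(js_text):
    """Return the text after each '# ' that is followed by a letter."""
    # Same pattern as the Python 2.7 answer: the phone number field starts
    # with a digit, so only the address field matches.
    matches = re.findall(r'# (?:[a-zA-Z]).+', js_text)
    return [m[2:].replace('",', '').strip() for m in matches]


def fetch_js(url=JS_URL):
    """Download the file and decode it to str (the encoding is an assumption)."""
    with urlopen(url) as response:
        return response.read().decode('utf-8', errors='replace')


# Hypothetical two-line sample in the same shape as the real file.
sample = (
    '"Kurnool # 9247001530 # H. No. 46/694, Near Annapurna Hotel, Kurnool. ",\n'
    '"Hyderabad # 9247001531 # Plot No:23/A, Addagutta, Hyd",\n'
)
addresses = extract_addresses(sample)
print(addresses)
```

To run it against the live file, call `extract_addresses(fetch_js())`. The decode step matters in Python 3 because `urlopen(...).read()` returns bytes, and a str pattern cannot be applied to bytes.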