The JavaScript file looks like this:
states_arr['Chittoor']= new Array( "Kurnool (Abbas Nagar)# 9247001529 # H. No. 80-11/111, ; Beside ICICI Bank ATM, ; Near Krishna Nagar Railway Gate, ; Abbas Nagar, Kurnool.","Kurnool # 9247001530 # H. No. 46/694, Near Annapurna Hotel, Opp. Govt Hospital, Budawarpet, Kurnool. " );
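I want to extract the address from every array in the JS file, i.e. everything after the second '#' symbol. For the line above that would be "H. No. 80-11/111, ; Beside ICICI Bank ATM, ; Near Krishna Nagar Railway Gate, ; Abbas Nagar, Kurnool." and "H. No. 46/694, Near Annapurna Hotel, Opp. Govt Hospital, Budawarpet, Kurnool."
The complete JavaScript file is at: http://www.heteropharmacy.com/jScript/myScript.js
I am using BeautifulSoup, and this is my (non-working) code: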
soup = BeautifulSoup(html_doc)
script = soup.find_all("script")
pattern = re.compile(r" (?<=[0-9]\s#\s).+")
while pattern.search(script):
    line1 = pattern.search(script)
    print line1
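After that, the result needs to be converted to JSON format.
Answer 0: (score: 0)
You can do this in plain Python.
First clean the file, assuming actual_file
holds the contents of the JS file you just opened: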
# Split on newlines and drop lines of 2 characters or fewer,
# since our addresses are much longer
lines_of_js = [i.strip() for i in actual_file.split("\n") if len(i) > 2]
# Now drop lines that contain JavaScript syntax characters and keep lines that
# contain `#`. You may want to revisit this part for further fine tuning.
lines_with_address = [line for line in lines_of_js
                      if all(ch not in line for ch in '(<>={}')
                      and '#' in line]
lines_with_address
is now a list of lines that contain such addresses.
Split each line in this variable on #
and take the last item, which should be your address:
In [94]: [line.split('#')[-1] for line in lines_with_address]
Out[94]:
[' D.No. 5-9-24/66/1/a, Hill fort, ; Beside MLA quarters, ; Adarsh nagar, Hyderabad",',
' Plot No:23/A, Addagutta, ; Co-Opp.Housing Society Ltd. Opp. JNTU, ; HMT Hills Road, Kukatpally, Hyd",',
' Shop No.5, Plot No.86, ; Road No. 6, Vishnavy Recidency, ; Near AXIS Bank ATM, R.K.Puram,; Alkapuri Colony, Hyderabad.",',
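The filter above skips any line that contains `(` or `=`, which includes the sample line from the question, and it leaves a trailing `",` on each result. A minimal alternative sketch, assuming every entry is a double-quoted string with no escaped quotes inside it, is to pull out the quoted strings first and split each one on `#` at most twice. The names `sample_js` and `extract_addresses` are made up for this illustration:

```python
import re

# Sample line copied from the question; a real run would use the whole file's text.
sample_js = (
    "states_arr['Chittoor']= new Array( "
    '"Kurnool (Abbas Nagar)# 9247001529 # H. No. 80-11/111, ; Beside ICICI Bank ATM, ; '
    'Near Krishna Nagar Railway Gate, ; Abbas Nagar, Kurnool.",'
    '"Kurnool # 9247001530 # H. No. 46/694, Near Annapurna Hotel, Opp. Govt Hospital, '
    'Budawarpet, Kurnool. " );'
)


def extract_addresses(js_text):
    """Return the text after the second '#' of every double-quoted string."""
    addresses = []
    # Assumption: entries never contain an escaped double quote.
    for entry in re.findall(r'"([^"]*)"', js_text):
        # maxsplit=2 yields at most three pieces: name, phone, address.
        parts = entry.split('#', 2)
        if len(parts) == 3:
            addresses.append(parts[2].strip())
    return addresses


addresses = extract_addresses(sample_js)
for address in addresses:
    print(address)
```

Limiting the split to two cuts keeps any later `#` inside the address itself, and quoted strings with fewer than two `#` characters are ignored.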
Answer 1: (score: 0)
Tested in Python 2.7.
You don't need bs4. Just use urllib2 (or the Py3k equivalent) to read the source.
import re
import urllib2
dta = urllib2.urlopen('http://www.heteropharmacy.com/jScript/myScript.js').read()
final = [i[2:].replace('",', '').strip() for i in re.findall(r'# (?:[a-zA-Z]).+', dta)]
Sample output (a list):
'H.No : 3-5-60/C-12 ; Opp. Andhra bank, ; Vivekananda Nagar Colony, ; Kukatpally, Hyderabad',
"Flat No.G-6, Bhavya's Srinivasam, ; Opp. Sanghamitra School, ; Nizampet Road, Hyderabad",
'H.No. 8-2-603/B/28, ; Opp. Hyderabad kababs, ; Road No-10, Banjara hills, Hyderabad',
'H. No.1-55/C/9&10, Shop No. 9 & 10, ; Raghava Towers, Main Road, ; Madinaguda',
'beside Cyberabad Police Commisionarate Office, ; Telecom Nagar, Gachibowli, ; Hyderabad"'
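The question also asks for JSON output, which neither answer covers. The following is a rough Python 3 sketch, not a tested solution for the real file: it assumes every assignment has the form `states_arr['Key'] = new Array( ... );` as in the sample line, and that no entry contains `);` or an escaped double quote. The helper names `read_js` and `addresses_by_key` are invented for this example, and `read_js` is defined but not called here because it needs network access.

```python
import json
import re
import urllib.request


def read_js(url):
    """Download the script text (Python 3 counterpart of urllib2.urlopen)."""
    with urllib.request.urlopen(url) as response:
        # Assumption: the file is UTF-8 compatible; bad bytes are replaced.
        return response.read().decode('utf-8', errors='replace')


def addresses_by_key(js_text):
    """Map each states_arr key to the addresses in its Array(...) call."""
    result = {}
    assignment = re.compile(
        r"states_arr\['([^']+)'\]\s*=\s*new Array\((.*?)\);", re.S)
    for key, body in assignment.findall(js_text):
        entries = re.findall(r'"([^"]*)"', body)
        # Keep only entries with at least two '#', then take what follows the second.
        result[key] = [e.split('#', 2)[2].strip()
                       for e in entries if e.count('#') >= 2]
    return result


# Sample line copied from the question.
sample = (
    "states_arr['Chittoor']= new Array( "
    '"Kurnool (Abbas Nagar)# 9247001529 # H. No. 80-11/111, ; Beside ICICI Bank ATM, ; '
    'Near Krishna Nagar Railway Gate, ; Abbas Nagar, Kurnool.",'
    '"Kurnool # 9247001530 # H. No. 46/694, Near Annapurna Hotel, Opp. Govt Hospital, '
    'Budawarpet, Kurnool. " );'
)

as_json = json.dumps(addresses_by_key(sample), indent=2)
print(as_json)
```

For the real file, `addresses_by_key(read_js(url))` would be the intended call, with `url` set to the script address from the question.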