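I am trying to scrape the prayer times from the www.hujjat.org website.
This is the part of the HTML I am interested in (as you may notice, the class attribute is the same for all 4 prayers):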
<table width="100%">
<tbody>
<tr>
<td class="NamaazTimes">
<div class="NamaazTimeName">Fajr</div>
<div class="NamaazTime">04:42</div>
</td>
<td class="NamaazTimes">
<div class="NamaazTimeName">Sunrise</div>
<div class="NamaazTime">06:32</div>
</td>
<td class="NamaazTimes">
<div class="NamaazTimeName">Zohr</div>
<div class="NamaazTime">13:02</div>
</td>
<td class="NamaazTimes">
<div class="NamaazTimeName">Maghrib</div>
<div class="NamaazTime">19:33</div>
</td>
</tr>
</tbody>
</table>
So far I have written the following code:
# import libraries
import json
import urllib2
from bs4 import BeautifulSoup
# specify the url
quote_page = 'http://www.hujjat.org/'
# query the website and return the html to the variable 'page'
page = urllib2.urlopen(quote_page)
# parse the html using Beautiful Soup and store in variable 'soup'
soup = BeautifulSoup(page, 'html.parser')
table = soup.find("div",class_="NamaazTimeName", text="Fajr").find_previous("table")
for row in table.find_all("tr"):
    a = row.find_all("td")
    # print(row.find_all("td"))
    print(a)
My result is:
[<td class="NamaazTimes">\n<div class="NamaazTimeName">Fajr</div>\n<div class="NamaazTime">04:42</div>\n</td>, <td class="NamaazTimes">\n<div class="NamaazTimeName">Sunrise</div>\n<div class="NamaazTime">06:32</div>\n</td>, <td class="NamaazTimes">\n<div class="NamaazTimeName">Zohr</div>\n<div class="NamaazTime">13:02</div>\n</td>, <td class="NamaazTimes">\n<div class="NamaazTimeName">Maghrib</div>\n<div class="NamaazTime">19:33</div>\n</td>]
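All I want from the code is the time of each prayer. For example, for the "Fajr" prayer the output should be "04:42". I then want to save this "04:42" to a text file.
Can anyone help me?
Thanks.
Answer 0 (score: 1)
This works: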
from bs4 import BeautifulSoup
import requests
url = 'https://www.hujjat.org/'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
path = 'C:/Users/John/Documents/Python/'
namaazNames = soup.select('div.NamaazTimeName')
namaazNames = [namaazName.text for namaazName in namaazNames]
namaazTimes = soup.select('div.NamaazTime')
namaazTimes = [namaazTime.text for namaazTime in namaazTimes]
# drop index 1 (Sunrise) from both lists
del namaazNames[1]
del namaazTimes[1]
# write each time to its own file, named after the prayer
for namaazName, namaazTime in zip(namaazNames, namaazTimes):
    with open(path + namaazName + '.txt', 'w') as file:
        file.write(namaazTime)
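As a variation on the code above, here is a minimal sketch that pairs each name with its time inside the same td cell, so nothing depends on list positions. It runs against a copy of the HTML from the question rather than the live site, whose markup may differ. The names SAMPLE_HTML, extract_times and the output file fajr.txt are made up for illustration.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Markup copied from the question; the live page may look different.
SAMPLE_HTML = """
<table width="100%"><tbody><tr>
<td class="NamaazTimes">
<div class="NamaazTimeName">Fajr</div>
<div class="NamaazTime">04:42</div>
</td>
<td class="NamaazTimes">
<div class="NamaazTimeName">Sunrise</div>
<div class="NamaazTime">06:32</div>
</td>
<td class="NamaazTimes">
<div class="NamaazTimeName">Zohr</div>
<div class="NamaazTime">13:02</div>
</td>
<td class="NamaazTimes">
<div class="NamaazTimeName">Maghrib</div>
<div class="NamaazTime">19:33</div>
</td>
</tr></tbody></table>
"""


def extract_times(html):
    """Return a dict mapping each name to its time, read cell by cell."""
    soup = BeautifulSoup(html, "html.parser")
    times = {}
    for cell in soup.select("td.NamaazTimes"):
        name = cell.select_one("div.NamaazTimeName")
        time = cell.select_one("div.NamaazTime")
        # Skip cells that do not contain both divs.
        if name is not None and time is not None:
            times[name.get_text(strip=True)] = time.get_text(strip=True)
    return times


times = extract_times(SAMPLE_HTML)
print(times["Fajr"])  # 04:42

# Hypothetical output location: a file in the system temp directory.
out_path = os.path.join(tempfile.gettempdir(), "fajr.txt")
with open(out_path, "w") as fp:
    fp.write(times["Fajr"] + "\n")
```

Looking a prayer up by name means the Sunrise entry can simply be ignored instead of deleted by index.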
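Answer 1 (score: 0)
I suggest using select instead of find, so that your queries look more like the CSS selectors a browser uses. That way you can collect all the inner texts into one list and work from there.
Something like this should help: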
# import libraries
import json
import urllib2
from bs4 import BeautifulSoup
# specify the url
quote_page = 'http://www.hujjat.org/'
# query the website and return the html to the variable 'page'
page = urllib2.urlopen(quote_page)
# parse the html using Beautiful Soup and store in variable 'soup'
soup = BeautifulSoup(page, 'html.parser')
table = soup.find("div",class_="NamaazTimeName", text="Fajr").find_previous("table")
texts = [x.text for x in table.select("td.NamaazTimes div")]
only_times = [texts[x+1] for x in range(0, len(texts), 2)]
# we'll open the file in a with block, so we don't need to close it
with open("foo.txt", "w") as fp:
    # you'll need to iterate over each string
    for row in only_times:
        fp.write(row + "\n")
EDIT(2): Reworded my comment in the code. EDIT(3): Did some code cleanup and changed it to store only the times.
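If only the times are needed, a narrower CSS selector can pick them out directly, without taking every second entry from a mixed list. This is a small sketch on a trimmed copy of the HTML from the question, not on the live page. The variable name html is made up for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (two cells only).
html = """
<table><tbody><tr>
<td class="NamaazTimes">
<div class="NamaazTimeName">Fajr</div>
<div class="NamaazTime">04:42</div>
</td>
<td class="NamaazTimes">
<div class="NamaazTimeName">Zohr</div>
<div class="NamaazTime">13:02</div>
</td>
</tr></tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

# "div.NamaazTime" matches the class token exactly, so the
# "NamaazTimeName" divs are not selected.
only_times = [
    div.get_text(strip=True)
    for div in soup.select("td.NamaazTimes div.NamaazTime")
]
print(only_times)  # ['04:42', '13:02']
```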
Answer 2 (score: 0)
from bs4 import BeautifulSoup
import pandas as pd
# html_data is a placeholder: it is assumed to hold the page's HTML as a string
data = BeautifulSoup(html_data, 'html.parser')
NamaazName = data.find_all('div', {'class': 'NamaazTimeName'})
NamaazTime = data.find_all('div', {'class': 'NamaazTime'})
# map each name to its time
coll = {}
for i in range(len(NamaazName)):
    coll[NamaazName[i].text] = NamaazTime[i].text
# build a two-column DataFrame from the mapping
master_data = pd.DataFrame()
master_data['NamaazName'] = list(coll.keys())
master_data['NamaazTime'] = list(coll.values())
print(master_data)
Output
  NamaazName NamaazTime
0       Fajr      04:42
1    Sunrise      06:32
2       Zohr      13:02
3    Maghrib      19:33