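I am currently pulling the weather forecast from the TFL API. Once I extract the JSON for "today's forecast", random symbols appear in the middle of the paragraph. I think this may be formatting coming from the API.
This is what gets extracted: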
Bank holiday Monday will stay dry with some long sunny spells. Temperatures will remain warm for the time of year.<br/><br/>PM2.5 particle pollution increased rapidly overnight. Increases began across Essex and spread across south London. Initial chemical analysis suggests that this is composed mainly of wood burning particles but also with some additional particle pollution from agriculture and traffic. This would be consistent with an air flow from the continent where large bonfires are part of the Easter tradition. This will combine with our local emissions today and 'high' PM2.5 is possible.<br/><br/>The sunny periods, high temperatures and east winds will bring additional ozone precursors allowing for photo-chemical generation of ozone to take place. Therefore 'moderate' ozone is likely.<br/><br/>Air pollution should remain 'Low' through the forecast period for the following pollutants:<br/><br/>Nitrogen Dioxide<br/>Sulphur Dioxide.
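The paragraph is more detailed than necessary; the first two sentences are all I need. I thought .split would be a good idea, running it through a for loop until it reaches the string "<br/><br/>PM2.5".
However, I am not sure whether the same string will be used every day, or whether the simplified forecast will always be just the first two sentences.
Does anyone have any ideas on how I could solve this?
For reference, this is the code I have so far; there is nothing else to it yet.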
import urllib.parse
import requests
main_api = "https://api.tfl.gov.uk/AirQuality?"
idno = "1"
url = main_api + urllib.parse.urlencode({"$id": idno})
json_data = requests.get(main_api).json()
disclaimer = json_data['disclaimerText']
print("Disclaimer: " + disclaimer)
print()
today_weather = json_data['currentForecast'][0]['forecastText']
print("Today's forecast: " + today_weather.replace("<br/><br/>"," "))
Answer 0 (score: 1)
I believe that if you clean out the HTML tags and then tokenize the paragraph with NLTK's sentence tokenizer, you should be good to go.
from nltk.tokenize import sent_tokenize
import urllib.parse
import requests
import re
main_api = "https://api.tfl.gov.uk/AirQuality?"
idno = "1"
url = main_api + urllib.parse.urlencode({"$id": idno})
json_data = requests.get(main_api).json()
disclaimer = json_data['disclaimerText']
print("Disclaimer: " + disclaimer)
print()
# Replace HTML tags with a space so the sentences on either side of a tag stay separated
today_weather_str = re.sub(r'<.*?>', ' ', json_data['currentForecast'][0]['forecastText'])
# Get the first two sentences out of the list
today_weather = ' '.join(sent_tokenize(today_weather_str)[:2])
print("Today's forecast: {}".format(today_weather))
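If installing NLTK (and its punkt tokenizer data) is not an option, the same clean-then-split idea can be sketched with the standard library only. This is a rough sketch under assumptions: `first_sentences` is a hypothetical helper, the sample string is a shortened copy of the forecast from the question, and the regex splitter is a naive stand-in for `sent_tokenize` that will mis-split abbreviations such as "e.g.".

```python
import html
import re

def first_sentences(forecast_text, count=2):
    """Return the first `count` sentences of an HTML-flavoured forecast string."""
    # Decode entities such as &lt;br/&gt; in case the API returns them escaped.
    decoded = html.unescape(forecast_text)
    # Replace tags with a space so sentences on either side stay separated.
    plain = re.sub(r'<[^>]*>', ' ', decoded)
    # Collapse the runs of whitespace left behind by the removed tags.
    plain = re.sub(r'\s+', ' ', plain).strip()
    # Naive split: end-of-sentence punctuation followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', plain)
    return ' '.join(sentences[:count])

sample = ("Bank holiday Monday will stay dry with some long sunny spells. "
          "Temperatures will remain warm for the time of year.<br/><br/>"
          "PM2.5 particle pollution increased rapidly overnight.")
print(first_sentences(sample))
```

Because "PM2.5" has no whitespace after its dot, the naive splitter leaves it intact; a real tokenizer is still the safer choice for arbitrary text.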
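Answer 1 (score: 0)
To write a script that is not explicitly coded for each dataset, you need to look for some kind of pattern. If the pattern is that the string you want is always the first two lines, you can use a for loop.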
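A minimal sketch of such a loop, assuming the lines are separated by `<br/>` tags and that the text you want is the first two non-empty lines. The helper name `first_lines` and the sample string are made up for illustration.

```python
def first_lines(forecast_text, count=2, separator="<br/>"):
    """Collect the first `count` non-empty lines of the forecast."""
    kept = []
    for line in forecast_text.split(separator):
        line = line.strip()
        if not line:
            # Consecutive separators produce empty strings; skip them.
            continue
        kept.append(line)
        if len(kept) == count:
            # Stop as soon as enough lines have been collected.
            break
    return kept

sample = "First line.<br/><br/>Second line.<br/>Third line."
print(first_lines(sample))
```

Stopping on a line count rather than on a literal such as "<br/><br/>PM2.5" avoids depending on wording that may change from day to day.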
If there seems to be some pattern around the simplified forecast, you could also try a regular expression.
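One way such a regular expression might look, assuming the summary is whatever precedes the first `<br/>` tag. That is an assumption about the feed based on the single sample in the question, and `summary` and `SUMMARY_PATTERN` are hypothetical names.

```python
import re

# Lazily capture everything up to the first <br> tag, or to the end of the
# string when there is no tag at all.
SUMMARY_PATTERN = re.compile(r'^(.*?)(?:<br\s*/?>|$)', re.DOTALL | re.IGNORECASE)

def summary(forecast_text):
    """Return the text that precedes the first <br> tag."""
    match = SUMMARY_PATTERN.match(forecast_text)
    return match.group(1).strip() if match else forecast_text.strip()

print(summary("Dry and sunny. Warm for the time of year.<br/><br/>PM2.5 rose overnight."))
```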
However, without more information about what the dataset looks like, I think this is the best approach I can come up with.
Answer 2 (score: 0)
These "random symbols" (<br/>) are the HTML encoding of a line break, i.e. a new line in HTML, so splitting on them looks reliable:
lines = today_weather.split('<br/>')
I think it is reasonable to assume that the first line is what you are after:
short_forecast = lines[0]
Time will tell whether that is correct, but you can easily adjust it to include more or less.
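To make that adjustment a single parameter, the split could be wrapped as in the sketch below. It assumes the separator may arrive either as a literal `<br/>` or HTML-escaped as `&lt;br/&gt;`; the helper name `short_forecast` and its `paragraphs` argument are hypothetical.

```python
import html

def short_forecast(forecast_text, paragraphs=1):
    """Return the first `paragraphs` <br/>-separated chunks, joined by spaces."""
    # Decode &lt;br/&gt; to <br/> if the text is escaped; literal tags are unchanged.
    decoded = html.unescape(forecast_text)
    # Drop the empty strings produced by doubled <br/><br/> separators.
    lines = [line.strip() for line in decoded.split('<br/>') if line.strip()]
    return ' '.join(lines[:paragraphs])

print(short_forecast("Dry. Warm.<br/><br/>PM2.5 rose.<br/>Ozone likely."))
print(short_forecast("Dry. Warm.<br/><br/>PM2.5 rose.<br/>Ozone likely.", paragraphs=2))
```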