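I'm working on a tool to help me format text sent by users. One of its features should detect proper nouns or acronyms written in lowercase, so that I can convert the appropriate characters to uppercase.
For example:
A single input string averages 200 words across 40 lines. I'm using JavaScript.
I know I probably can't catch every existing proper noun and acronym, especially since my input may be in several different languages. Still, I'd like suggestions for detecting as many wrongly lowercased words as possible while keeping good performance.
My first strategy was to build an array of common acronyms and proper nouns, which I populated with 260 words. I then go through the input string line by line, using a regular expression to try to find each word of the array.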
Needless to say, it ends up being a bit slow: considering only the `for` loops, it usually makes at least 10,400 comparisons per string (260 words × 40 lines).
The code is below:

    function format(text) {
        var input = text.split("\n");
        var result = [];
        for (var i = 0; i < input.length; i++) {
            result.push(handleExceptionsAndTypos(input[i]));
        }
        // Re-join with the line breaks that split() removed
        return result.join("\n");
    }

    function handleExceptionsAndTypos(s) {
        // getExceptionsAndTypos() returns the list of common proper nouns and acronyms
        var exceptionsAndTypos = getExceptionsAndTypos();
        var w;
        for (w in exceptionsAndTypos) {
            if (exceptionsAndTypos.hasOwnProperty(w)) {
                // Match w only when it is not surrounded by letters (including accented ones)
                s = s.replace(new RegExp("(^|[^A-Za-z\u00E0-\u00FC])" + w + "(?=([^A-Za-z\u00E0-\u00FC]|$))", 'ig'), "$1" + exceptionsAndTypos[w]);
            }
        }
        return s;
    }

Note: the regular expression is a bit complicated because the word boundary `\b` does not recognize accented characters ("á", "à", "ç", etc.), so I had to use `(^|[^A-Za-z\u00E0-\u00FC])` and `(?=([^A-Za-z\u00E0-\u00FC]|$))` as a substitute. Any suggestions on this issue are also welcome.
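To make the accented-character issue concrete, here is a minimal sketch comparing `\b` with the explicit letter class from the question. The word `us` and the sample strings are made up for illustration; only the character class comes from the code above.

```javascript
// Hypothetical word to look for, and a text where it is followed by an accented letter.
const word = "us";
const text = "usá";

// Without the "u" flag, \b only knows ASCII word characters,
// so it sees a boundary between "s" and "á".
const withWordBoundary = new RegExp("\\b" + word + "\\b", "i");

// Explicit letter class from the question: A-Z, a-z and U+00E0 to U+00FC.
const notLetter = "[^A-Za-z\\u00E0-\\u00FC]";
const withLetterClass = new RegExp(
  "(^|" + notLetter + ")" + word + "(?=(" + notLetter + "|$))",
  "i"
);

const boundaryMatches = withWordBoundary.test(text);             // true: false positive inside "usá"
const letterClassMatches = withLetterClass.test(text);           // false: "á" counts as a letter
const standaloneMatches = withLetterClass.test("the us, today"); // true: real word boundaries

console.log(boundaryMatches, letterClassMatches, standaloneMatches);
```

The first test shows why `\b` alone would rewrite part of a longer accented word, which is the behaviour the custom class avoids.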
Answer 0 (score: 0)
Answer 1 (score: -1)
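I came up with a simple idea, but I really don't know how much it would improve performance. Also, this basic idea generally only applies to acronyms.
First, you group the nouns by length. Then, for each word in the line you are checking, you use its length to decide which group to compare against. You can then use other criteria (for example, the first letter) to eliminate even more candidates, narrowing down the proper nouns and acronyms that have to be compared with that word.
A short example follows: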
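
    let str = "usa is where ...";
    // Acronyms grouped by length (lists abbreviated)
    let acr_list = {
        2: ["FR", "AR", "UK", "NB" /* ...etc */],
        3: ["USA" /* ...etc */]
        // ...
    };
    str.split(" ").forEach(w => {
        // Only look in the group matching the word's length (empty if there is none)
        let candidates = acr_list[w.length] || [];
        if (candidates.indexOf(w.toUpperCase()) !== -1) {
            // Do whatever you want to do
        }
    });
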
I don't know whether this idea can be taken any further, but any suggestions are of course welcome.
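The second filtering step described above (narrowing by first letter after narrowing by length) could be sketched roughly as follows. This is only an illustration under assumptions: `properForms`, `buildIndex` and `fixWord` are hypothetical names, the sample list is made up, and words are split on single spaces as in the example above.

```javascript
// Hypothetical sample list; a real list would hold the full set of entries.
const properForms = ["FR", "AR", "UK", "NB", "USA"];

// Build the index once:
// length -> first letter (lowercase) -> Map(lowercase word -> proper form).
function buildIndex(words) {
  const index = new Map();
  for (const word of words) {
    const key = word.toLowerCase();
    if (!index.has(key.length)) index.set(key.length, new Map());
    const byLetter = index.get(key.length);
    if (!byLetter.has(key[0])) byLetter.set(key[0], new Map());
    byLetter.get(key[0]).set(key, word);
  }
  return index;
}

// Return the proper form of a word, or the word itself when it is not indexed.
function fixWord(index, word) {
  const key = word.toLowerCase();
  const byLetter = index.get(key.length);
  const bucket = byLetter && byLetter.get(key[0]);
  return (bucket && bucket.get(key)) || word;
}

const index = buildIndex(properForms);
const fixed = "usa is where ..."
  .split(" ")
  .map(w => fixWord(index, w))
  .join(" ");
console.log(fixed); // "USA is where ..."
```

Because the innermost level is already a `Map` keyed by the lowercase word, the length and first-letter levels mainly mirror the description; a single flat `Map` would give a similar average lookup cost.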