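python - Parsing URLs

Time: 2017-03-25 21:41:34

Tags: python urlparse

I'm writing a simple script that checks whether a website appears on the first page of Google results for a given keyword.

Here is the function that parses a URL and returns the host name: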

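from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

def parse_url(url):
    # Keep only the host name (netloc) part of the URL.
    parsed = urlparse(url)
    return parsed.netloc

and starting from a list of tags selected with:

linkElems = soup.select('.r a')  # on Google's first results page, the result links sit inside elements with class "r"

I wrote this:

for link in linkElems:
    l = link.get("href")[7:]  # skip the leading "/url?q="
    url = parse_url(l)
    if "www.example.com" == url:
        pass  # do stuff (e.g. store in a list, etc.)

In the second line of that last snippet I have to start from index 7, because all href values start with "/url?q=".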

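I'm learning Python, so I'd like to know whether there is a better way to do this, or simply an alternative (perhaps using a regular expression, the replace method, or something from the urlparse library).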

1 answer:

Answer 0 (score: 0)

You can do this with the Python lxml module, which is an order of magnitude faster than BeautifulSoup.

It can be done like this:

import requests
from lxml import html

blah_url = "https://www.google.co.in/search?q=blah&oq=blah&aqs=chrome..69i57j0l5.1677j0j4&sourceid=chrome&ie=UTF-8"
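r = requests.get(blah_url).content  # raw bytes of the results page
root = html.fromstring(r)

# First result link, with the leading '/url?q=' redirect prefix removed
print(root.xpath('//h3[@class="r"]/a/@href')[0].replace('/url?q=', ''))
# Every result link on the page, cleaned the same way
print([url.replace('/url?q=', '') for url in root.xpath('//h3[@class="r"]/a/@href')])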

This results in:

http://www.urbandictionary.com/define.php%3Fterm%3Dblah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggTMAA&usg=AFQjCNFge5GFNmjpan7S_UCNjos1RP5vBA  
['http://www.urbandictionary.com/define.php%3Fterm%3Dblah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggTMAA&usg=AFQjCNFge5GFNmjpan7S_UCNjos1RP5vBA', 'http://www.dictionary.com/browse/blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggZMAE&usg=AFQjCNE1UVR3krIQHfEuIzHOeL0ZvB5TFQ', 'http://www.dictionary.com/browse/blah-blah-blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggeMAI&usg=AFQjCNFw8eiSqTzOm65PQGIFEoAz0yMUOA', 'https://en.wikipedia.org/wiki/Blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggjMAM&usg=AFQjCNFxEB8mEjEy6H3YFOaF4ZR1n3iusg', 'https://www.merriam-webster.com/dictionary/blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggpMAQ&usg=AFQjCNHYXX53LmMF-DOzo67S-XPzlg5eCQ', 'https://en.oxforddictionaries.com/definition/blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFgguMAU&usg=AFQjCNGlgcUx-BpZe0Hb-39XvmNua2n8UA', 'https://en.wiktionary.org/wiki/blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggzMAY&usg=AFQjCNGc9VmmyQls_rOBOR_lMUnt1j3Flg', 'http://dictionary.cambridge.org/dictionary/english/blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFgg5MAc&usg=AFQjCNHJgZR1c6VY_WgFa6Rm-XNbdFJGmA', 'http://www.thesaurus.com/browse/blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFgg-MAg&usg=AFQjCNEtnpmKxVJqUR7P1ss4VHnt34f4Kg', 'https://www.youtube.com/watch%3Fv%3D3taEuL4EHAg&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQtwIIRTAJ&usg=AFQjCNFnKlMFxHoYAIkl1MCrc_OXjgiClg']
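
Note that the URLs printed above still carry Google's tracking parameters (&sa=U&ved=...) and percent-escapes such as %3F, because replace only strips the prefix. As a possible alternative that stays within the standard library, the href can be treated as a URL in its own right and its q parameter read with parse_qs. This is only a sketch: it assumes the '/url?q=...' link shape shown in this thread, and the function names target_from_google_href and hostname_of are made up for illustration.

```python
from urllib.parse import urlparse, parse_qs  # Python 2: from urlparse import urlparse, parse_qs


def target_from_google_href(href):
    """Return the destination URL wrapped in a '/url?q=...' result link.

    Assumes the link shape shown in this thread; if there is no 'q'
    parameter, the href is returned unchanged.
    """
    query = parse_qs(urlparse(href).query)  # parse_qs also percent-decodes the values
    return query["q"][0] if "q" in query else href


def hostname_of(href):
    """Host name of the destination behind a Google result link."""
    return urlparse(target_from_google_href(href)).netloc


# The first href from the output above, with the '/url?q=' prefix put back.
href = ("/url?q=http://www.urbandictionary.com/define.php%3Fterm%3Dblah"
        "&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggTMAA"
        "&usg=AFQjCNFge5GFNmjpan7S_UCNjos1RP5vBA")

print(target_from_google_href(href))  # http://www.urbandictionary.com/define.php?term=blah
print(hostname_of(href))              # www.urbandictionary.com
```

With this, the check from the question becomes hostname_of(link.get("href")) == "www.example.com", with no hard-coded index 7.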