Python BeautifulSoup: problem parsing a table

Time: 2016-06-15 02:00:40

Tags: python parsing beautifulsoup

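Hello, I am using BeautifulSoup to parse the table on the following site, but it does not return all of the rows. I am looking for the article tags (http://itp.ne.jp/result/?kw=%92J%98e%8E%95%89%C8%83N%83%8A%83j%83b%83N):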

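import requests
from bs4 import BeautifulSoup

url = 'http://itp.ne.jp/result/?kw=%92J%98e%8E%95%89%C8%83N%83%8A%83j%83b%83N'
page = requests.get(url)
# Parse the raw response bytes with Python's built-in parser
prefsoup = BeautifulSoup(page.content, "html.parser")

# Collect every <article> element on the page
art = prefsoup.find_all("article")

print(art)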

[<article>
<section class="noimage">
<h4 class="clearfix">
<a class="blackText" href="/shop/KN0114031400001406/" target="_blank">谷脇歯科クリニック</a>
<a class="itrademark24" href="/stats_click/?s_bid=KN0114031400001406&amp;s_sid=FSP-LSR-001&amp;s_fr=V09&amp;s_ck=C12&amp;s_acd=7" target="_blank"><img alt="付加価値情報" src="/img/pc/shop/icon_itrade_7.gif"/></a>
</h4>
<p><span class="inlineSmallHeader">住所</span> 〒060-0042 北海道札幌市中央区大通西5丁目 <a class="boxedLink navigationLink" href="/shop/KN0114031400001406/map.html" target="_blank">地図・ナビ</a></p>
<p><span class="inlineSmallHeader">TEL</span>
<a class="whiteboxicon popup_04" href="/guide/phonemark.html">(代)</a>
<b>011-213-1184</b></p>
<p>
<span class="inlineSmallHeader">URL</span>
http://taniwaki-dental.com</p></section></article>]

However, the result is missing the last paragraph, which contains the email information:

<p><span class="inlineSmallHeader">EMAIL</span>
taniwaki@kzh.biglobe.ne.jp<!-- br-->            
</p>

Also, len(art) returns 2, and art[1] raises an index out of range error.

I tried several pages and ran into the same problem.

1 answer:

Answer 0 (score: 0)

Use the html5lib parser instead of html.parser and it will work like a charm. You only need to change the following line of code -

prefsoup = BeautifulSoup(page.content,"html.parser")

to -

prefsoup = BeautifulSoup(page.content,"html5lib")

Of course, you first need to install html5lib with pip install html5lib.

Also check this out - https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
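
To see whether the parser choice is really what changes the result, it can help to parse the same markup with every parser that is installed and compare how many elements each one finds. The following is only a minimal sketch: the helper name count_tags_per_parser and the sample string are made up for illustration and are not part of the question or of BeautifulSoup. It relies on BeautifulSoup raising FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_tags_per_parser(markup, tag, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: number of <tag> elements found}.

    Hypothetical helper for illustration; parsers that are not
    installed are skipped instead of raising an error.
    """
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # This parser is not available in the current environment
            continue
        counts[name] = len(soup.find_all(tag))
    return counts


# Made-up, well-formed sample; in practice pass page.content from the question
sample = "<article><p>URL</p><p>EMAIL</p></article>"
print(count_tags_per_parser(sample, "p"))
```

On well-formed markup such as the sample, the installed parsers should agree. If the counts differ for page.content, that suggests the page contains markup that the parsers repair differently, which would be consistent with the fix suggested in the answer.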