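How do I remove an element from web-scraped data?

Time: 2017-06-22 07:52:31

Tags: python html web-scraping beautifulsoup

I used BeautifulSoup and soup.findAll to get the relevant information, but I want to remove one value (the one between <TR>...</TR>), dropping the whole <TR> tag. How can I do this? I am on Python 2.7.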

.
.
.

soup = BeautifulSoup(x, 'lxml')

tab6col = soup.findAll('table', { "class" : "tab6col" })

This is my HTML:

[<table border="0" class="tab6col" id="pm">\n<tr><td>\xa0</td><td align="right" class="contentword"><b>2015. \xe9v</b></td><td align="right" class="contentword"><b>2014. \xe9v</b></td><td align="right" class="contentword"><b>2013. \xe9v</b></td><td align="right" class="contentword"><b>2012. \xe9v</b></td><td align="right" class="contentword"><b>2011. \xe9v</b></td></tr><tr><td class="contentword"><b>Besz\xe1mol\xe1si id\xf5szak</b></td><td align="right" class="contentword"><span class="pm_idoszak">2015.01.01. - 2015.12.31.</span></td><td align="right" class="contentword"><span class="pm_idoszak">2014.01.01. - 2014.12.31.</span></td><td align="right" class="contentword"><span class="pm_idoszak">2013.12.30. - 2013.12.31.</span></td><td align="right" class="contentword"><span class="pm_idoszak">Nincs adat.</span></td><td align="right" class="contentword"><span class="pm_idoszak">Nincs adat.</span></td></tr><tr><td>\xa0</td><td align="right" class="contentword">eFt</td><td align="right" class="contentword">eFt</td><td align="right" class="contentword">eFt</td><td align="right" class="contentword">eFt</td><td align="right" class="contentword">eFt</td></tr><tr><td class="contentword">\xc9rt\xe9kes\xedt\xe9s nett\xf3 \xe1rbev\xe9tele</td><td align="right" class="numberc"></td><td align="right" class="numberc"></td><td align="right" class="numberc"></td><td align="right" class="numberc">Nincs adat.</td><td align="right" class="numberc">Nincs adat.</td></tr><tr><td class="contentword">Bev\xe9telek</td><td align="right" class="numberc">2 873 821</td><td align="right" class="numberc">3 162 742</td><td align="right" class="numberc">9 194</td><td align="right" class="numberc"></td><td align="right" class="numberc"></td></tr><tr><td class="contentword">\xdczemi eredm\xe9ny</td><td align="right" class="numberc">81 937</td><td align="right" class="numberc">-181 850</td><td align="right" class="numberc">1 755</td><td align="right" class="numberc">Nincs adat.</td><td align="right" 
class="numberc">Nincs adat.</td></tr><tr><td class="contentword">Ad\xf3z\xe1s el\xf5tti eredm\xe9ny</td><td align="right" class="numberc">-192 778</td><td align="right" class="numberc">-169 476</td><td align="right" class="numberc">1 755</td><td align="right" class="numberc">Nincs adat.</td><td align="right" class="numberc">Nincs adat.</td></tr><tr><td class="contentword">M\xe9rleg szerinti eredm\xe9ny</td><td align="right" class="numberc">-124 099</td><td align="right" class="numberc">0</td><td align="right" class="numberc">1 421</td><td align="right" class="numberc">Nincs adat.</td><td align="right" class="numberc">Nincs adat.</td></tr><tr><td class="contentword">Ad\xf3zott eredm\xe9ny</td><td align="right" class="numberc">-192 778</td><td align="right" class="numberc">-169 476</td><td align="right" class="numberc">1 579</td><td align="right" class="numberc">Nincs adat.</td><td align="right" class="numberc">Nincs adat.</td></tr><tr><td class="contentword">Eszk\xf6z\xf6k \xf6sszesen</td><td align="right" class="numberc">37 820 881</td><td align="right" class="numberc">40 695 842</td><td align="right" class="numberc">36 992 091</td><td align="right" class="numberc">Nincs adat.</td><td align="right" class="numberc">Nincs adat.</td></tr><tr><td class="contentword">Befektetett eszk\xf6z\xf6k</td><td align="right" class="numberc">18 668 826</td><td align="right" class="numberc">18 525 063</td><td align="right" class="numberc">16 925 711</td><td align="right" class="numberc">Nincs adat.</td><td align="right" class="numberc">Nincs adat.</td></tr><tr><td class="contentword">Forg\xf3eszk\xf6z\xf6k</td><td align="right" class="numberc">19 008 587</td><td align="right" class="numberc">21 877 275</td><td align="right" class="numberc">19 792 420</td><td align="right" class="numberc">Nincs adat.</td><td align="right" class="numberc">Nincs adat.</td></tr><tr><td class="contentword">P\xe9nzeszk\xf6z\xf6k</td><td align="right" class="numberc">947 015</td><td align="right" 
class="numberc">1 056 101</td><td align="right" class="numberc">1 307 515</td><td align="right" class="numberc">Nincs adat.</td><td align="right" class="numberc">Nincs adat.</td></tr><tr><td class="contentword">Akt\xedv id\xf5beli elhat\xe1rol\xe1sok</td><td align="right" class="numberc">143 468</td><td align="right" class="numberc">293 504</td><td align="right" class="numberc">273 960</td><td align="right" class="numberc">Nincs adat.</td><td align="right" class="numberc">Nincs adat.</td></tr><tr><td class="contentword">Saj\xe1t t\xf5ke</td><td align="right" class="numberc">2 141 319</td><td align="right" class="numberc">2 184 079</td><td align="right" class="numberc">2 353 554</td><td align="right" class="numberc">Nincs adat.</td><td align="right" class="numberc">Nincs adat.</td></tr><tr><td class="contentword">C\xe9ltartal\xe9kok</td><td align="right" class="numberc">29 656</td><td align="right" class="numberc">148 652</td><td align="right" class="numberc">18 960</td><td align="right" class="numberc">Nincs adat.</td><td align="right" class="numberc">Nincs adat.</td></tr><tr><td class="contentword">K\xf6telezetts\xe9gek</td><td align="right" class="numberc">35 541 531</td><td align="right" class="numberc">38 059 399</td><td align="right" class="numberc">34 233 518</td><td align="right" class="numberc">Nincs adat.</td><td align="right" class="numberc">Nincs adat.</td></tr><tr><td class="contentword">R\xf6vid lej\xe1rat\xfa k\xf6telezetts\xe9gek</td><td align="right" class="numberc">30 519 491</td><td align="right" class="numberc">30 426 014</td><td align="right" class="numberc">26 394 088</td><td align="right" class="numberc">Nincs adat.</td><td align="right" class="numberc">Nincs adat.</td></tr><tr><td class="contentword">Hossz\xfa lej\xe1rat\xfa k\xf6telezetts\xe9gek</td><td align="right" class="numberc">5 022 040</td><td align="right" class="numberc">7 633 385</td><td align="right" class="numberc">7 839 430</td><td align="right" class="numberc">Nincs 
adat.</td><td align="right" class="numberc">Nincs adat.</td></tr><tr><td class="contentword">Passz\xedv id\xf5beli elhat\xe1rol\xe1sok</td><td align="right" class="numberc">108 375</td><td align="right" class="numberc">303 712</td><td align="right" class="numberc">386 059</td><td align="right" class="numberc">Nincs adat.</td><td align="right" class="numberc">Nincs adat.</td></tr><tr><td class="contentword" colspan="6"><b>P\xe9nz\xfcgyi mutat\xf3k</b></td></tr><tr><td class="contentword">Elad\xf3sodotts\xe1g foka <span onmouseout="remove_hint();" onmouseover="show_hint(this, '&lt;span style=&quot;color: red; font-weight: bold;&quot;&gt;Elad\xf3sodotts\xe1g foka&lt;/span&gt; (K\xf6telezetts\xe9gek/Eszk\xf6z\xf6k \xf6sszesen)&lt;br&gt;&lt;i&gt;Megmutatja, hogy az eszk\xf6z \xe1llom\xe1ny milyen m\xe9rt\xe9kben van megterhelve k\xf6telezetts\xe9gv\xe1llal\xe1ssal. Min\xe9l kisebb a mutat\xf3 \xe9rt\xe9ke, ann\xe1l jobb a c\xe9g meg\xedt\xe9l\xe9se.&lt;/i&gt;');" style="cursor: pointer; color: red; font-family: InformationLogo, Webdings;">i</span></td><td align="right" class="numberc"></td><td align="right" class="numberc"></td><td align="right" class="numberc"></td><td align="right" class="numberc">Nincs adat.</td><td align="right" class="numberc">Nincs adat.</td></tr><tr><td class="contentword">Elad\xf3sodotts\xe1g m\xe9rt\xe9ke - Bonit\xe1s <span onmouseout="remove_hint();" onmouseover="show_hint(this, '&lt;span style=&quot;color: red; font-weight: bold;&quot;&gt;Elad\xf3sodotts\xe1g m\xe9rt\xe9ke - Bonit\xe1s&lt;/span&gt; (K\xf6telezetts\xe9gek/Saj\xe1t t\xf5ke)&lt;br&gt;&lt;i&gt;Azt mutatja, hogy a saj\xe1t forr\xe1sok a k\xf6telezetts\xe9gek h\xe1ny sz\xe1zal\xe9k\xe1t fedezik. 
Pozit\xedv a c\xe9g meg\xedt\xe9l\xe9se, ha a mutat\xf3 \xe9rt\xe9ke tart\xf3san (j\xf3val) 1 alatt van.&lt;/i&gt;');" style="cursor: pointer; color: red; font-family: InformationLogo, Webdings;">i</span></td><td align="right" class="numberc"></td><td align="right" class="numberc"></td><td align="right" class="numberc"></td><td align="right" class="numberc">Nincs adat.</td><td align="right" class="numberc">Nincs adat.</td></tr><tr><td class="contentword">\xc1rbev\xe9tel ar\xe1nyos eredm\xe9ny % <span onmouseout="remove_hint();" onmouseover="show_hint(this, '&lt;span style=&quot;color: red; font-weight: bold;&quot;&gt;\xc1rbev\xe9tel ar\xe1nyos eredm\xe9ny %&lt;/span&gt; (Ad\xf3zott eredm\xe9ny/ Nett\xf3 \xe1rbev\xe9tel)\xd7100&lt;br&gt;&lt;i&gt;A mutat\xf3 az \xe1rbev\xe9tel hat\xe9konys\xe1g\xe1t fejezi ki \xfagy, hogy az \xe1rbev\xe9tel nyeres\xe9gtartalm\xe1t sz\xe1zal\xe9kban szeml\xe9lteti. A c\xe9g meg\xedt\xe9l\xe9se ann\xe1l pozit\xedvabb, min\xe9l magasabb a sz\xe1zal\xe9k.&lt;/i&gt;');" style="cursor: pointer; color: red; font-family: InformationLogo, Webdings;">i</span></td><td align="right" class="numberc"></td><td align="right" class="numberc"></td><td align="right" class="numberc"></td><td align="right" class="numberc">Nincs adat.</td><td align="right" class="numberc">Nincs adat.</td></tr><tr><td class="contentword">Likvidit\xe1si gyorsr\xe1ta <span onmouseout="remove_hint();" onmouseover="show_hint(this, '&lt;span style=&quot;color: red; font-weight: bold;&quot;&gt;Likvidit\xe1si gyorsr\xe1ta&lt;/span&gt; ((Forg\xf3eszk\xf6z\xf6k-K\xe9szletek)/R\xf6vid lej.k\xf6telezetts\xe9gek)&lt;br&gt;&lt;i&gt;Azt fejezi ki, hogy az egy \xe9v alatt p\xe9nzz\xe9 tehet\xf5 k\xe9szletek n\xe9lk\xfcli forg\xf3eszk\xf6z\xf6k milyen ar\xe1nyban k\xe9pesek az egy \xe9ven bel\xfcl esed\xe9kes k\xf6telezetts\xe9gek fedez\xe9s\xe9re, azaz milyen a c\xe9g r\xf6vid t\xe1v\xfa fizet\xf5k\xe9pess\xe9ge.&lt;br&gt;A c\xe9g meg\xedt\xe9l\xe9se akkor pozit\xedv, ha ez az ar\xe1ny 
egyre n\xf6vekv\xf5, ami az azonnali fizet\xf5k\xe9pess\xe9g javul\xe1s\xe1t jelzi.&lt;/i&gt;');" style="cursor: pointer; color: red; font-family: InformationLogo, Webdings;">i</span></td><td align="right" class="numberc"></td><td align="right" class="numberc"></td><td align="right" class="numberc"></td><td align="right" class="numberc">Nincs adat.</td><td align="right" class="numberc">Nincs adat.</td></tr><tr><td class="contentword">Saj\xe1t t\xf5ke ar\xe1nya <span onmouseout="remove_hint();" onmouseover="show_hint(this, '&lt;span style=&quot;color: red; font-weight: bold;&quot;&gt;Saj\xe1t t\xf5ke ar\xe1nya &lt;/span&gt; (Saj\xe1t t\xf5ke / Forr\xe1sok)');" style="cursor: pointer; color: red; font-family: InformationLogo, Webdings;">i</span></td><td align="right" class="numberc">0,06</td><td align="right" class="numberc">0,05</td><td align="right" class="numberc">0,06</td><td align="right" class="numberc"></td><td align="right" class="numberc"></td></tr><tr><td class="contentword">Eszk\xf6zar\xe1nyos nyeres\xe9g <span onmouseout="remove_hint();" onmouseover="show_hint(this, '&lt;span style=&quot;color: red; font-weight: bold;&quot;&gt;Eszk\xf6zar\xe1nyos nyeres\xe9g &lt;/span&gt; (Ad\xf3zott eredm\xe9ny / Eszk\xf6z\xf6k)');" style="cursor: pointer; color: red; font-family: InformationLogo, Webdings;">i</span></td><td align="right" class="numberc">-0,01</td><td align="right" class="numberc">0,00</td><td align="right" class="numberc">0,00</td><td align="right" class="numberc"></td><td align="right" class="numberc"></td></tr><tr><td class="contentword">Bev\xe9telar\xe1nyos eredm\xe9ny <span onmouseout="remove_hint();" onmouseover="show_hint(this, '&lt;span style=&quot;color: red; font-weight: bold;&quot;&gt;Bev\xe9telar\xe1nyos eredm\xe9ny &lt;/span&gt; (Ad\xf3zott eredm\xe9ny / Bev\xe9telek)');" style="cursor: pointer; color: red; font-family: InformationLogo, Webdings;">i</span></td><td align="right" class="numberc">-0,07</td><td align="right" 
class="numberc">-0,05</td><td align="right" class="numberc">0,17</td><td align="right" class="numberc"></td><td align="right" class="numberc"></td></tr><tr><td class="contentword">Saj\xe1t t\xf5ke ar\xe1nyos nyeres\xe9g <span onmouseout="remove_hint();" onmouseover="show_hint(this, '&lt;span style=&quot;color: red; font-weight: bold;&quot;&gt;Saj\xe1t t\xf5ke ar\xe1nyos nyeres\xe9g &lt;/span&gt; (Ad\xf3zott eredm\xe9ny / Saj\xe1t t\xf5ke)');" style="cursor: pointer; color: red; font-family: InformationLogo, Webdings;">i</span></td><td align="right" class="numberc">-0,09</td><td align="right" class="numberc">-0,08</td><td align="right" class="numberc">0,00</td><td align="right" class="numberc"></td><td align="right" class="numberc"></td></tr><tr><td class="contentword" colspan="6"><b>L\xe9tsz\xe1m:</b> \xa0 136 f\xf5</td>\n</tr></table>]

I want to remove this value from the table:

<tr><td class="contentword" colspan="6"><b>P\xe9nz\xfcgyi mutat\xf3k</b></td></tr>

My full code:

import urllib2
import unicodecsv as csv
import os
import sys
import io
import time
import datetime
import pandas as pd
from bs4 import BeautifulSoup
import MySQLdb

def to_2d(l,n):
    return [l[i:i+n] for i in range(0, len(l), n)]

filename=r'output.csv'

resultcsv=open(filename,"wb")
output=csv.writer(resultcsv, delimiter=';',quotechar = '"', quoting=csv.QUOTE_NONNUMERIC, encoding='latin-1')

f = open('opten2.txt', 'r')
x = f.read()

soup = BeautifulSoup(x, 'lxml')

tab6col = soup.find('table', { "class" : "tab6col" })


datatable=[]
for record in tab6col.findAll('tr'):
    for data in record.findAll('td'):
        datatable.append(data.text.encode('latin-1'))

td = datatable.find("td", text="P\xe9nz\xfcgyi mutat\xf3k")
td.decompose()


maindatatable = to_2d(datatable, 6)
print maindatatable
output.writerows(maindatatable)

resultcsv.close()

1 Answer:

Answer 0 (score: 1)

What you need is decompose(). Find the <tr> tag that holds that text and remove it with decompose().

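soup = BeautifulSoup(x, "lxml")
tab6col = soup.find("table", { "class" : "tab6col" })
# The u"..." prefix makes this a unicode literal, so the accented text
# can match the parsed document under Python 2.
row = tab6col.find("tr", text=u"P\xe9nz\xfcgyi mutat\xf3k")
row.decompose()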
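
To see the idea in isolation, here is a minimal sketch that runs decompose() on a small, made-up table. The trimmed HTML, the html.parser backend (chosen only so that lxml is not required) and the variable names are assumptions for illustration, not part of the original post. It uses the string= argument, which Beautiful Soup 4.4.0 and later accept in place of text=, and it guards against the heading not being found.

```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for the table in the question.
html = u"""
<table class="tab6col">
<tr><td class="contentword">Bev\xe9telek</td><td class="numberc">2 873 821</td></tr>
<tr><td class="contentword" colspan="6"><b>P\xe9nz\xfcgyi mutat\xf3k</b></td></tr>
<tr><td class="contentword">Saj\xe1t t\xf5ke</td><td class="numberc">2 141 319</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "tab6col"})

# Locate the <b> heading by its exact text, then climb to the enclosing <tr>.
heading = table.find("b", string=u"P\xe9nz\xfcgyi mutat\xf3k")
if heading is not None:
    # decompose() removes the row and everything inside it from the tree.
    heading.find_parent("tr").decompose()

# Collect the remaining cells row by row.
rows = [[td.get_text() for td in tr.find_all("td")]
        for tr in table.find_all("tr")]
```

After this runs, rows should hold only the two data rows. The important point is that decompose() is called on the parsed tree, before the cells are copied into a plain Python list.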

Edit

Try this.

import urllib2
import unicodecsv as csv
import os
import sys
import io
import time
import datetime
import pandas as pd
from bs4 import BeautifulSoup
import MySQLdb

filename=r'output.csv'

resultcsv=open(filename,"wb")
output=csv.writer(resultcsv, delimiter=';',quotechar = '"', quoting=csv.QUOTE_NONNUMERIC, encoding='latin-1')

f = open('opten2.txt', 'r')
x = f.read()
f.close()

soup = BeautifulSoup(x, 'lxml') 
tab6col = soup.find('table', { "class" : "tab6col" }) 

datatable=[]
for record in tab6col.find_all('tr'):
    temp_data = []
    for data in record.find_all('td'):
        temp_data.append(data.text.encode('latin-1'))
    datatable.append(temp_data)

output.writerows(datatable)

resultcsv.close()
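
Note that the loop above collects every row, so the heading row is still in datatable, and a plain Python list has no find() or decompose() (calling datatable.find(...) on a list, as in the question, raises AttributeError). If you would rather filter after extraction than edit the tree, something along these lines might work. This is only a sketch: the sample rows, the UNWANTED constant and the drop_rows helper are hypothetical names made up for illustration.

```python
# -*- coding: utf-8 -*-

# Hypothetical rows, shaped like the `datatable` list of lists built above,
# but holding text (not yet encoded to latin-1).
datatable = [
    [u"Bev\xe9telek", u"2 873 821", u"3 162 742"],
    [u"P\xe9nz\xfcgyi mutat\xf3k"],
    [u"Saj\xe1t t\xf5ke", u"2 141 319", u"2 184 079"],
]

# Text of the heading row that should not end up in the CSV.
UNWANTED = u"P\xe9nz\xfcgyi mutat\xf3k"


def drop_rows(rows, unwanted):
    """Return a new list without the rows whose first cell equals `unwanted`."""
    return [row for row in rows
            if not (row and row[0].strip() == unwanted)]


filtered = drop_rows(datatable, UNWANTED)
```

The comparison is done on text, so it is safest to filter before calling .encode('latin-1') on the cells and to pass the filtered list to output.writerows().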