Retrieving multiple strings from an HTML TD: looping over TD elements, splitting the strings, and joining them?

Time: 2017-09-26 16:30:43

Tags: python string beautifulsoup

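I am trying to learn how to use Beautiful Soup to scrape data from a website that has concatenated the key strings into a single block. I am reasonably competent at Googling and have had some success, but I am stuck at this point. It seems I am missing something basic, and after going around in circles I have to ask for help. I hope someone can point me in the right direction or give me some feedback on where I went wrong:

First: I am giving a simplified version of the problem because I don't want to post a book. If anyone is willing to dig into the problem and the actual mistakes I made, I will attach the script and the actual code I wrote in a separate file. I believe it is a small conceptual error in how I handle strings and lists. Without further delay: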


<html>
    <head>


        <center>
        <font face="arial" size="5">
        <table border="0" cellpadding="0" cellspacing="0" width="100%" bgcolor="#000066">

        <tr>
            <td align="left" valign="top" bgcolor="#000066">

          <a href="/"><img height="50" width="540" src="/leftbar-quote.gif" border="0" usemap="#leftbar10b39c7"></a>
              <map name="leftbar10b39c7"><area href="/outside/multi.htm" coords="328,5,390,36" shape="rect">
              <area href="/index.htm" coords="254,5,322,37" shape="rect">
              <area href="#" coords="185,5,251,35" shape="rect" onclick="history.back(); return false;">
              <area href="/cgi-bin/quoteForm.cgi?type=q&sEmail=&part=Engine&dbPart=300.1&dbSubPart=&model=Ford%20Focus&dbModel=27.20&year=2005&name=AutoPartex.net&int=-1&uIMS=&userSearch=exact&seqNum=600000000000000000456918622&ref=&userid=1000&email=&userClaim=&userLang=&userZip=&selleruserid=1000" coords="400,5,460,36" shape="rect">
              <area href="/buyerfaq.htm" coords="470,5,530,36" shape="rect">
              </map>


            </td>
            <td valign=top><div align="right"><img height="50" width="36" src="/result-rs.gif"></div></td>
        </tr>
<tr>
<td COLSPAN=2><table WIDTH="100%"><tr>
            <td width="10" valign="top"><img height="30" width="10" src="/trans4.gif"></td>
            <td width="90%">
            <b>
<div style='font-size:18pt; font-style: italic; color: white;'><b>Results sorted by <u>PRICE</u></b> <span class="small"><b>(Click on heading to re-sort)</b></span><br /></div><font color='#FFFFFF' face='Arial,Helvetica,Geneva,Swiss,SunSans-Regular' size='2'>Click back to modify your previous choice.<br>Most prices do not include extended warranties or shipping.<br>Not all displayed parts are interchangeable.  Please verify with the recycler that the part fits your auto.<br /></font></b></td><td valign=bottom align=center><table bgcolor="#e4e4e4"width=350 cellpadding=3 border=1 cellspacing=0><tr><td align=center><form method="post" action="/cgi-bin/search.cgi" style="display: inline"><input type= hidden name=userDate value="2005"><input type= hidden name=userModel value="Ford Focus"><input type= hidden name=userLocation value="USA"><input type= hidden name=userPreference value="price"><input type= hidden name=userZip value=""><input type="hidden" name="userPage" value="1"><input type="hidden" name="userInterchange" value="None"><input type="hidden" name="userDate2" value="Ending Year"><input type="hidden" name="userSearch" value="int"><input type="hidden" NAME="userClaim" VALUE="">
<input type="hidden" NAME="userClaimer" VALUE="">
<input type="hidden" NAME="userLang" VALUE="">
<input type="hidden" NAME="userLat" VALUE="">
<input type="hidden" NAME="userLong" VALUE="">
<input type="hidden" NAME="userCSA" VALUE="">
<input type="hidden" NAME="userMCO" VALUE="">
<input type="hidden" NAME="userAdjuster" VALUE="">
<input type="hidden" NAME="userItem" VALUE="">
<input type="hidden" NAME="hpsDate" VALUE="">
<input type="hidden" NAME="hpsGroup" VALUE="">
<input type="hidden" NAME="reqId" VALUE="">
<input type="hidden" NAME="thirdMapType" VALUE="">
<input type="hidden" NAME="vendUrl" VALUE="">
<input type="hidden" NAME="iCN" VALUE="">
<input type='hidden' name='limitYears' value=''>
<input type='hidden' name='userIntSelect' value='711575'>
<input type='hidden' name='userVIN' value=''>
<input type='hidden' name='vinSearch' value='0'>
<input type='hidden' name='userVINModelID' value=''>
<input type="hidden" name="uID" value=""><input type="hidden" name="uPass" value=""><table bgcolor="#e4e4e4" width=350 cellpadding=3 border=1 cellspacing=0><tr><td colspan=2 align=center>2005&nbsp;Ford Focus<br>Engine<br></td></tr><tr>
<td align=center>

    <font style="font-size: 10pt">Non-Interchange search for year:<br></font>
    <font style="font-size: 10pt"><b>2005</b><br><br></font>
<br>
<br><font style="font-size: 8pt"><a style="color:blue" href="/cgi-bin/search.cgi?userDate=2005&userModel=Ford%20Focus&userPart=Engine&origPart=&userPreference=price&userZip=&userLat=&userLong=&userVIN=&dbPart=300.1&userIntSelect=711575&userClaimer=&userClaim=&uID=&uPass=&userLocation=USA&userSearch=int">Click Here</a> to see All Interchange Choices </font>
</td>
</table></table></form>
</td></tr></table></td></tr></table><table width="100%" border="1" cellspacing="0" cellpadding="4">
<tr align=center>
    <td><a href='/cgi-bin/search.cgi?userSearch=exact&userPID=1000&userLocation=USA&userIMS=&userInterchange=%5B%7C%7Br&userSide=&userDate=2005&userDate2=2005&dbModel=27.20&userModel=Ford%20Focus&dbPart=300.1&userPart=Engine&sessionID=600000000000000000456918622&userPreference=year&userIntSelect=711575&userUID=0&userBroker=&userPage=1&iKey='>Year</a><br>Part<br>Model</td>
    <td>Description</td>
    <td><a href='/cgi-bin/search.cgi?userSearch=exact&userPID=1000&userLocation=USA&userIMS=&userInterchange=%5B%7C%7Br&userSide=&userDate=2005&userDate2=2005&dbModel=27.20&userModel=Ford%20Focus&dbPart=300.1&userPart=Engine&sessionID=600000000000000000456918622&userPreference=miles&userIntSelect=711575&userUID=0&userBroker=&userPage=1&iKey='>Miles</a></td>
<td><a href='/cgi-bin/search.cgi?userSearch=exact&userPID=1000&userLocation=USA&userIMS=&userInterchange=%5B%7C%7Br&userSide=&userDate=2005&userDate2=2005&dbModel=27.20&userModel=Ford%20Focus&dbPart=300.1&userPart=Engine&sessionID=600000000000000000456918622&userPreference=grade&userIntSelect=711575&userUID=0&userBroker=&userPage=1&iKey='>Part <br> Grade</a></td>   <td>Stock#</td>
    <td>US<br>Price</td>
    <td>Dealer Info</td></tr><tr><td>2005<br>Engine Assembly<br>Ford Focus</td><td><a href=""><img width="100" hspace="3" align="middle" onclick="return popupImg('seller=2013&partGUID=2013-1-282435&vehicleGUID=2013-1-V18432&display=2005%20Ford%20Focus%20Engine%20Assembly-Stock%23%2010286')" src="http://wsimgoh.autopartex.net/2013/2015/10286/2013_18432_05_thumb.jpg"></img></a>ZX4,2.0,EFI,FATO,FWDRUNSGREAT</td><td align=right>&nbsp;</td><td align=center>&nbsp;</td><td>10286</td><td align=center>$350550</td><td><A HREF="http://www.LaPointAuto.com" target="_top">LaPoint Discount MIDW</A> USA-OH(Holland)  <A HREF="/cgi-bin/quoteForm.cgi?type=g&sEmail=shawn@LaPointAuto.com&email=&part=Engine%20Assembly&dbPart=300.1&dbSubPart=&model=Ford%20Focus&dbModel=27.20&year=2005&stockNum=10286&price=350550&desc=ZX4%2C2.0%2CEFI%2CFATO%2CFWDRUNSGREAT&name=LaPoint%20Discount%20MIDW&url=http://www.LaPointAuto.com&int=-1&broker=0&recycler=0&selleruserid=2013&miles=-1&condition=-1&userid=1000&uIMS=&seqNum=600000000000000000456918622&userClaim=&userLang=">Request_Quote</A> 419-865-2329 / 800-845-0270 <A HREF="/cgi-bin/quoteForm.cgi?type=i&sEmail=shawn@LaPointAuto.com&email=&part=Engine%20Assembly&dbPart=300.1&dbSubPart=&model=Ford%20Focus&dbModel=27.20&year=2005&stockNum=10286&price=350550&desc=ZX4%2C2.0%2CEFI%2CFATO%2CFWDRUNSGREAT&name=LaPoint%20Discount%20MIDW&url=http://www.LaPointAuto.com&int=-1&broker=0&recycler=0&selleruserid=2013&miles=-1&condition=-1&userid=1000&uIMS=&seqNum=600000000000000000456918622&userClaim=&userLang=">Request_Insurance_Quote</A><br><a target=_blank href="http://appcgi.autopartex.net/cgi-bin/applet.cgi?sid=2013&brf=&bds=&bsr=price&pin=&pyr=2005&pmd=Ford%20Focus&ppt=Engine%20Assembly&ppr=350550&pst=10286&pgr=&bty=WEB&bem=&bzp=&ses=600000000000000000456918622" onclick='window.open(this.href,this.target,getPrm()); return false'><img src='/images/LiveChat_space.gif' border=0></a></b></td></tr><tr><td>2005<br>Engine Assembly<br>Ford 
Focus</td><td>TESTED,2.3L,5MT,08/04,FWD,+CORE</td><td align=right>&nbsp;</td><td align=center>&nbsp;</td><td>E94764</td><td align=center>$1500</td><td><A HREF="http://www.ParadiseAutoParts.com" target="_top">Paradise Auto Parts-ELITE</A> USA-MD(Elkton)  <A HREF="/cgi-bin/quoteForm.cgi?type=g&sEmail=mdriver@complete-recycle.com&email=&part=Engine%20Assembly&dbPart=300.1&dbSubPart=&model=Ford%20Focus&dbModel=27.20&year=2005&stockNum=E94764&price=1500&desc=TESTED%2C2.3L%2C5MT%2C08%2F04%2CFWD%2C%2BCORE&name=Paradise%20Auto%20Parts-ELITE&url=http://www.ParadiseAutoParts.com&int=-1&broker=0&recycler=0&selleruserid=2843&miles=-1&condition=-1&userid=1000&uIMS=&seqNum=600000000000000000456918622&userClaim=&userLang=">Request_Quote</A> 888-811-5051/410-620-5051 <A HREF="/cgi-bin/quoteForm.cgi?type=i&sEmail=mdriver@complete-recycle.com&email=&part=Engine%20Assembly&dbPart=300.1&dbSubPart=&model=Ford%20Focus&dbModel=27.20&year=2005&stockNum=E94764&price=1500&desc=TESTED%2C2.3L%2C5MT%2C08%2F04%2CFWD%2C%2BCORE&name=Paradise%20Auto%20Parts-ELITE&url=http://www.ParadiseAutoParts.com&int=-1&broker=0&recycler=0&selleruserid=2843&miles=-1&condition=-1&userid=1000&uIMS=&seqNum=600000000000000000456918622&userClaim=&userLang=">Request_Insurance_Quote</A><br><a target=_blank href="http://appcgi.autopartex.net/cgi-bin/applet.cgi?sid=2843&brf=&bds=&bsr=price&pin=&pyr=2005&pmd=Ford%20Focus&ppt=Engine%20Assembly&ppr=1500&pst=E94764&pgr=&bty=WEB&bem=&bzp=&ses=600000000000000000456918622" onclick='window.open(this.href,this.target,getPrm()); return false'><img src='/images/LiveChat_space.gif' border=0></a></b></td></tr><tr><td>2005<br>Engine Assembly<br>Ford Focus</td><td>175-175</td><td align=right>38,916</td><td align=center>A</td><td>FC6555</td><td align=center>$1250</td><td><A HREF="http://www.DonsSportcar.com" target="_top">Don's Sportcar</A> USA-CO(Pueblo)  <A 
HREF="/cgi-bin/quoteForm.cgi?type=g&sEmail=parts@DonsSportcar.com&email=&part=Engine%20Assembly&dbPart=300.1&dbSubPart=&model=Ford%20Focus&dbModel=27.20&year=2005&stockNum=FC6555&price=1250&desc=175-175&name=Don's%20Sportcar&url=http://www.DonsSportcar.com&int=-1&broker=0&recycler=0&selleruserid=3776&miles=38.916&condition=-1&userid=1000&uIMS=&seqNum=600000000000000000456918622&userClaim=&userLang=">Request_Quote</A> 800-332-3649 <A HREF="/cgi-bin/quoteForm.cgi?type=i&sEmail=parts@DonsSportcar.com&email=&part=Engine%20Assembly&dbPart=300.1&dbSubPart=&model=Ford%20Focus&dbModel=27.20&year=2005&stockNum=FC6555&price=1250&desc=175-175&name=Don's%20Sportcar&url=http://www.DonsSportcar.com&int=-1&broker=0&recycler=0&selleruserid=3776&miles=38.916&condition=-1&userid=1000&uIMS=&seqNum=600000000000000000456918622&userClaim=&userLang=">Request_Insurance_Quote</A><br><a target=_blank href="http://appcgi.autopartex.net/cgi-bin/applet.cgi?sid=3776&brf=&bds=&bsr=price&pin=&pyr=2005&pmd=Ford%20Focus&ppt=Engine%20Assembly&ppr=1250&pst=FC6555&pgr=A&bty=WEB&bem=&bzp=&ses=600000000000000000456918622" onclick='window.open(this.href,this.target,getPrm()); return false'><img src='/images/LiveChat_space.gif' border=0></a></b></td></tr>
</table>
</div>
</body> </html>

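That is the HTML text and structure. Here is where I actually need help with the approach:

  1. Since there are no CSS classes or ids to hook into, I could not find a conventional example using XPath or something like Selenium. I may be wrong about that, though; I'm a noob.

  2. I need to separate the text in a cell into individual strings.

  3. Using BeautifulSoup, I tried several methods to get the text.

  4. After trying something like this, I received this error: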

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(open("./test.html"), "lxml")

    trs = soup.find_all('tr')

    for tr in trs:

        tds = tr.find_all("td")
    
        try:
            result = str(tds[0].get_text())
    
        except:
              adjust =  ' '
              continue
    
        result = result.split(" ")
    
        result = str.replace('2005Engine', "2005Engine", "2005 ")  + str.replace('AssemblyFord', "AssemblyFord", "Engine Assembly ") + str.repl$
    
        strresult = ''.join(result)
    
    
        trs = soup.find_all('tr')
    
        for tr in trs:
    
               tds = tr.find_all("td")
    
               tds[0] = strresult
    
               tds.get_text()
    
               print(tds)
    

    Error message:

    Traceback (most recent call last):
      File "carpartbs5.find.td.py", line 33, in <module>
        tds.get_text()
      File "/usr/local/lib/python2.7/dist-packages/bs4/element.py", line 1807, in __getattr__
        "ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
    AttributeError: ResultSet object has no attribute 'get_text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

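    Here is the flip side:

    When I print tds, the first td has been replaced by whatever I assigned to it. However, whenever I try to return the text with BeautifulSoup's get_text() method, it throws that error. The error seems to say that I am calling a method on something that does not support it.

    So I am not entirely clear on lists versus strings. I tried converting my list into an actual string, but that did not work. I think get_text() fails because I am calling it on a list. If so, is there a better way with BeautifulSoup to accomplish the following: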

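    1. Get the individual pieces of text from these positions in each element
    2. Join them into one comma-separated result string?

    I hope this helps; I don't have enough points to post images or upload files. The last block of text below is what my program spits out if I don't try to call a Beautiful Soup method on the tds variable. Thanks in advance!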

      My code

      from bs4 import BeautifulSoup

      soup = BeautifulSoup(open("./test.html"), "lxml")

      trs = soup.find_all('tr')

      for tr in trs:

          tds = tr.find_all("td")
      
          try:
              result = str(tds[0].get_text())
      
          except:
                adjust =  ' '
                continue
      
          result = result.split(" ")
      
          result = str.replace('2005Engine', "2005Engine", "2005 ")  + str.replace('AssemblyFord', "AssemblyFord", "Engine Assembly ") + str.repl$
      
          strresult = ''.join(result)
      
      
          trs = soup.find_all('tr')
      
          for tr in trs:
      
                 tds = tr.find_all("td")
      
                 tds[0] = strresult
      
                 print(tds)
      

      What is returned - example

      ['2005 Engine Assembly Ford Focus ', <td>139K</td>, <td align="right">\xa0</td>, <td align="center">\xa0</td>, <td>0232</td>, <td align="center">$800</td>, <td><a href="http://someurl.com" target="_top">Chads Part </a> USA-FL(Jacksonville)  <a href="/cgi-bin/quoteForm.cgi?type=g&amp;sEmail=chadsparts@someplace.com&amp;email=&amp;part=Engine%20Assembly&amp;dbPart=300.1&amp;dbSubPart=&amp;model=Ford%20Focus&amp;dbModel=27.20&amp;year=2005&amp;stockNum=0232&amp;price=800&amp;desc=139K&amp;name=Chads%20Parts&amp;url=http://someurl.com&amp;int=-1&amp;broker=0&amp;recycler=0&amp;selleruserid=3566&amp;miles=-1&amp;condition=-1&amp;userid=1000&amp;uIMS=&amp;seqNum=600000000000000000456918622&amp;userClaim=&amp;userLang=">Request_Quote</a> 1-510-569-4845 <a href="/cgi-bin/quoteForm.cgi?type=i&amp;sEmail=chadsparts@someplace.com&amp;email=&amp;part=Engine%20Assembly&amp;dbPart=300.1&amp;dbSubPart=&amp;model=Ford%20Focus&amp;dbModel=27.20&amp;year=2005&amp;stockNum=0232&amp;price=800&amp;desc=139K&amp;name=Chads%20Parts=rs&amp;url=http://someurl.com&amp;int=-1&amp;broker=0&amp;=0&amp;selleruserid=3566&amp;miles=-1&amp;condition=-1&amp;userid=1000&amp;uIMS=&amp;seqNum=600000000000000000456918622&amp;userClaim=&amp;userLang=">Request_Insurance_Quote</a><br/><a href="http://someurl.com/cgi-bin/applet.cgi?sid=3566&amp;brf=&amp;bds=&amp;bsr=price&amp;pin=&amp;pyr=2005&amp;pmd=Ford%20Focus&amp;ppt=Engine%20Assembly&amp;ppr=800&amp;pst=0232&amp;pgr=&amp;bty=WEB&amp;bem=&amp;bzp=&amp;ses=600000000000000000456918622" onclick="window.open(this.href,this.target,getPrm()); return false" target="_blank"><img border="0" src="/images/LiveChat_space.gif"/></a></td>]
      

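      Just to reiterate:

      I only want the text from these elements joined into one comma-separated string that I can reuse when I am ready to write a CSV file.

      Year, Part, Car Make, Car Model, Description, Miles, Part Grade, Stock#, Price, Dealer Name, Country, State, City, Phone

      • The first and last cells are the hardest: I can't figure out how to get their strings out into a list and back into a single string in the order shown above. Thanks!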
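The AttributeError quoted above arises because find_all() returns a ResultSet, which behaves like a list of tags, while get_text() is a method of a single tag. A minimal sketch of the difference follows; the one-row `html` string is a trimmed, hypothetical stand-in for the real page, and `html.parser` is used only so the sketch has no extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical one-row stand-in, trimmed from the sample page above.
html = (
    "<table><tr>"
    "<td>2005<br>Engine Assembly<br>Ford Focus</td><td>FC6555</td>"
    "</tr></table>"
)
soup = BeautifulSoup(html, "html.parser")

tds = soup.find("tr").find_all("td")  # a ResultSet: a list of Tag objects

# tds.get_text() would raise AttributeError, because the list has no such method.
# Call get_text() on each Tag instead; the separator keeps the pieces that
# were divided by <br> from being glued together.
cell_texts = [td.get_text(separator=" ", strip=True) for td in tds]
print(cell_texts)
```

Passing a separator is what avoids run-together text such as `2005Engine AssemblyFord Focus`, so no str.replace() repair step should be needed afterwards.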

1 Answer:

Answer 0: (score: 0)

If you want "2005 Engine Assembly Ford Focus" three times (as in your HTML example), you can do it like this:

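  1. Because the included HTML is not well structured, you should take the last table from it.

    table = soup.findAll('table')[-1]

  2. Next, get all rows except the first one (the header row).

    tr = table.findAll('tr')[1:]

    This will be a list, so you can loop over the rows afterwards.

  3. Finally, get the first td tag from each row. I will do this only for the first row.

    td = tr[0].td

  4. Now you will have something like this:

    <td>2005<br/>Engine Assembly<br/>Ford Focus</td>

Unfortunately, I don't know how you want to process this string. For example, you can use this approach:

    td = tr[0].td.children

You will get an iterator over all the text pieces and tags in the cell, which you can process as needed.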
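One possible way to finish the job is sketched below. It is not the answerer's code: the names `HTML`, `LOCATION`, and `row_to_fields` are hypothetical, and the sample row is a trimmed copy of one row from the question. The sketch assumes every result row has seven cells in the same layout, that the dealer cell always yields name, location, a link label, and phone in that order, and that the make is the first word of "Ford Focus" (which would break for multi-word makes).

```python
import csv
import io
import re

from bs4 import BeautifulSoup

# Trimmed, hypothetical copy of the results table; the real page has more attributes.
HTML = """
<table>
<tr><td>Year<br>Part<br>Model</td><td>Description</td><td>Miles</td>
    <td>Part Grade</td><td>Stock#</td><td>US<br>Price</td><td>Dealer Info</td></tr>
<tr><td>2005<br>Engine Assembly<br>Ford Focus</td><td>175-175</td>
    <td align=right>38,916</td><td align=center>A</td><td>FC6555</td>
    <td align=center>$1250</td>
    <td><a href="http://www.DonsSportcar.com">Don's Sportcar</a> USA-CO(Pueblo)
        <a href="#">Request_Quote</a> 800-332-3649
        <a href="#">Request_Insurance_Quote</a></td></tr>
</table>
"""

# Matches location text such as "USA-CO(Pueblo)".
LOCATION = re.compile(r"(\w+)-(\w+)\((.+)\)")


def row_to_fields(tr):
    """Turn one result <tr> into the 14 fields listed in the question."""
    tds = tr.find_all("td")
    # First cell: the <br> tags separate year, part, and "make model".
    year, part, make_model = list(tds[0].stripped_strings)
    make, model = make_model.split(None, 1)  # assumes a one-word make
    # Middle cells hold one value each.
    description, miles, grade, stock, price = (
        td.get_text(strip=True) for td in tds[1:6]
    )
    # Last cell: dealer name, location, a link label, then the phone number.
    dealer = list(tds[6].stripped_strings)
    country, state, city = LOCATION.match(dealer[1]).groups()
    return [year, part, make, model, description, miles, grade, stock,
            price, dealer[0], country, state, city, dealer[3]]


soup = BeautifulSoup(HTML, "html.parser")
rows = soup.find_all("table")[-1].find_all("tr")[1:]  # skip the header row

buffer = io.StringIO()
writer = csv.writer(buffer)
for tr in rows:
    writer.writerow(row_to_fields(tr))
print(buffer.getvalue())
```

Writing through csv.writer rather than joining with "," by hand is a deliberate choice here: values such as the mileage `38,916` or a description like `ZX4,2.0,EFI` contain commas themselves, and the csv module quotes them so the columns stay aligned.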