Adding links with BeautifulSoup: special characters get escaped

Time: 2019-11-13 21:31:14

Tags: python html beautifulsoup

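I am building an HTML calendar with Python's calendar package. Because the calendar only renders the dates as plain text, I want each date to be a link. I am using BeautifulSoup4 to find all the date elements and replace them with links. However, when I do this, my greater-than and less-than signs are turned into &gt; and &lt;. I even tried forcing it with unescape from Python's html package, and it does the same thing.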

import calendar
from bs4 import BeautifulSoup
cal = calendar.HTMLCalendar(calendar.SUNDAY)
soup = BeautifulSoup(cal.formatmonth(2019, 11))

This produces:

<html>
 <body>
  <table border="0" cellpadding="0" cellspacing="0" class="month">
   <tr>
    <th class="month" colspan="7">
     November 2019
    </th>
   </tr>
   <tr>
    <th class="sun">
     Sun
    </th>
    <th class="mon">
     Mon
    </th>
    <th class="tue">
     Tue
    </th>
    <th class="wed">
     Wed
    </th>
    <th class="thu">
     Thu
    </th>
    <th class="fri">
     Fri
    </th>
    <th class="sat">
     Sat
    </th>
   </tr>
   <tr>
    <td class="noday">
    </td>
    <td class="noday">
    </td>
    <td class="noday">
    </td>
    <td class="noday">
    </td>
    <td class="noday">
    </td>
    <td class="fri">
     1
    </td>
    <td class="sat">
     2
    </td>
   </tr>
   <tr>
    <td class="sun">
     3
    </td>
    <td class="mon">
     4
    </td>
    <td class="tue">
     5
    </td>
    <td class="wed">
     6
    </td>
    <td class="thu">
     7
    </td>
    <td class="fri">
     8
    </td>
    <td class="sat">
     9
    </td>
   </tr>
   <tr>
    <td class="sun">
     10
    </td>
    <td class="mon">
     11
    </td>
    <td class="tue">
     12
    </td>
    <td class="wed">
     13
    </td>
    <td class="thu">
     14
    </td>
    <td class="fri">
     15
    </td>
    <td class="sat">
     16
    </td>
   </tr>
   <tr>
    <td class="sun">
     17
    </td>
    <td class="mon">
     18
    </td>
    <td class="tue">
     19
    </td>
    <td class="wed">
     20
    </td>
    <td class="thu">
     21
    </td>
    <td class="fri">
     22
    </td>
    <td class="sat">
     23
    </td>
   </tr>
   <tr>
    <td class="sun">
     24
    </td>
    <td class="mon">
     25
    </td>
    <td class="tue">
     26
    </td>
    <td class="wed">
     27
    </td>
    <td class="thu">
     28
    </td>
    <td class="fri">
     29
    </td>
    <td class="sat">
     30
    </td>
   </tr>
  </table>
 </body>
</html>

So here I try to replace the text strings with links:

for elem in soup.find_all('td', class_=['sun', 'mon', 'tues', 'wed', 'thu', 'fri', 'sat']):
    elem.string = '<a href="{}.html">'.format(elem.string) + elem.string + '</a>'

Which produces:

<bound method Tag.prettify of <html><body><table border="0" cellpadding="0" cellspacing="0" class="month">
<tr><th class="month" colspan="7">November 2019</th></tr>
<tr><th class="sun">Sun</th><th class="mon">Mon</th><th class="tue">Tue</th><th class="wed">Wed</th><th class="thu">Thu</th><th class="fri">Fri</th><th class="sat">Sat</th></tr>
<tr><td class="noday"> </td><td class="noday"> </td><td class="noday"> </td><td class="noday"> </td><td class="noday"> </td><td class="fri">&lt;a href="1.html"&gt;1&lt;/a&gt;</td><td class="sat">&lt;a href="2.html"&gt;2&lt;/a&gt;</td></tr>
<tr><td class="sun">&lt;a href="3.html"&gt;3&lt;/a&gt;</td><td class="mon">&lt;a href="4.html"&gt;4&lt;/a&gt;</td><td class="tue">5</td><td class="wed">&lt;a href="6.html"&gt;6&lt;/a&gt;</td><td class="thu">&lt;a href="7.html"&gt;7&lt;/a&gt;</td><td class="fri">&lt;a href="8.html"&gt;8&lt;/a&gt;</td><td class="sat">&lt;a href="9.html"&gt;9&lt;/a&gt;</td></tr>
<tr><td class="sun">&lt;a href="10.html"&gt;10&lt;/a&gt;</td><td class="mon">&lt;a href="11.html"&gt;11&lt;/a&gt;</td><td class="tue">12</td><td class="wed">&lt;a href="13.html"&gt;13&lt;/a&gt;</td><td class="thu">&lt;a href="14.html"&gt;14&lt;/a&gt;</td><td class="fri">&lt;a href="15.html"&gt;15&lt;/a&gt;</td><td class="sat">&lt;a href="16.html"&gt;16&lt;/a&gt;</td></tr>
<tr><td class="sun">&lt;a href="17.html"&gt;17&lt;/a&gt;</td><td class="mon">&lt;a href="18.html"&gt;18&lt;/a&gt;</td><td class="tue">19</td><td class="wed">&lt;a href="20.html"&gt;20&lt;/a&gt;</td><td class="thu">&lt;a href="21.html"&gt;21&lt;/a&gt;</td><td class="fri">&lt;a href="22.html"&gt;22&lt;/a&gt;</td><td class="sat">&lt;a href="23.html"&gt;23&lt;/a&gt;</td></tr>
<tr><td class="sun">&lt;a href="24.html"&gt;24&lt;/a&gt;</td><td class="mon">&lt;a href="25.html"&gt;25&lt;/a&gt;</td><td class="tue">26</td><td class="wed">&lt;a href="27.html"&gt;27&lt;/a&gt;</td><td class="thu">&lt;a href="28.html"&gt;28&lt;/a&gt;</td><td class="fri">&lt;a href="29.html"&gt;29&lt;/a&gt;</td><td class="sat">&lt;a href="30.html"&gt;30&lt;/a&gt;</td></tr>
</table>
</body></html>>

How do I get BeautifulSoup4 to actually insert the links?

Expected result:

<tr>
    <td class="sun">
     <a href="3.index.html">3</a>
    </td>
<<<etc>>>

1 answer:

Answer 0 (score: 0)

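Building on my comment: you need to insert a new tag rather than modify the text of the <td>. A string assigned to .string is treated as plain text, so BeautifulSoup escapes its < and > on output: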

for elem in soup.find_all('td', class_=['sun', 'mon', 'tues', 'wed', 'thu', 'fri', 'sat']):
    # Grab the current element text
    # I got weird behavior where it would always use the last element's string if I didn't do this
    text = elem.string
    elem.string = ''

    # Create new tag using "elem"'s text
    new = soup.new_tag('a', href="{}.html".format(text))
    new.string = text

    # Insert <a> tag
    elem.append(new)

This produces:

<html>
 <body>
  <table border="0" cellpadding="0" cellspacing="0" class="month">
   <tr>
    <th class="month" colspan="7">
     November 2019
    </th>
   </tr>
   <tr>
    <th class="sun">
     Sun
    </th>
    <th class="mon">
     Mon
    </th>
    <th class="tue">
     Tue
    </th>
    <th class="wed">
     Wed
    </th>
    <th class="thu">
     Thu
    </th>
    <th class="fri">
     Fri
    </th>
    <th class="sat">
     Sat
    </th>
   </tr>
   <tr>
    <td class="noday">
    </td>
    <td class="noday">
    </td>
    <td class="noday">
    </td>
    <td class="noday">
    </td>
    <td class="noday">
    </td>
    <td class="fri">
     <a href="1.html">
      1
     </a>
    </td>
    <td class="sat">
     <a href="2.html">
      2
     </a>
    </td>
   </tr>
   <tr>
    <td class="sun">
     <a href="3.html">
      3
     </a>
    </td>
    <td class="mon">
     <a href="4.html">
      4
     </a>
    </td>
    <td class="tue">
     5
    </td>
    <td class="wed">
     <a href="6.html">
      6
     </a>
    </td>
    <td class="thu">
     <a href="7.html">
      7
     </a>
    </td>
    <td class="fri">
     <a href="8.html">
      8
     </a>
    </td>
    <td class="sat">
     <a href="9.html">
      9
     </a>
    </td>
   </tr>
   <tr>
    <td class="sun">
     <a href="10.html">
      10
     </a>
    </td>
    <td class="mon">
     <a href="11.html">
      11
     </a>
    </td>
    <td class="tue">
     12
    </td>
    <td class="wed">
     <a href="13.html">
      13
     </a>
    </td>
    <td class="thu">
     <a href="14.html">
      14
     </a>
    </td>
    <td class="fri">
     <a href="15.html">
      15
     </a>
    </td>
    <td class="sat">
     <a href="16.html">
      16
     </a>
    </td>
   </tr>
   <tr>
    <td class="sun">
     <a href="17.html">
      17
     </a>
    </td>
    <td class="mon">
     <a href="18.html">
      18
     </a>
    </td>
    <td class="tue">
     19
    </td>
    <td class="wed">
     <a href="20.html">
      20
     </a>
    </td>
    <td class="thu">
     <a href="21.html">
      21
     </a>
    </td>
    <td class="fri">
     <a href="22.html">
      22
     </a>
    </td>
    <td class="sat">
     <a href="23.html">
      23
     </a>
    </td>
   </tr>
   <tr>
    <td class="sun">
     <a href="24.html">
      24
     </a>
    </td>
    <td class="mon">
     <a href="25.html">
      25
     </a>
    </td>
    <td class="tue">
     26
    </td>
    <td class="wed">
     <a href="27.html">
      27
     </a>
    </td>
    <td class="thu">
     <a href="28.html">
      28
     </a>
    </td>
    <td class="fri">
     <a href="29.html">
      29
     </a>
    </td>
    <td class="sat">
     <a href="30.html">
      30
     </a>
    </td>
   </tr>
  </table>
 </body>
</html>
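
As a side note on why the original attempt printed &lt; and &gt;, here is a small standalone sketch comparing the two approaches on a single cell. It assumes the built-in 'html.parser' backend and a hand-written one-cell fragment, neither of which comes from the question. The same reasoning likely explains why html.unescape did not help: it returns another plain string, which is escaped again once it is assigned to .string.

```python
from bs4 import BeautifulSoup

# A one-cell fragment stands in for the calendar table (illustrative input only).
soup = BeautifulSoup('<td class="sun">3</td>', 'html.parser')
td = soup.td

# Assigning a str to .string stores it as text, so markup characters
# are escaped when the tree is serialized.
td.string = '<a href="3.html">3</a>'
escaped = str(td)

# Creating a Tag with new_tag() and appending it adds a real <a> element.
td.clear()
link = soup.new_tag('a', href='3.html')
link.string = '3'
td.append(link)
linked = str(td)

print(escaped)
print(linked)
```

The first print shows the escaped text seen in the question, the second shows a real link inside the cell.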

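Your expected output includes "index" in the href URL, but based on your question and sample HTML I am not sure where you expect it to come from. If you need it, you can put it into the format call.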
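
Two details can be sketched together here. First, the Tuesday cells (5, 12, 19, 26) stay unlinked in the output above, most likely because the class list contains 'tues' while HTMLCalendar emits class="tue". Second, the "index" part can go into the format string. The function name link_days and its href_template parameter are hypothetical names chosen for illustration, and 'html.parser' is an assumed backend (it does not add the <html>/<body> wrapper shown above).

```python
import calendar
from bs4 import BeautifulSoup


def link_days(html, href_template="{}.html"):
    """Wrap each day number in an <a> tag (hypothetical helper, not from the answer)."""
    soup = BeautifulSoup(html, "html.parser")
    # 'tue' (not 'tues') matches the class that HTMLCalendar writes.
    day_classes = ["sun", "mon", "tue", "wed", "thu", "fri", "sat"]
    for cell in soup.find_all("td", class_=day_classes):
        text = cell.get_text(strip=True)
        cell.clear()  # drop the plain-text day number
        link = soup.new_tag("a", href=href_template.format(text))
        link.string = text
        cell.append(link)
    return str(soup)


cal = calendar.HTMLCalendar(calendar.SUNDAY)
linked = link_days(cal.formatmonth(2019, 11), "{}.index.html")
print(linked)
```

With the default template the links match the answer's "N.html" form; passing "{}.index.html" gives the form from the expected result in the question.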