Question

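I'm just starting to learn Python and would like to try using BeautifulSoup to extract the elements from the HTML below.

This HTML comes from a call recording system, which logs each call's local time, UTC time, call duration, called number and name, calling number and name, and so on. There are usually hundreds of these entries.

What I'm trying to do is extract the elements and print each entry as a single comma-separated line, so it can be compared against the call detail records from the call manager. This helps verify that every call was recorded and none were missed.

I believe BeautifulSoup is the right tool for this. Can anyone point me in the right direction?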

<tbody>
   <tr class="formRowLight">

<td class="formRowLight" >24/10/16<br>16:24:47</td>
<td class="formRowLight" >24/10/16 07:24:47</td>
<td class="formRowLight" >00:45</td>
<td class="formRowLight" >31301</td>
<td class="formRowLight" >Joe Smith</td>
<td class="formRowLight" >31111</td>
<td class="formRowLight" >Jane Doe</td>
<td class="formRowLight" >N/A</td>
<td class="formRowLight" >1432875648934</td>
<td align="center" class"formRowLight">&nbsp;</td>

   <tr class="formRowLight">

<td class="formRowLight" >24/10/16<br>17:33:02</td>
<td class="formRowLight" >24/10/16 08:33:02</td>
<td class="formRowLight" >00:58</td>
<td class="formRowLight" >35664</td>
<td class="formRowLight" >Billy Bob</td>
<td class="formRowLight" >227045665</td>
<td class="formRowLight" >James Dean</td>
<td class="formRowLight" >N/A</td>
<td class="formRowLight" >9934959586849</td>
<td align="center" class"formRowLight">&nbsp;</td>
</tr>
</tbody>

Answer 1

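pandas.read_html() would make things much easier: it converts tabular data from an HTML table into a DataFrame, which you can later dump to CSV if needed.

Here is some sample code to get you started: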

from io import StringIO

import pandas as pd

data = """
<table>
    <thead>
        <tr>
            <th>Date</th>
            <th>Name</th>
            <th>ID</th>
        </tr>
    </thead>
    <tbody>
        <tr class="formRowLight">
            <td class="formRowLight">24/10/16<br>16:24:47</td>
            <td class="formRowLight">Joe Smith</td>
            <td class="formRowLight">1432875648934</td>
        </tr>

        <tr class="formRowLight">
            <td class="formRowLight">24/10/16<br>17:33:02</td>
            <td class="formRowLight">Billy Bob</td>
            <td class="formRowLight">9934959586849</td>
        </tr>
    </tbody>
</table>"""

# Recent pandas versions expect a file-like object rather than a raw HTML string.
# read_html() returns a list of DataFrames, one per <table>; take the first.
df = pd.read_html(StringIO(data))[0]
print(df.to_csv(index=False))

This prints:

Date,Name,ID
24/10/1616:24:47,Joe Smith,1432875648934
24/10/1617:33:02,Billy Bob,9934959586849

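FYI, read_html() relies on an HTML parsing library under the hood: by default it tries lxml first and falls back to BeautifulSoup (with html5lib) if that fails.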
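The table in the question has no header row, and the `<br>` between the date and the time vanishes in the output above (hence `24/10/1616:24:47`). Here is a minimal sketch of one way around both, assuming it is acceptable to replace `<br>` with a space before parsing and to name the columns by hand. The column names are my own labels, not something the recording system provides, and the rows are cut down to three cells for brevity:

```python
from io import StringIO

import pandas as pd

# Two shortened rows in the shape of the question's table (no <thead>).
html = """
<table>
  <tbody>
    <tr class="formRowLight">
      <td class="formRowLight">24/10/16<br>16:24:47</td>
      <td class="formRowLight">00:45</td>
      <td class="formRowLight">Joe Smith</td>
    </tr>
    <tr class="formRowLight">
      <td class="formRowLight">24/10/16<br>17:33:02</td>
      <td class="formRowLight">00:58</td>
      <td class="formRowLight">Billy Bob</td>
    </tr>
  </tbody>
</table>"""

# Swap <br> for a space so the date and the time stay separated.
cleaned = html.replace("<br>", " ")

# With no <th> cells, the columns come back numbered 0, 1, 2, ...
df = pd.read_html(StringIO(cleaned))[0]
df.columns = ["local_time", "duration", "called_name"]  # hypothetical labels

csv_text = df.to_csv(index=False)
print(csv_text)
```

If the real pages write the tag as `<br/>` or `<BR>`, the plain string replacement would need adjusting.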

Answer 2

from urllib.request import urlopen

from bs4 import BeautifulSoup

url = "your url"  # placeholder: put the address of the recordings page here
response = urlopen(url)
soup = BeautifulSoup(response, "html.parser")

mylist = []
rows = soup.find_all('tr', {"class": "formRowLight"})
for row in rows:
    # Text of the first <td class="formRowLight"> that follows this <tr>
    text = row.find_next('td', {"class": "formRowLight"}).text
    mylist.append(text)

print(mylist)

However, you will need to edit this code to avoid duplicate entries.
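One way to deal with the duplicates mentioned above is to remember which rows have already been collected. This is a minimal sketch, assuming Python 3 and the `bs4` package. The inline sample HTML and the names `seen` and `unique_rows` are illustrative only:

```python
from bs4 import BeautifulSoup

# Sample data: the third row repeats the first one.
html = """
<table>
  <tr class="formRowLight"><td class="formRowLight">31301</td><td class="formRowLight">Joe Smith</td></tr>
  <tr class="formRowLight"><td class="formRowLight">35664</td><td class="formRowLight">Billy Bob</td></tr>
  <tr class="formRowLight"><td class="formRowLight">31301</td><td class="formRowLight">Joe Smith</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")

seen = set()        # rows already collected
unique_rows = []    # rows in their original order, without repeats
for row in soup.find_all("tr", {"class": "formRowLight"}):
    cells = tuple(td.get_text(strip=True) for td in row.find_all("td"))
    if cells in seen:
        continue
    seen.add(cells)
    unique_rows.append(cells)

for cells in unique_rows:
    print(",".join(cells))
```

Using the whole row as the key means two calls from the same number at different times are both kept; only rows that match in every cell are dropped.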

Answer 3

Yes, BeautifulSoup is a good tool for this problem. Something to get you started:

from bs4 import BeautifulSoup

with open("my_log.html") as log_file:
    html = log_file.read()
soup = BeautifulSoup(html, "html.parser")
# "html.parser" ships with Python; you could pass another parser such as "lxml".
# Without a parser argument, bs4 warns you and selects one automatically.

table_rows = soup.find_all("tr") #get list of all <tr> tags
for row in table_rows:
    table_cells = row.find_all("td") #get list all <td> tags in row
    joined_text = ",".join(cell.get_text() for cell in table_cells)
    print(joined_text)

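That said, pandas' read_html may make this more seamless, as another answer to this question points out. pandas is arguably the better hammer here, but learning to use BeautifulSoup will also give you the skills to scrape all kinds of HTML in the future.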
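Two details tend to come up with real data: `get_text()` drops the `<br>`, so the date and the time run together, and a cell could in principle contain a comma. Below is a minimal sketch, assuming Python 3 and `bs4`, that passes a separator to `get_text()` and lets the standard `csv` module handle the quoting. The inline HTML is a shortened stand-in for the log file, and the name "Smith, Joe" is made up to show the quoting:

```python
import csv
import io

from bs4 import BeautifulSoup

html = """
<table><tbody>
<tr class="formRowLight">
<td class="formRowLight">24/10/16<br>16:24:47</td>
<td class="formRowLight">00:45</td>
<td class="formRowLight">Smith, Joe</td>
</tr>
</tbody></table>"""

soup = BeautifulSoup(html, "html.parser")

buffer = io.StringIO()  # swap for an open file to write to disk
writer = csv.writer(buffer, lineterminator="\n")
for row in soup.find_all("tr"):
    # The " " separator puts a space where <br> was; strip=True trims each piece.
    cells = [td.get_text(" ", strip=True) for td in row.find_all("td")]
    writer.writerow(cells)

csv_text = buffer.getvalue()
print(csv_text, end="")
```

The `csv` writer wraps any field containing a comma in double quotes, which a plain `",".join(...)` would not do.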

Answer 4

First get a list of HTML strings by following "Convert BeautifulSoup4 HTML Table to a list of lists, iterating over each Tag elements".

Then do the following, which gets all of the element values you want:

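# `soup` and `html_list` come from the previous step; each item is passed to
# select(), so it has to be a valid CSS selector.
for element in html_list:
    output = soup.select(element)[0].text
    print("%s ," % output)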

This should give you what you're after.

Hope that helps!
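For reference, turning the table into a list of lists with CSS selectors could look roughly like this. It is a minimal sketch, assuming Python 3 and `bs4`; the sample HTML and the variable names are mine and are not taken from the linked answer:

```python
from bs4 import BeautifulSoup

html = """
<table><tbody>
<tr class="formRowLight">
<td class="formRowLight">24/10/16<br>16:24:47</td>
<td class="formRowLight">31301</td>
<td class="formRowLight">Joe Smith</td>
</tr>
<tr class="formRowLight">
<td class="formRowLight">24/10/16<br>17:33:02</td>
<td class="formRowLight">35664</td>
<td class="formRowLight">Billy Bob</td>
</tr>
</tbody></table>"""

soup = BeautifulSoup(html, "html.parser")

# One inner list per <tr class="formRowLight">, one string per <td> in it.
table = [
    [cell.get_text(" ", strip=True) for cell in row.select("td")]
    for row in soup.select("tr.formRowLight")
]

for row in table:
    print(",".join(row))
```

`select()` takes a CSS selector string, so `"tr.formRowLight"` matches only rows carrying that class.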

How to extract elements from HTML using BeautifulSoup
