Question

I am using BeautifulSoup in Python to scrape some data from a table on a website. The soup object looks wrong. My code is as follows:

import requests
from bs4 import BeautifulSoup

url = r'http://www.the-numbers.com/movie/budgets/all'
source_code = requests.get(url)
text = source_code.text
soup = BeautifulSoup(text, "lxml")

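When I look at the tags in the soup, the result looks wrong. I think I found the part that causes the problem. The original source for that part looks like this: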

<tr><td class="data">81</td>
<td><a href="/box-office-chart/daily/2010/05/07">5/7/2010</a></td>
<td><b><a href="/movie/Iron-Man-2#tab=summary">Iron Man 2</a></td>
<td class="data">$170,000,000</td>
<td class="data">$312,128,345</td>
<td class="data">$623,256,345</td>
<tr>

But printing that part from the soup gives:

<tr><td class="data">81</td>
<td><a href="/box-office-chart/daily/2010/05/07">5/7/2010</a></td>
<td><b><a href="/movie">/ I r o n - M a n - 2 # t a b = s u m m a r y " 
&gt;       I r o n   M a n   2 / a &gt; / t d &gt; 
t d   c l a s s = " d a t a " &gt; $ 1 7 0 , 0 0 0 , 0 0 0 / t d &gt; 
t d   c l a s s = " d a t a " &gt; $ 3 1 2 , 1 2 8 , 3 4 5 / t d &gt; 
t d   c l a s s = " d a t a " &gt; $ 6 2 3 , 2 5 6 , 3 4 5 / t d &gt; 
t r &gt;

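It looks like there is an extra quotation mark, and after it BeautifulSoup no longer recognizes any tags. How can I fix this? I tried both Python's built-in HTML parser and lxml. They give the same result.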

Answer 1

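After trying many things, this is what I found. I tried removing the problematic part of the HTML, but the problem then showed up at the same position. So I think it may be a length issue. I don't know whether there is some limit on the HTML that BS can parse.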

I found a similar question: BeautifulSoup, where are you putting my HTML?

Installing and using the 'html5lib' parser worked.

soup = BeautifulSoup(text, "html5lib")

With the new result, all the other tags show up in the soup.
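One way to check whether a parser is dropping tags is to compare the number of rows in the raw markup with the number each parser actually finds. The following is only a minimal sketch: the helper names `count_raw_rows` and `count_rows_by_parser` and the `sample` table are made up for illustration and are not part of the original page or of BeautifulSoup. Parsers that are not installed are skipped.

```python
import re

from bs4 import BeautifulSoup, FeatureNotFound


def count_raw_rows(html):
    """Count opening <tr> tags directly in the raw markup."""
    return len(re.findall(r"<tr\b", html, flags=re.IGNORECASE))


def count_rows_by_parser(html, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: number of <tr> tags found in the soup}.

    Parsers that are not installed are left out of the result.
    """
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # bs4 raises this when the requested parser is unavailable.
            continue
        counts[name] = len(soup.find_all("tr"))
    return counts


# Hypothetical stand-in for the downloaded page text.
sample = (
    "<table>"
    "<tr><td class='data'>1</td><td>Example Movie</td></tr>"
    "<tr><td class='data'>2</td><td>Iron Man 2</td></tr>"
    "</table>"
)

print(count_raw_rows(sample))
print(count_rows_by_parser(sample))
```

On the real page you would pass `text` instead of `sample`. A parser whose count is far below the raw count is probably losing tags partway through the document.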
