Question

I am scraping transcripts of the presidential debates. I noticed that when my scraper extracts the HTML elements, it never extracts the paragraph closing tag (</p>).

For example

Inspecting the source in the browser

</p>

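I think one of two things is happening:

urllib somehow strips the closing tags (only for paragraphs; everything else is fine)
The original source does not contain the closing tags, and the browser is filling them in.

How can I determine which one it is, and then correct for it?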

Answer 1

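Can you inspect the actual packets Chrome receives? In some cases, Chrome detects and corrects small omissions like this in order to render the page, even when they are not in the packets. My guess is that Chrome is fixing this and the actual source is malformed.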
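One way to check this without a packet capture is to count the paragraph tags in the raw text that urllib hands back, before any parser touches it. The sketch below is only an illustration: the helper names `count_paragraph_tags` and `fetch_raw_html` are made up for this example, and the regular expressions are a rough heuristic, not a full HTML parser. If the closing count is zero (or far below the opening count) in the raw response, the server is most likely sending the page without `</p>` tags, which HTML permits, and the browser's element inspector is showing the repaired DOM.

```python
import re
import urllib.request


def count_paragraph_tags(html: str) -> tuple[int, int]:
    """Return (opening, closing) counts of <p> tags in raw HTML text."""
    # "<p" must be followed by whitespace or ">" so <pre>, <param>, etc. are skipped.
    opening = len(re.findall(r"<p(?=[\s>])", html, flags=re.IGNORECASE))
    closing = len(re.findall(r"</p\s*>", html, flags=re.IGNORECASE))
    return opening, closing


def fetch_raw_html(url: str) -> str:
    """Download a page as text without parsing it, so nothing is added or removed."""
    with urllib.request.urlopen(url) as response:
        # Fall back to UTF-8 when the server does not declare a charset (assumption).
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Small offline demonstration on a hand-written snippet with no closing tags.
sample = "<p>First paragraph<p class='x'>Second paragraph<pre>code</pre>"
print(count_paragraph_tags(sample))
```

To run it against the real page, call `count_paragraph_tags(fetch_raw_html(url))` with the transcript URL and compare the two numbers. If the closing tags turn out to be missing at the source, a lenient HTML parser that closes paragraphs implicitly is probably a better fix than patching the text by hand.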

urllib returns HTML but no closing paragraph tags
