Question

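I have a web page with a lot of text on it, and I want to extract that text and write it to a file. I'm trying to use BeautifulSoup, but I'm not sure it can easily do what I want. Here's the story: I believe the text I want to extract lies between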

<td colspan="2" class="msg_text_cell" style="text-align: justify; background-color: rgb(212, 225, 245); background-image: none; background-repeat: repeat-x;" rowspan="2" valign="top" width="100%">

and

<p></p><div style="overflow: hidden; width: 550px; height: 48px;">

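All I want to do is select the lines of text in between, not including the opening and closing markup above. Note that the opening HTML above is on a line by itself, but the closing markup sometimes appears right after the last text I want rather than on a new line.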

I can't see how to do this with BeautifulSoup, but that may just be my unfamiliarity getting in the way.

Also, the text I want to extract appears about 50 times on the page, so I'd like each block separated by '+++++++++++++++++++++++++++++++++ +' to make the output easier to read.

Thanks a lot for your help.

Answer 1

If you're comfortable with Ruby, I can point you to Nokogiri, an amazing screen-scraping gem.

Answer 2

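Simply put, you can iterate over the DOM elements that contain the text you want and extract it that way... with jQuery, something like $('td.msg_text_cell').each(function(idx, el) { idx will be the index into the array of jQuery objects found by the selector above, which gets all tds with class msg_text_cell ... })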

You can also do it in native JS, so don't think I'm pushing jQuery... it's just the framework I'm more familiar with.
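Since the question asks about BeautifulSoup, here is a rough Python counterpart of the same select-and-iterate idea. It is only a sketch: the sample markup and the `texts` variable are made up for illustration, and `select()` is BeautifulSoup's CSS-selector method, playing the role of `$('td.msg_text_cell')`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page: two matching cells and one that should be skipped.
html = """
<table>
  <tr><td class="msg_text_cell">First message</td></tr>
  <tr><td class="other">Ignored</td></tr>
  <tr><td class="msg_text_cell">Second message</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# select() accepts a CSS selector, much like $('td.msg_text_cell') in jQuery.
texts = [td.get_text(strip=True) for td in soup.select("td.msg_text_cell")]

# enumerate() supplies the index that jQuery's .each() passes as idx.
for idx, text in enumerate(texts):
    print(idx, text)
```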

Answer 3

You can do this easily with BeautifulSoup:

from bs4 import BeautifulSoup as bs

# Sample markup from the question, with a <p> holding the text of interest.
html = "<td colspan=\"2\" class=\"msg_text_cell\" style=\"text-align: justify; background-color: rgb(212, 225, 245); background-image: none; background-repeat: repeat-x;\" rowspan=\"2\" valign=\"top\" width=\"100%\"> <p>The text</p><div style=\"overflow: hidden; width: 550px; height: 48px;\">"
soup = bs(html, "html.parser")  # name the parser explicitly
soup.find('p')

You can now find the text inside the <p> tag, like this:

Output: <p>The text</p>

You can now add a loop over the matches, adjusting the variable as you go.
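One way such a loop might look, as a minimal sketch: `find_all` returns every match rather than only the first, so it can be iterated directly. The two-paragraph sample string and the `paragraphs` name are assumptions for illustration, not part of the original page.

```python
from bs4 import BeautifulSoup

# Hypothetical cell containing two paragraphs.
html = '<td class="msg_text_cell"><p>First text</p><p>Second text</p></td>'
soup = BeautifulSoup(html, "html.parser")

# find() returns only the first <p>; find_all() returns all of them.
paragraphs = [p.get_text() for p in soup.find_all("p")]

for text in paragraphs:
    print(text)
```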

You can then save the results to a file.

import csv

with open("data.csv", "w", newline="") as tW:
    writer = csv.writer(tW, delimiter=",")
    writer.writerow(["Ptag"])  # header row
    # One row per <p> tag found in the parsed document.
    for p in soup.find_all("p"):
        writer.writerow([p.get_text()])

Extracting text from an HTML file?
