Iterating over a list of 73,033 elements with HTML tags and extracting the text from them

Time: 2016-07-21 13:54:04

Tags: python regex list

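I have a very long list of elements, 73,033 in total, and I want to extract the text from each of them. Every element in the list has the same structure (see the code below, if it helps) and looks like <div align="center" class="photocaption"> Author/Designer Carleton Varney with Jim Druckman </div>. The part I am interested in is the text Author/Designer Carleton Varney with Jim Druckman.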

Main code

NewSoups = [BeautifulSoup(NewR) for NewR in NewRs]
captions = [soup.find_all("div", class_ = "photocaption") for soup in NewSoups]
flattened_captions = []
for x in captions:
    for y in x:
        flattened_captions.append(y)

print(len(flattened_captions)) #73033

import re
results = [re.sub('<[^>]*>', '', y) for y in flattened_captions] #where the error comes from

Error

Traceback (most recent call last):
  File "picked.py", line 22, in <module>
    results = [re.sub('<[^>]*>', '', y) for y in flattened_captions]
  File "/opt/conda/lib/python2.7/re.py", line 155, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or buffer

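I would like to know whether there is a convenient way to iterate over a long list of <div ></div> elements. If not, what is the best way to extract all the text I want? Thank you very much.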

1 answer:

Answer 0 (score: 0)

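What I am posting is not the most elegant or efficient way to handle the question. As Welbog pointed out, BeautifulSoup itself provides functionality for extracting the text. However, when I got the error while posting the original question, I was simply curious about where it came from. It turns out that the items in flattened_captions are not strings: find_all returns BeautifulSoup Tag objects, and re.sub only accepts strings, hence the TypeError. The fix is very simple: convert each item to a string first, as follows.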

str_flattened_captions = [str(caption) for caption in flattened_captions]

gains = [re.sub('<[^>]*>', '', item) for item in str_flattened_captions]
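To see why the conversion matters, here is a minimal sketch that reproduces the failure without BeautifulSoup. The FakeTag class and the strip_tags helper are hypothetical stand-ins invented for illustration (they are not part of any library); FakeTag only mimics the fact that a parsed tag is an object whose str() yields its markup.

```python
import re


class FakeTag(object):
    """Hypothetical stand-in for a parsed tag: not a string, but str() gives its markup."""

    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        return self.markup


TAG_PATTERN = re.compile(r'<[^>]*>')


def strip_tags(item):
    # Hypothetical helper: convert to str first so the regex receives text, not an object.
    return TAG_PATTERN.sub('', str(item))


tag = FakeTag('<div align="center" class="photocaption"> '
              'Author/Designer Carleton Varney with Jim Druckman </div>')

# Passing the object directly fails, as in the traceback from the question.
try:
    re.sub(r'<[^>]*>', '', tag)
    raised = False
except TypeError:
    raised = True

cleaned = strip_tags(tag)
print(raised)
print(repr(cleaned))
```

The surrounding spaces inside the div survive the substitution, so calling .strip() on each result may be worthwhile if clean captions are needed.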

Test

print(gains[:5])
r Barbara Schorr ', ' Architect Joan Dineen with Alyson Liss ', ' Author/Designer Carleton Varney with Jim Druckman ', ' Designers Richard Cerrone, Lisa Hyman and Rhonda Eleish (front) in their room called "Holiday Nod To Nature" ']
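For comparison, the BeautifulSoup route mentioned above avoids the regex altogether. The following is only a sketch under assumptions: the sample html string is made up from two of the captions shown here, the bs4 package is assumed to be installed, and the "html.parser" backend is chosen simply because it ships with Python.

```python
from bs4 import BeautifulSoup

# Illustrative sample markup (an assumption, not the asker's real pages).
html = """
<div align="center" class="photocaption"> Author/Designer Carleton Varney with Jim Druckman </div>
<div align="center" class="photocaption"> Architect Joan Dineen with Alyson Liss </div>
<div class="other">not a caption</div>
"""

soup = BeautifulSoup(html, "html.parser")

# get_text() returns the text inside each tag; strip=True trims surrounding whitespace.
caption_texts = [
    div.get_text(strip=True)
    for div in soup.find_all("div", class_="photocaption")
]

print(caption_texts)
```

Applied to the question's data, the same comprehension could run over flattened_captions directly, since each item there is already a tag object.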