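I have a very long list of elements, 73,033 in total, and I want to extract the text from each one. Every element in the list has the same structure (see the code below if it helps) and looks like <div align="center" class="photocaption"> Author/Designer Carleton Varney with Jim Druckman </div>. The part I am interested in is the text Author/Designer Carleton Varney with Jim Druckman.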
Main code
from bs4 import BeautifulSoup
import re

# NewRs is defined earlier in the script (not shown)
NewSoups = [BeautifulSoup(NewR) for NewR in NewRs]
captions = [soup.find_all("div", class_="photocaption") for soup in NewSoups]

# Flatten the per-page result lists into a single list
flattened_captions = []
for x in captions:
    for y in x:
        flattened_captions.append(y)
print(len(flattened_captions))  # 73033

results = [re.sub('<[^>]*>', '', y) for y in flattened_captions]  # where the error comes from
Error
Traceback (most recent call last):
  File "picked.py", line 22, in <module>
    results = [re.sub('<[^>]*>', '', y) for y in flattened_captions]
  File "/opt/conda/lib/python2.7/re.py", line 155, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or buffer
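I am wondering whether there is a convenient way to iterate over this long list of <div></div> elements. If not, what is the best way to extract all the text I want? Thank you very much.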
Answer 0 (score: 0)
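What I am posting here is not the most elegant or efficient way to handle the question. As Welbog pointed out, BeautifulSoup itself provides a way to extract the text. However, since I got the error when posting the original question, I was simply curious where it came from. It turns out that the items in flattened_captions are not strings (find_all returns BeautifulSoup Tag objects), while re.sub expects a string. The fix is quite simple: convert each item to a string first, as follows.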
str_flattened_captions = [str(caption) for caption in flattened_captions]
gains = [re.sub('<[^>]*>', '', item) for item in str_flattened_captions]
Test
print(gains[:5])
r Barbara Schorr ', ' Architect Joan Dineen with Alyson Liss ', ' Author/Designer Carleton Varney with Jim Druckman ', ' Designers Richard Cerrone, Lisa Hyman and Rhonda Eleish (front) in their room called "Holiday Nod To Nature" ']
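For comparison, here is a minimal sketch of the BeautifulSoup-native route mentioned above, which skips the regex entirely by calling get_text() on each Tag. The pages list and the extract_captions helper are hypothetical stand-ins for illustration (the real input would be NewRs), and the "html.parser" argument is an assumption about which parser to use.

```python
from bs4 import BeautifulSoup

# Hypothetical sample pages standing in for the real NewRs data.
pages = [
    '<div align="center" class="photocaption"> Author/Designer Carleton Varney with Jim Druckman </div>',
    '<div class="photocaption"> Architect Joan Dineen with <b>Alyson Liss</b> </div>'
    '<div class="other">not a caption</div>',
]


def extract_captions(html_pages):
    """Return the stripped text of every div.photocaption in the given pages."""
    texts = []
    for html in html_pages:
        soup = BeautifulSoup(html, "html.parser")
        for div in soup.find_all("div", class_="photocaption"):
            # get_text() joins all text inside the tag, including nested tags;
            # strip() removes the padding spaces around the caption.
            texts.append(div.get_text().strip())
    return texts


captions = extract_captions(pages)
print(captions)
```

This also avoids the leading and trailing spaces visible in the regex output, since each caption is stripped as it is collected.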