Question

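This is hard to explain.

I'm using BeautifulSoup to scrape some web pages, and I'm trying to organize the results into a list. I only grab elements on the page that have the class "text". Like so: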

import requests
from bs4 import BeautifulSoup, SoupStrainer

content = requests.get(url, verify=True)
soup = BeautifulSoup(content.text, 'lxml', parse_only=SoupStrainer('p'))
filtered_soup = soup.find_all("span", {"class":["text",
                                                "indent-1"]})
line_list = [line for line in filtered_soup]
#text_list = [line.get_text() for line in filtered_soup]

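This works well, but I'd also like to merge some of the items in the list. On the web page, some of the class="text..." items also have id="en...". Technically they should be parents of the other class="text..." elements, but the page isn't set up that way.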

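In my "line_list" list, there is one item that has both class="text..." and id="en...", followed by several items that have only class="text...", then another item that has both class="text..." and id="en...", and this pattern keeps repeating. Here is one way to think about it:

line_list = [A, a, a, a, B, b, b, C, c, c, c, c]

Now here is where it gets hard to explain. Let's say line_list[0] has both attributes, line_list[1-3] have only the "class" attribute, and line_list[4] has both again. I want to iterate over line_list and combine these items into one string. But when the iteration hits an item that has both "id" and "class" (i.e. line_list[4]), I want to start building a new string.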

Or, if anyone can think of a better way to do this, that would be great. I was going to try this:

line_string = ''.join(line_list)
split_list = line_string.split('id="en')

But the join command complains that line_list contains Tags, not strings.
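A small sketch of why that join fails and one way around it: str.join accepts only strings, so each item has to be converted to text first. FakeTag below is a hypothetical stand-in for a bs4 Tag, used only so the snippet runs without the library; with real tags the same idea would be ' '.join(tag.get_text() for tag in line_list).

```python
class FakeTag:
    """Hypothetical stand-in for a bs4 Tag (an assumption, not the real class)."""

    def __init__(self, text):
        self._text = text

    def get_text(self):
        return self._text


line_list = [FakeTag("And one"), FakeTag("And two")]

try:
    ''.join(line_list)  # the items are not str, so this raises TypeError
    join_error = None
except TypeError as exc:
    join_error = exc

# Converting each item to its text first makes the join work.
line_string = ' '.join(tag.get_text() for tag in line_list)
print(line_string)
```

Note that joining the text this way drops the id attribute, so splitting on 'id="en' afterwards would no longer find anything; the grouping has to happen before the text is extracted.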

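I was wondering whether this would be easier to do with a dictionary? For example, make the elements that have both "class" and "id" the keys, and the elements that have only "class" their values. It would look like this:

line_dic = {A: [a, a, a], B: [b, b], C: [c, c, c, c]}

Here is some example HTML, if anyone wants to play with it:

[example HTML not preserved]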

Any ideas would be great, guys. Thanks a million!

Answer 1

Not a cool one-liner, but the following should work:

text_list = []
current = []
for line in line_list:
    if line.get('id', '').startswith('en'):
        if current:
            text_list.append(' '.join(current))
            current = []
    current.append(line.text)
if current:
    text_list.append(' '.join(current))

For example, add this code after the following start of a sample test:

import bs4

content = '''
<span class='text' class='indent-1' id='en00'>And one</span>
<span class='text' class='indent-1'>And two</span>
<span class='text' class='indent-1'>And three</span>
<span class='text' class='indent-1' id='en01'>And four</span>
<span class='text' class='indent-1'>And five</span>
'''

# Name the parser explicitly so bs4 does not have to guess one.
soup = bs4.BeautifulSoup(content, 'html.parser')
filtered_soup = soup.find_all("span", {"class":["text", "indent-1"]})
line_list = [line for line in filtered_soup]

Then for x in text_list: print(x) will display

And one And two And three
And four And five

which seems to match the desired result.

Added: here is an arguably better-factored solution, although it ends up more verbose:

def has_id_en(elem):
    return elem.get('id', '').startswith('en')

def segment(sequence, is_head):
    current = []
    for x in sequence:
        if is_head(x):
            if current:
                yield current
                current = []
        current.append(x)
    if current:
        yield current

text_list = [' '.join(e.text for e in bunch)
             for bunch in segment(line_list, has_id_en)]

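At least this way the segment logic can be reused for similar tasks, where the items in the sequence need not be bs4 objects, and/or where the test for whether an item is the "head" of a subsequence differs from the one in this specific problem.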
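To illustrate that reuse, here is a minimal sketch that applies the same segmenting idea to plain strings and to numbers. The generator is restated (slightly condensed) so the snippet runs on its own; the sample data and the two head tests are assumptions chosen for illustration.

```python
def segment(sequence, is_head):
    """Yield lists of items, starting a new list at each item for which is_head is true."""
    current = []
    for x in sequence:
        if is_head(x) and current:
            yield current
            current = []
        current.append(x)
    if current:
        yield current


# Uppercase letters act as heads, mirroring the A/a/B/b picture in the question.
letters = ['A', 'a', 'a', 'a', 'B', 'b', 'b', 'C', 'c', 'c', 'c', 'c']
letter_groups = list(segment(letters, str.isupper))

# A different head test on a different kind of item: every 0 starts a new run.
numbers = [0, 1, 2, 0, 3]
number_groups = list(segment(numbers, lambda n: n == 0))

print(letter_groups)
print(number_groups)
```

Because segment only calls is_head on each item, nothing in it depends on BeautifulSoup.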

Answer 2

You can use itertools.groupby, like this:

import itertools

def has_id_en(elem):
    # return True if the elem has id="en..."
    ...

for is_id_en, elems in itertools.groupby(filtered_soup, has_id_en):
    if is_id_en:
        # this is the parent
        continue
    else:
        # do something with this group of elems
        ...
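A fuller sketch of the groupby approach, under stated assumptions: plain strings stand in for the tags, an uppercase string plays the role of an element with id="en...", and, unlike the skeleton above, the head is kept as the first part of each merged string, since that is what the question asks for.

```python
import itertools


def is_head(item):
    # Stand-in test (assumption): uppercase strings act as the id="en..." elements.
    return item.isupper()


items = ['A', 'a', 'a', 'a', 'B', 'b', 'b', 'C', 'c', 'c', 'c', 'c']

merged = []
for head_flag, group in itertools.groupby(items, is_head):
    group = list(group)
    if head_flag:
        # groupby puts consecutive heads in one group, so open a new list for each.
        merged.extend([head] for head in group)
    elif merged:
        # Followers belong to the most recent head.
        merged[-1].extend(group)
    else:
        # Followers that appear before any head form a group of their own.
        merged.append(group)

text_list = [' '.join(parts) for parts in merged]
print(text_list)
```

The extra branches exist because groupby only reports runs of equal key values; it does not by itself pair a head with the run that follows it.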

Answer 3

At first I considered using itertools.takewhile, but that has the problem that it swallows the "next" "separator" element. Instead, you can just use built-ins:

def has_both(x):
    return x.isupper() # or whatever your actual condition is

line_dic = {}
last = None
for x in line_list:
    if has_both(x):
        last = x
        line_dic[last] = []
    else:
        line_dic[last].append(x)

The result is {'A': ['a', 'a', 'a'], 'C': ['c', 'c', 'c', 'c'], 'B': ['b', 'b']}
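The takewhile problem mentioned at the start of this answer can be seen in a short sketch. The sample list is an assumption; the point is only that the element which ends the run is consumed and not handed back.

```python
import itertools

items = iter(['A', 'a', 'a', 'B', 'b'])

first = next(items)  # take the first head off the iterator
followers = list(itertools.takewhile(str.islower, items))

# takewhile had to read 'B' to learn that the run was over, and it does not put it back.
remaining = list(items)

print(first, followers, remaining)
```

After this runs, 'B' is in neither followers nor remaining, which is why the plain loop above is the simpler choice.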

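On Python 2.7 and later, you can also use collections.OrderedDict to preserve the order in which items were inserted into the dictionary. Also, if you expect to see "child" elements before any "parent" element, initialize line_dic to {None: []}.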
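Both suggestions can be combined in a minimal sketch. The sample list, including the leading 'x' that has no parent before it, is an assumption for illustration, and has_both is the same stand-in condition used above.

```python
from collections import OrderedDict


def has_both(x):
    return x.isupper()  # stand-in condition, as in the answer above


# 'x' is a child that appears before any parent.
line_list = ['x', 'A', 'a', 'a', 'B', 'b']

# Seeding the None key gives parentless children somewhere to go.
line_dic = OrderedDict([(None, [])])
last = None
for x in line_list:
    if has_both(x):
        last = x
        line_dic[last] = []
    else:
        line_dic[last].append(x)

print(list(line_dic.items()))
```

On Python 3.7 and later a plain dict also keeps insertion order, so OrderedDict matters mainly on older versions. Note that two parents with equal keys would overwrite each other in either kind of dictionary.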

Combine items in a list until an item containing specific text is found?

3 answers: