Question

I have a website with the following layout:

<p> section1 </p>
<p> section2 </p>
<pre> section 3 <p> section 4 </p> </pre>
<p> section 5 </p>
<pre> section 6 </pre>
<form> 
<p> section 7 </p>
<textarea> <p> section 8 </p></textarea>

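I want to get all of the text up to and including section 6 (everything before the form section). However, I can't use findAll('p') because it also returns everything inside the form. Other websites have a similar layout, but with fewer sections before the form. How can I use BeautifulSoup to get every section before the form? Thanks.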

Answer 1

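Well, you can use the find_all_previous() method. Select the form element, then get all of the p tags that come before it. Note that the results are returned in reverse document order, starting with the p tag nearest to the form.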

>>> a = soup.form
>>> a.find_all_previous("p")
[<p> section 5 </p>, <p> section 4 </p>, <p> section2 </p>, <p> section1 </p>]

The code above can be shortened to:

soup.form.find_all_previous("p")
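To turn that result into text in page order, the list can be reversed before extracting the strings. The following is a minimal sketch, assuming the layout from the question and the built-in "html.parser" (the question does not say which parser is used). Because only p tags are matched, text that sits directly inside a pre tag (sections 3 and 6) is not captured by this approach.

```python
from bs4 import BeautifulSoup

# Layout copied from the question; the parser choice is an assumption.
html = """<p> section1 </p>
<p> section2 </p>
<pre> section 3 <p> section 4 </p> </pre>
<p> section 5 </p>
<pre> section 6 </pre>
<form>
<p> section 7 </p>
<textarea> <p> section 8 </p></textarea>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all_previous() walks backwards from the form,
# so reverse the result to restore document order.
paragraphs = list(reversed(soup.form.find_all_previous("p")))
texts = [p.get_text(strip=True) for p in paragraphs]
print(texts)
# ['section1', 'section2', 'section 4', 'section 5']
```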

Answer 2

You can use find_previous_siblings(), which also returns the pre tags:

In[31]: [x for x in soup.form.find_previous_siblings()]
Out[31]: 
[<pre> section 6 </pre>,
 <p> section 5 </p>,
 <pre> section 3 <p> section 4 </p> </pre>,
 <p> section2 </p>,
 <p> section1 </p>]
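Since the siblings come back nearest-first, they can be reversed to read the sections from the top of the page down. This is a rough sketch, assuming the question's layout, the built-in "html.parser", and that every section is a direct sibling of the form (which holds for the sample, but may not hold on other sites).

```python
from bs4 import BeautifulSoup

# Layout copied from the question; the parser choice is an assumption.
html = """<p> section1 </p>
<p> section2 </p>
<pre> section 3 <p> section 4 </p> </pre>
<p> section 5 </p>
<pre> section 6 </pre>
<form>
<p> section 7 </p>
<textarea> <p> section 8 </p></textarea>
"""

soup = BeautifulSoup(html, "html.parser")

# find_previous_siblings() returns the sibling closest to the form first,
# so reverse it to get the sections in document order.
siblings = reversed(soup.form.find_previous_siblings())

# stripped_strings yields each text fragment with surrounding whitespace removed,
# which keeps nested text such as "section 4" inside the pre tag.
sections = [" ".join(tag.stripped_strings) for tag in siblings]
print(sections)
# ['section1', 'section2', 'section 3 section 4', 'section 5', 'section 6']
```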

Answer 3

You can walk the DOM until you reach the form tag. Like this:

import bs4

tag = soup.find('p')  # this will give you the first p tag
data = ''
# Follow the siblings until the form tag (or the last sibling) is reached.
while tag is not None:
    if isinstance(tag, bs4.element.Tag):
        if tag.name == 'form':
            break
        data = data + tag.text  # string concatenation
    # Text nodes between the tags are skipped; just move on.
    tag = tag.next_sibling

print(data)

This will give you output like the following:

section1 
section2 
section 3  section 4  
section 5 
section 6
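Because the question mentions other sites with a similar layout, the same sibling walk can be wrapped in a reusable function. The sketch below is one possible way to do that; the function name text_before_form is hypothetical, the parser choice is an assumption, and it uses itertools.takewhile in place of the explicit loop.

```python
from itertools import takewhile

from bs4 import BeautifulSoup
from bs4.element import Tag


def is_form(node):
    """True only for a <form> tag; text nodes never match."""
    return isinstance(node, Tag) and node.name == "form"


def text_before_form(html):
    """Return the text of every tag from the first p up to the first form."""
    soup = BeautifulSoup(html, "html.parser")  # parser choice is an assumption
    first = soup.find("p")
    if first is None:
        return ""
    # Start at the first p tag, then follow its siblings in document order.
    nodes = [first, *first.next_siblings]
    before_form = takewhile(lambda node: not is_form(node), nodes)
    # Keep tags only, so the whitespace between tags is ignored.
    return "".join(node.get_text() for node in before_form if isinstance(node, Tag))


# Layout copied from the question.
html = """<p> section1 </p>
<p> section2 </p>
<pre> section 3 <p> section 4 </p> </pre>
<p> section 5 </p>
<pre> section 6 </pre>
<form>
<p> section 7 </p>
<textarea> <p> section 8 </p></textarea>
"""

print(text_before_form(html))
```

If a page has no form at all, the walk simply stops at the last sibling and returns all of the text it found.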

BeautifulSoup: find all p tags up to the form