Question

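I am trying to write a Python program that counts the words on a web page. I use Beautiful Soup 4 to scrape the page, but I have difficulty accessing nested HTML tags (for example, a <p class="hello"> inside a <div>).

Whenever I try to find such a tag with the page.findAll() method (page is the Beautiful Soup object containing the whole page), it does not find any, even though they are there. Is there a simple method, or some other way, to do this?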

Answer 1

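My guess is that you want to first look in a specific div tag, then search for all the p tags inside it and count them, or do whatever else you want with them. For example: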

class MainActivity : AppCompatActivity() {
    lateinit var button: Button

    var counter : Int = 0
        set(value) {
            field = value
            observable.onNext(value)
        }
    val observable : BehaviorSubject<Int> = BehaviorSubject.createDefault(counter)

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        setContentView(R.layout.activity_main)

        button = findViewById(R.id.button)

        button.setOnClickListener(View.OnClickListener {
            counter++
        })

        observable.subscribe(
                Consumer { t ->
                    Log.d("fromObservable", counter.toString()) }
        )
    }
}
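
The approach described in this answer (find the div first, then search only inside it for the p tags) can be sketched in Python, the language of the question. This is a minimal sketch, not a definitive implementation: the sample HTML and the variable names `container`, `paragraphs`, and `word_count` are assumptions made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; in practice this would be the downloaded HTML.
html = """
<div id="main">
  <p class="hello">first nested paragraph</p>
  <p class="other">ignored paragraph</p>
  <p class="hello">second one</p>
</div>
<p class="hello">outside the div</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Step 1: locate the container (here simply the first <div> in the document).
container = soup.find("div")

# Step 2: search only inside that div for <p class="hello"> tags.
paragraphs = container.find_all("p", class_="hello") if container else []

# Step 3: count the words in the matched paragraphs.
word_count = sum(len(p.get_text().split()) for p in paragraphs)
print(len(paragraphs), word_count)
```

Because find_all() is called on the div rather than on the whole page, the paragraph outside the div is not matched.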

Hope that helps.

Answer 2

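Update: I noticed that text does not always return the expected result. At the same time, I realized there is a built-in way to get the text: sure enough, reading the documentation, we find a method called get_text(). Use it like this: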

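from bs4 import BeautifulSoup

# Read the local HTML file; the with-block closes it automatically
with open('index.html', 'r') as fd:
    website = fd.read()

soup = BeautifulSoup(website, 'html.parser')  # name the parser explicitly
contents = soup.get_text(separator=" ")
# split() with no argument splits on any run of whitespace
print("number of words %d" % len(contents.split()))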
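
When counting words in the text returned by get_text(), how the string is split matters. Splitting on a single space produces empty strings wherever several spaces occur in a row and leaves newlines inside the pieces, while split() with no argument splits on any run of whitespace. A small sketch on a made-up string (the sample text is an assumption, not taken from a real page):

```python
# Made-up text with repeated spaces and newlines, similar to what
# get_text() can return for indented HTML.
text = "Hello   world\n\nfrom  Beautiful Soup"

naive = text.split(" ")   # splits on every single space
robust = text.split()     # splits on any run of whitespace

print(len(naive), naive)
print(len(robust), robust)
```

The naive split over-counts because of the empty strings, and it also glues "world" and "from" together across the newlines.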

Incorrect, please read the update above. Assuming you have your HTML file locally as index.html, you can:

from bs4 import BeautifulSoup
import re

BLACKLIST = ["html", "head", "title", "script"]  # tags to be ignored

with open('index.html', 'r') as fd:
    website = fd.read()

soup = BeautifulSoup(website, 'html.parser')
tags = soup.find_all(True)  # find every tag
print("there are %d tags" % len(tags))

count = 0
# Non-capturing group, so re.split() does not return the separators themselves
matcher = re.compile(r"(?:\s|\n|<br>)+")
for tag in tags:
    if tag.name.lower() in BLACKLIST:
        continue
    # tag.text also includes the text of nested tags, so nested words are counted repeatedly
    temp = matcher.split(tag.text)  # split using tokens such as \s and \n
    temp = [word for word in temp if word]  # remove empty elements from the list
    count += len(temp)
print("number of words in the document %d" % count)

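Note that the count may not be accurate, for example because of malformed markup, false positives (it counts any word, even if it is code), text that is shown dynamically with JavaScript or CSS, or other reasons.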
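
One way to reduce the "code counted as words" false positives is to drop the script and style tags before extracting the text. The following is only a sketch under assumptions: the sample HTML is made up, and depending on the Beautiful Soup version and parser, get_text() may already skip these contents, in which case the removal step is harmless.

```python
from bs4 import BeautifulSoup

# Made-up page containing CSS and JavaScript that should not be counted.
html = """
<html><head><title>Demo</title>
<style>p { color: red; }</style>
<script>var hidden = "not visible text";</script>
</head><body><p>Two words</p></body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Remove tags whose contents are code rather than visible text.
for tag in soup(["script", "style"]):
    tag.decompose()

words = soup.get_text(separator=" ").split()
print(len(words), words)
```

Text that JavaScript inserts at runtime is still invisible to this approach, since Beautiful Soup only sees the static HTML.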

Answer 3

Try this:

data = []
for nested_soup in soup.find_all('xyz'):
    data.extend(nested_soup.find_all('abc'))
# data now holds every 'abc' tag found inside an 'xyz' tag

Maybe you could turn it into a lambda and make it look cool, but this works. Thanks.
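
For what it's worth, a more compact form of the same loop is a nested list comprehension. A minimal sketch, where `find_nested` is a hypothetical helper name and the sample HTML is made up for illustration:

```python
from bs4 import BeautifulSoup


def find_nested(soup, outer, inner):
    """Collect every `inner` tag found inside any `outer` tag (hypothetical helper)."""
    return [tag for parent in soup.find_all(outer) for tag in parent.find_all(inner)]


# Made-up document: three <p> tags inside divs, one outside.
html = "<div><p>a</p><p>b</p></div><p>c</p><div><p>d</p></div>"
soup = BeautifulSoup(html, "html.parser")

texts = [p.get_text() for p in find_nested(soup, "div", "p")]
print(texts)
```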

Answer 4

You don't need to write a for loop. If you like, you can nest the soups.

BeautifulSoup(
    str(BeautifulSoup(page_source, 'html.parser').findAll('div')),
    'html.parser'
    ).findAll('p', {'class': 'hello'})
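
One caveat with re-parsing: when divs are nested inside other divs, the inner div is stringified both as part of its parent and on its own, so its paragraphs can show up more than once. The sketch below compares the re-parsing approach with a single-parse alternative, a CSS descendant selector via select(), which is a different technique from the one in this answer. The sample HTML is an assumption made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up page: one <p class="hello"> inside two nested divs, one outside any div.
page_source = """
<div><div><p class="hello">inside nested divs</p></div></div>
<p class="hello">outside any div</p>
"""

soup = BeautifulSoup(page_source, "html.parser")

# Re-parsing approach: stringify every div, parse again, then search.
divs_as_text = str(soup.find_all("div"))
reparsed = BeautifulSoup(divs_as_text, "html.parser").find_all("p", class_="hello")

# Single-parse alternative: a CSS descendant selector on the original soup.
selected = soup.select("div p.hello")

print(len(reparsed), len(selected))
```

In this sample the re-parsed result contains the nested paragraph twice, while the selector returns it once; neither includes the paragraph outside the divs.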

Beautiful Soup nested tag search
