Question

This is an example of the variable 'extract' when I run my Python code: Vulkaanuitbarsting en aardbevingen dichtbij de hoofdstad van IJsland

When I run my code, I currently get the text 'Vulkaanuitbarsting en aardbevingen dichtbij de hoofdstad van IJsland'.
What I actually want is the part after 'title=', so this text: 'In het zuidwesten van IJsland is de vulkaan Fagradalsfjall uitgebarsten. Dat heeft de meteorologische dienst van het land laten weten. De uitbarsting is vooralsnog beperkt, maar er zijn wel twee grote lavastromen. De voorbije weken is IJsland al opgeschrikt door tienduizenden aardbevingen. In de loop van de dag is de kracht van de uitbarsting afgenomen.'

I am new to this and find it hard to understand. Can someone point me in the right direction?

Here is my code so far.

import requests
from bs4 import BeautifulSoup

url = 'https://www.dekrantenkoppen.be/full/de_redactie'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
#print(soup)

content=''
rows = soup.find_all('a')
tel=0
for row in rows:
    if tel != 0:
        #'tel' is only used to skip the first returned result. The first resultline is something that i don't need. I only need all the text after every 'title ='

        print(row)
        extract=row.get_text()
        print('')
        print(extract)
        content=content+extract+'\n'

    
    if tel == 10:
        #a loop of max 10 times gives me enough information, i only need the first 10 articles
        break
    else:
        tel=tel+1
        
#show me the result-text of the crawling
print('')
print('RESULT TEXT OF THE FIRST 10 ARTICLES:')
print(content)

Answer 1

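You want the title attribute. Going by the description, the following use of nth-child will give you the first 10 descriptions. I also needed the 'lxml' parser; run pip3 install lxml if it is not installed.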

import requests
from bs4 import BeautifulSoup

url = 'https://www.dekrantenkoppen.be/full/de_redactie'
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
print([i['title'] for i in soup.select('li:nth-child(-n+10) > a:nth-child(2)')])

CSS:

li:nth-child(-n+10) > a:nth-child(2)

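This matches li elements that are among the first 10 children of their parent and, within each one, selects the a tag that is its second child (so the li must have at least two children). The > is a child combinator: what is on its right must be a direct child of what is on its left.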
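To see the selector in isolation, here is a minimal sketch that runs it against a small inline document instead of the live site. The markup in SAMPLE_HTML is hypothetical (it only assumes that each li holds two a tags, with the description in the title of the second one), and the limit is lowered to 2 so the cut-off is visible. It uses the built-in "html.parser" so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; not copied from the real page.
SAMPLE_HTML = """
<ul>
  <li title="date 1"><a href="/src">Source</a><a href="/a1" title="Summary one">Headline one</a></li>
  <li title="date 2"><a href="/src">Source</a><a href="/a2" title="Summary two">Headline two</a></li>
  <li title="date 3"><a href="/src">Source</a><a href="/a3" title="Summary three">Headline three</a></li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# li:nth-child(-n+2) keeps only the first two li children of the ul;
# "> a:nth-child(2)" then picks the a tag that is the second child of each li.
first_two = [a["title"] for a in soup.select("li:nth-child(-n+2) > a:nth-child(2)")]
print(first_two)  # ['Summary one', 'Summary two']
```

The third li is skipped because it falls outside the -n+2 range; with -n+10 the same pattern keeps up to ten items.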

For the additional request for a for loop that also shows the publication date/time:

import requests
from bs4 import BeautifulSoup

url = 'https://www.dekrantenkoppen.be/full/de_redactie'
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")

for i in soup.select('li:nth-child(-n+10) > a:nth-child(2)'):
    print(i.parent['title'])
    print(i.parent.a.text.strip())
    print(i['title'])
    print()

Answer 2

If I understand your question correctly, you should try modifying your code as follows:

...
for row in rows:
    if tel != 0:
        print(row)
        extract=row["title"]
        content=content+extract.replace("Meer informatie", "")+'\n'
...

Answer 3

Having taken a quick look at what you are doing, just replace get_text() with the following and you should be fine:

.get('title')
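One practical difference from row["title"] is worth noting: indexing raises KeyError on a tag that has no title attribute, while .get('title') returns None. A minimal sketch, using hypothetical inline markup in which the first link has no title:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first link lacks a title attribute, the second has one.
SAMPLE_HTML = '<a href="/home">Home</a><a href="/a1" title="Summary one">Headline one</a>'

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

titles = []
for row in soup.find_all("a"):
    title = row.get("title")  # None when absent; row["title"] would raise KeyError here
    if title is not None:
        titles.append(title)

print(titles)  # ['Summary one']
```

If only links with a title are of interest, soup.find_all("a", title=True) filters them up front, which may also remove the need for the counter that skips the first result.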

Can't get the correct text when web scraping a page (using Python 3)

3 answers: