Question

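I'm trying to get song lyrics from a web page. Below are two versions of my attempt. With the first one I only get the text of the first <p> paragraph, but sometimes the div class songbook contains several <p> elements. The second version does get all of them, but the output includes the whole HTML. ".text" only works on a single item, not on multiple items (a list).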

I'm lost here, and I'm still new to Python and BeautifulSoup, so any help is much appreciated.

import sys
import urllib.request

from bs4 import BeautifulSoup

# Extract the songtext only and save it in a file
url = urllib.request.urlopen('https://www.udo-lindenberg.de/mit_dir_sogar_n_kind.57754.htm')
content = url.read()
soup = BeautifulSoup(content, 'lxml')

# Search the page for div class "block songbook" and extract the songtext between <p> tags
table = soup.find_all('div', attrs={"class":"block songbook"})
for item in table:
    sys.stdout = open('output.txt','wt')
    # find() returns only the first <p> inside the div
    songtext = item.find('p').text
    print(songtext)
# Extracts the songtext together with the HTML markup
import requests

page_link = 'https://www.udo-lindenberg.de/mit_dir_sogar_n_kind.57754.htm'
page_response = requests.get(page_link, timeout=5)
page_content = BeautifulSoup(page_response.content, "html.parser")
textContent = []
for i in range(0,200):
    # Note: this searches `soup` from the first version, not `page_content`.
    # find_all() returns a list of tags, so printing it shows the raw HTML.
    paragraphs = soup.find_all('div', attrs={"class":"block songbook"})
    textContent.append(paragraphs)
    sys.stdout = open('output2.txt','wt')
    print(paragraphs)

Answer 1

OK, I solved it myself. I found the error. In the second version, the line:

paragraphs = soup.find_all('div', attrs={"class":"block songbook"})

has to be changed to:

paragraphs = soup.find('div', attrs={"class":"block songbook"}).text
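
The change works because find() returns a single tag, and .text on a tag covers everything nested inside it, while find_all() returns a list of tags. If you want to keep the paragraphs separate, here is a minimal sketch of one way to do it. The HTML in SAMPLE_HTML is a made-up stand-in for the real page, and extract_songtext and save_songtext are hypothetical helper names, not something from the original code.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the real page (an assumption for illustration only):
# one "block songbook" div holding several <p> tags.
SAMPLE_HTML = """
<div class="block songbook">
  <p>First verse, line one<br>First verse, line two</p>
  <p>Second verse</p>
</div>
<div class="footer"><p>Not part of the lyrics</p></div>
"""


def extract_songtext(html):
    """Return the text of every <p> inside the 'block songbook' div."""
    soup = BeautifulSoup(html, "html.parser")
    # find() returns a single Tag (or None if nothing matches).
    block = soup.find("div", attrs={"class": "block songbook"})
    if block is None:
        return ""
    # One string per paragraph; text on either side of a <br> is
    # joined with a newline.
    verses = [p.get_text(separator="\n", strip=True)
              for p in block.find_all("p")]
    return "\n\n".join(verses)


def save_songtext(text, path):
    """Write the text to a file without reassigning sys.stdout."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


songtext = extract_songtext(SAMPLE_HTML)
print(songtext)
```

For the real page you would pass page_response.content (or the bytes from urlopen) to extract_songtext and then call save_songtext(songtext, 'output.txt'). Writing through a with block closes the file for you, and print keeps working normally afterwards.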

Extracting text with Python from a web page within a div class
