I want to parse a post from Quora, or more generally any post that contains code. Example: http://qr.ae/Rkplrt
Using Selenium, a Python library, I can get the HTML of the post:
import html2text

h = html2text.HTML2Text()
# ans is the Selenium WebElement for the answer, located earlier (not shown)
content = ans.find_element_by_class_name('inline_editor_value')
html_string = content.get_attribute('innerHTML')
text = h.handle(html_string)
print(text)
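I would like it all to come out as one big block of text. But for the tables that contain code, html2text inserts many \n characters and does not handle the line-number column.
This is what I see:
https://imageshack.com/i/paEKbzT4p (the main div containing the table with the code)
https://imageshack.com/i/hlIxFayop (the text extracted by html2text)
https://imageshack.com/i/hlHFBXvQp (the final printed text, showing the line-number and extra \n problems)
I have already tried different settings, such as bypass_tables, from the usage guide on GitHub (https://github.com/Alir3z4/html2text/blob/master/docs/usage.md#available-options), without success.
Can anyone tell me how to use html2text in this case?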
Answer 0 (score: 1)
You don't actually need HTML2Text at all. selenium can give you the "text" directly:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("http://qr.ae/Rkplrt")
print(driver.find_element_by_class_name('inline_editor_content').text)
It prints the content of the post:
The single line of code must be useful, not something meant to be confusing or obfuscating.
...
What examples have you created or encountered ?
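If the extracted text still contains runs of blank lines (the extra \n symptom described in the question), it can be tidied up afterwards. The following is only a minimal sketch using the standard library; the helper name collapse_blank_lines and the sample string are made up for illustration and are not part of selenium or html2text.

```python
import re


def collapse_blank_lines(text: str) -> str:
    """Strip trailing spaces on each line and squeeze runs of blank lines to one."""
    lines = [line.rstrip() for line in text.splitlines()]
    # Three or more consecutive newlines mean two or more blank lines; keep one
    squeezed = re.sub(r"\n{3,}", "\n\n", "\n".join(lines))
    return squeezed.strip()


# Hypothetical sample standing in for text extracted from a code table
sample = "int main()\n\n\n\n{\n   \n\n\n}\n"
cleaned = collapse_blank_lines(sample)
print(cleaned)
```

The same function can be applied to the string returned by .text or by h.handle(), since it only looks at newlines and trailing whitespace.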
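Answer 1 (score: 1)
You can use BeautifulSoup to do some simple HTML parsing. (I used urllib to communicate with the site because I'm not familiar with selenium, but I'm sure it would work with selenium as well.)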
import urllib.request  # a bare "import urllib" does not reliably load the request submodule
from bs4 import BeautifulSoup
# urllib opener
opener = urllib.request.build_opener(
    urllib.request.HTTPRedirectHandler(),
    urllib.request.HTTPHandler(debuglevel=0),
    urllib.request.HTTPSHandler(debuglevel=0))
# Get page
html = opener.open("http://qr.ae/Rkplrt").read()
# Create BeautifulSoup object
soup = BeautifulSoup(html, "lxml")
# Find the HTML element you want
answer = soup.find('div', { 'class' : 'ExpandedQText ExpandedAnswer' })
# Remove the stuff you don't want
answer.find('td', { 'class' : 'linenos' }).extract()
answer.find('div', { 'class' : 'ContentFooter AnswerFooter' }).extract()
# Print
print("\n".join(answer.stripped_strings))
I'm not entirely sure what you want to extract. The above gives the answer, including the code, without the line numbers:
This is:
#include <stdio.h>
int v,i,j,k,l,s,a[99];
main()
{
for(scanf("%d", &s);*a-s;v=a[j*=v]-a[i],k=i<s,j+=(v=j<s&&(!k&&!!printf(2+"\n\n%c"-(!l<<!j)," #Q"[l^v?(l^j)&1:2])&&++l||a[i]<s&&v&&v-i+j&&v+i-j))&&!(l%=s),v||(i==j?a[i+=k]=0:++a[i])>=s*k&&++a[--i]);
}
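To see the line-number removal in isolation, without fetching the page, here is a rough sketch that uses the standard-library html.parser instead of BeautifulSoup. The sample markup is an assumption (a table with a linenos cell next to a code cell, as the class names in the script suggest), not Quora's actual HTML, and the class name SkipLineNumbers is made up for this example.

```python
from html.parser import HTMLParser


class SkipLineNumbers(HTMLParser):
    """Collect text, ignoring everything inside <td class="linenos">."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a linenos cell
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            # Track nested <td> so the matching </td> ends the skip
            if tag == "td":
                self.skip_depth += 1
        elif tag == "td":
            classes = (dict(attrs).get("class") or "").split()
            if "linenos" in classes:
                self.skip_depth = 1

    def handle_endtag(self, tag):
        if self.skip_depth and tag == "td":
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


# Hypothetical markup standing in for the code table in the post
sample_html = (
    '<table><tr>'
    '<td class="linenos"><pre>1\n2</pre></td>'
    '<td class="code"><pre>int x = 1;\nreturn x;</pre></td>'
    '</tr></table>'
)

parser = SkipLineNumbers()
parser.feed(sample_html)
parser.close()
code_text = "".join(parser.chunks)
print(code_text)
```

With BeautifulSoup available, the find(...).extract() call in the script above is shorter; this version only shows that the idea is to drop the whole line-number cell before collecting text.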
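Update: The OP asked for <a> and <img> tags to be replaced with their href and src values. The version of my script below should handle that. It also handles multiple answers.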
import urllib.request  # a bare "import urllib" does not reliably load the request submodule
from bs4 import BeautifulSoup
# urllib opener
opener = urllib.request.build_opener(
    urllib.request.HTTPRedirectHandler(),
    urllib.request.HTTPHandler(debuglevel=0),
    urllib.request.HTTPSHandler(debuglevel=0))
# Get page
html = opener.open("https://www.quora.com/Is-it-too-late-for-an-X-year-old-to-learn-how-to-program").read()
# Create BeautifulSoup object
soup = BeautifulSoup(html, "lxml")
# Place to store the final output
output = ''
# Find the HTML elements you want
answers = soup.find_all('div', { 'class' : 'ExpandedQText ExpandedAnswer' })
for answer in answers:
    # Remove the stuff you don't want (not every answer has these elements)
    linenos = answer.find('td', { 'class' : 'linenos' })
    if linenos is not None:
        linenos.extract()
    footer = answer.find('div', { 'class' : 'ContentFooter AnswerFooter' })
    if footer is not None:
        footer.extract()
    # Replace <a> with its url (only links that actually have an href)
    for link in answer.select('a[href]'):
        url = link['href']
        link.insert_after(url)
        link.extract()
    # Replace <img> with its src
    for img in answer.select('img[src]'):
        url = img['src']
        img.insert_after(url)
        img.extract()
    # Attach to output
    output += "\n".join(answer.stripped_strings) + '\n\n'
# Print
print(output)
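The tag replacement can also be tried offline on a small snippet. The sketch below uses the standard-library html.parser rather than BeautifulSoup, so it is an illustration of the same idea and not the script above; the class name LinksToUrls and the example.com URLs are invented for the example.

```python
from html.parser import HTMLParser


class LinksToUrls(HTMLParser):
    """Emit text, replacing each <a> with its href and each <img> with its src."""

    def __init__(self):
        super().__init__()
        self.in_link = 0  # > 0 while inside an <a> element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self.in_link += 1
            if attrs.get("href"):
                self.chunks.append(attrs["href"])
        elif tag == "img" and not self.in_link and attrs.get("src"):
            self.chunks.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link -= 1

    def handle_data(self, data):
        # The link text itself is dropped, as in the script above
        if not self.in_link:
            self.chunks.append(data)


# Hypothetical markup; the URLs are placeholders
sample_html = (
    'See <a href="https://example.com/docs">the docs</a> and '
    '<img src="https://example.com/pic.png"> here.'
)

parser = LinksToUrls()
parser.feed(sample_html)
parser.close()
result = "".join(parser.chunks)
print(result)
```

As in the BeautifulSoup version, the visible link text is discarded and only the URL is kept; if both are wanted, the data handler could append the text as well.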