I am trying to scrape an article from Time, but the class I am targeting does not work. I can't figure out what the problem is.
    import requests
    from lxml import html

    def timeParse(link):
        page = requests.get("http://time.com/5556373/jared-kushner-ivanka-trump-private-email-whatsapp/")
        tree = html.fromstring(page.content)
        print(tree)
        word = tree.xpath('//*[@class="article"]')
        print(word)
        title = tree.xpath('//h1[@class="headline"]')
        print(title.text)
        articleContent = {}
        contentList = []
        pTag = word[0].xpath('//p')
        print(pTag[0])
        for x in range(len(word)):
            print(word[x].text)
            contentList.append(word[x].text)
        articleContent["content"] = contentList
        articleContent["title"] = title[0].text
        return articleContent
Answer 0 (score: 2)
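The page is rendered with JavaScript, and there is a landing page that asks you to agree to the terms before the article is shown. You can scrape it with Selenium, which drives a real browser and therefore renders the JavaScript.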
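One way to confirm this diagnosis is to test whether the markup you received contains the article container at all before indexing into the XPath result. The following is a minimal sketch, not a definitive check: the two HTML strings are hypothetical stand-ins for the consent page and the rendered article, and `has_article` is a helper name introduced here for illustration.

```python
from lxml import html

# Hypothetical stand-in for the consent page a plain HTTP client receives.
CONSENT_PAGE = """
<html><body>
  <form><input type="submit" value="Continue"/></form>
</body></html>
"""

# Hypothetical stand-in for the article page after JavaScript has run.
RENDERED_PAGE = """
<html><body>
  <h1 class="headline">Example headline</h1>
  <div id="article-body"><p>First paragraph.</p></div>
</body></html>
"""


def has_article(page_source):
    """Return True if the markup contains the article container."""
    tree = html.fromstring(page_source)
    # xpath() returns a list; an empty list means nothing matched.
    return bool(tree.xpath('//*[@id="article-body"]'))


print(has_article(CONSENT_PAGE))   # False
print(has_article(RENDERED_PAGE))  # True
```

If the check fails on `page.content` from `requests`, indexing the result with `[0]` raises an `IndexError`, which is consistent with the selector appearing not to work.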
First, install Selenium:
    sudo pip3 install selenium
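(On Windows you do not need sudo, and you may want pip rather than pip3.)
Then download the driver from https://sites.google.com/a/chromium.org/chromedriver/downloads (depending on your operating system, you may need to specify the driver's location).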
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from lxml import html
    import time

    def timeParse(link):
        browser = webdriver.Chrome()
        browser.get(link)
        time.sleep(3)  # give the consent page time to load
        # Accept the terms on the landing page.
        # Selenium 4.3 and later removed find_element_by_xpath; use find_element(By.XPATH, ...).
        browser.find_element(By.XPATH, "//input[@value='Continue']").click()
        time.sleep(3)  # give the article time to render
        tree = html.fromstring(browser.page_source)
        browser.quit()
        word = tree.xpath('//*[@id="article-body"]')
        title = tree.xpath('//h1[@class="headline heading-content margin-8-top margin-16-bottom"]')
        articleContent = {}
        contentList = []
        # Note: '//p' matches every <p> in the document, not only those inside word[0].
        pTag = word[0].xpath('//p')
        for p in pTag:
            contentList.append(p.text)
        articleContent["content"] = contentList
        articleContent["title"] = title[0].text
        return articleContent

    print(timeParse("http://time.com/5556373/jared-kushner-ivanka-trump-private-email-whatsapp/"))
Output:
{'content': ['House Democrats claim that President Donald Trump’s son-in-law used an encrypted messenger app for official White House business in what would be a serious breach of records laws.', 'House Oversight Chairman Elijah Cummings sent a letter to White House Counsel Pat Cipollone informing him that he had learned from Kushner’s lawyer, Abbe Lowell, that ', 'According to Cummings, Lowell “could not answer” questions about whether those communications included classified information, which would be a serious breach of security protocol.', 'In addition, the Maryland Democrat said that had also learned from Lowell that Ivanka Trump is still receiving work-related emails on her personal email account that she does not forward to her official White House account.', 'If true, the claims would raise questions about the White House’s handling of online security and classified information, the very same charges that Trump successfully used against Hillary Clinton over her private email server in the 2016 election.', 'On Thursday, Lowell sent an email to Cummings denying that he ever told the chairman Kushner had ever communicated with foreign officials, and that he had followed the proper protocol for classified information. He also denied that he told the committee that Ivanka Trump does not forward relevant personal emails.', 'The committee also said it obtained documents showing former top White House strategist Steve Bannon and former Deputy National Security Adviser K.T. McFarland had used their personal email accounts for White House business regarding the transfer of nuclear technology to Saudi Arabia.', 'Cummings requested documentation into the use of email by March 28. But, judging by the precedent that has been set by the White House, he is unlikely to receive it. 
As he himself noted in Thursday’s letter, the initial deadline of January 11 for these documents came and went.', '“As you know, the White House has not produced a single piece of paper to the committee in the 116th Congress — in this or any other investigation,” Cummings wrote.', 'These actions, Cummings argued in his letter, are “obstructing the Committee’s investigation into allegations of violations of federal records laws by White House officials.” The sentiments were a reiteration of the op-ed he had written in the Washington ', 'But, if Cummings needed any more ammunition to prove his point, he quickly got it. Separately on Thursday, Cipollone sent a reply to Cummings and two of his colleagues rejecting a request for documents on another matter: transcripts of the President’s conversations with Russian counterpart Vladimir Putin. As he has before, Cipollone argued the request was an overreach beyond Congress’ constitutional responsibility.', '“While we respectfully seek to accommodate appropriate oversight requests, we are unaware of any precedent supporting such sweeping requests,” he wrote in a letter to Cummings, House Permanent Select Committee on Intelligence Chairman Adam Schiff, and Committee on Foreign Affairs Chairman Eliot Engel. “Rather, the Supreme Court and administrations of both parties have consistently recognized that the conduct of foreign affairs is a matter that the Constitution assigns exclusively to the President.”', 'In a statement, the three lawmakers said the response “continues a troubling pattern by the Trump Administration of rejecting legitimate and necessary congressional oversight with no regard for precedent or the constitution.”', 'The back-and-forth exchanges have followed a fairly predictable pattern and foreshadow what could be the biggest showdown of all: the fight to see the entirety of ', 'Jurisdiction over that is squarely in the hands of Attorney General William Barr, rather than the White House Counsel’s office. 
But ', 'There is no telling what Barr may do. Republicans, ', None, None], 'title': 'Jared Kushner Used Encrypted App to Communicate With Foreign Leaders, Democrats Claim'}
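Several entries in this output stop mid-sentence, and the list ends with `None` values. A likely cause is that an element's `.text` holds only the text before its first child element, so a paragraph containing a link is cut off at the link, and a paragraph that starts with a tag yields `None`. The sketch below shows the difference on a small hypothetical sample using `text_content()`, which collects the text of the element and all its descendants. It also uses the relative path `.//p` so that only paragraphs inside the selected element are matched.

```python
from lxml import html

# Hypothetical sample that mimics paragraphs with inline tags.
SAMPLE = """
<div id="article-body">
  <p>Plain paragraph.</p>
  <p>Text before <a href="#">a link</a> and after it.</p>
  <p><strong>Starts with a tag.</strong></p>
</div>
"""

body = html.fromstring(SAMPLE)
# './/p' is relative to body; '//p' would search the whole document.
paragraphs = body.xpath('.//p')

# .text stops at the first child element and is None if the element starts with one.
text_only = [p.text for p in paragraphs]

# text_content() joins the text of the element and all of its descendants.
full_text = [p.text_content().strip() for p in paragraphs]

print(text_only)  # ['Plain paragraph.', 'Text before ', None]
print(full_text)  # ['Plain paragraph.', 'Text before a link and after it.', 'Starts with a tag.']
```

Applying the same two changes inside `timeParse` should give complete paragraphs, although the exact result depends on the page's current markup.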