I'm trying to scrape LexisNexis. I want to retrieve the headline, source, and date of each news story. Below is the code I wrote after running the search with Selenium. I can't save the data to a CSV file: I keep getting encoding errors, and when I don't get an encoding error, the data comes through padded with lots of whitespace and strange characters like `\t\t\t\t` and `\n`.
Here is an example of what I retrieved:
["\n\t\t\t\tNetworks Continue Hammering Indiana for Sparking a 'Firestorm' Over Religious Freedom Law\n\t\t\t", "\n\t\t\t\tAll Three Networks Pile on Indiana's 'Controversial' Law\n\t\t\t", "\n\t\t\t\tABC Continues Obsessively Bashing 'Controversial' 'Religious Freedom' Bill\n\t\t\t", "\n\t\t\t\tABC, NBC Rush to Paint Trump as a 'Moderate,' 'Trump 2.0'\n\t\t\t", "\n\t\t\t\tCBS Hits the Panic Button, Rails Against Religious Freedom Bills in Georgia, North Carolina\n\t\t\t", "\n\t\t\t\tJihad Report - October 7, 2016\n\t\t\t", "\n\t\t\t\tEducation News Roundup: May 2, 2016\n\t\t\t", "\n\t\t\t\tNBC CBS Keep Up Attack on Religious Freedom Laws\n\t\t\t", "\n\t\t\t\tNBC Slams Indiana Religious Freedom Law...Then Starts Week-Long Series on Faith\n\t\t\t", "\n\t\t\t\tNetworks Again Bash Indiana for Causing 'National Outcry' and 'Uproar' Over Religious Freedom Law\n\t\t\t"]
This happens with the headlines, dates, and sources alike. I'm not sure what I'm doing wrong.
scd = browser.page_source
soup = BeautifulSoup(scd, "lxml")

headlines = []
for headline in soup.findAll('a', attrs={"data-action": "title"}):
    head_line = headline.get_text()
    #head_line.strip('a>, <a data-action="title" href="#">')
    #head_line.encode('utf-8')
    Headlines = head_line.encode()
    headlines.append(head_line)
sources = []
for sources in soup.findAll('a', attrs={"class": "rightpanefiltercontent notranslate", "href": "#"}):
    source_only = sources.get_text()
    source_only.encode('utf-8')
    sources.append(source_only)
Sources = sources.encode()
dates = []
for dates in soup.findAll('a', attrs={"class": "rightpanefiltercontent"}):
    date_only = dates.get_text()
    date_only.strip('<a class="rightpanefiltercontent" href="#">')
    date_only.encode()
    dates.append(date_only)
Dates = dates.encode()
news = [Headlines, Sources, Dates]

result = "/Users/danashaat/Desktop/Tornadoes/IV Search News Results/data.csv"
with open(result, 'w') as result:
    newswriter = csv.writer(result, dialect='excel')
    newswriter.writerow(News)
Also, here is the result when I find the headlines:
[<a data-action="title" href="#">
Networks Continue Hammering Indiana for Sparking a 'Firestorm' Over Religious Freedom Law
</a>, <a data-action="title" href="#">
All Three Networks Pile on Indiana's 'Controversial' Law
</a>, <a data-action="title" href="#">
ABC Continues Obsessively Bashing 'Controversial' 'Religious Freedom' Bill
</a>, <a data-action="title" href="#">
ABC, NBC Rush to Paint Trump as a 'Moderate,' 'Trump 2.0'
</a>, <a data-action="title" href="#">
CBS Hits the Panic Button, Rails Against Religious Freedom Bills in Georgia, North Carolina
</a>, <a data-action="title" href="#">
Jihad Report - October 7, 2016
</a>, <a data-action="title" href="#">
Education News Roundup: May 2, 2016
</a>, <a data-action="title" href="#">
NBC CBS Keep Up Attack on Religious Freedom Laws
</a>, <a data-action="title" href="#">
NBC Slams Indiana Religious Freedom Law...Then Starts Week-Long Series on Faith
</a>, <a data-action="title" href="#">
Networks Again Bash Indiana for Causing 'National Outcry' and 'Uproar' Over Religious Freedom Law
</a>]
I've been struggling with this for a while, so any help would be greatly appreciated.
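As a side note on the whitespace itself: the `\n\t\t\t\t` padding comes straight from the page markup, and plain `str.strip()` (or BeautifulSoup's `get_text(strip=True)`) removes it. A minimal standalone sketch, using two of the headlines above as hypothetical input and an in-memory buffer in place of the real CSV file:

```python
import csv
import io

# Two raw strings shaped like what get_text() returns for each headline
raw_headlines = [
    "\n\t\t\t\tJihad Report - October 7, 2016\n\t\t\t",
    "\n\t\t\t\tEducation News Roundup: May 2, 2016\n\t\t\t",
]

# str.strip() with no arguments removes all leading/trailing whitespace,
# including the \n and \t runs wrapped around each title
headlines = [h.strip() for h in raw_headlines]

# Write one cleaned headline per row; io.StringIO stands in for the real file
buf = io.StringIO()
writer = csv.writer(buf, dialect="excel")
writer.writerows([[h] for h in headlines])
print(buf.getvalue())
```

On Python 3, opening the real file as `open(path, 'w', encoding='utf-8', newline='')` sidesteps the encoding errors and the extra blank rows the `csv` module can otherwise produce on Windows.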
Answer 0 (score: 1)
You can anchor the element search to the `div` with class `"item"`:
from selenium import webdriver
from bs4 import BeautifulSoup as soup
import csv
import re  # needed for re.findall below

d = webdriver.Chrome()
d.get('https://www.lexisnexis.com/en-us/home.page')

# For each class="item" div, collect the link href plus the text of its
# <time>, <h5>, and <p> children
results = [[(lambda x: x['href'] if i == 'a' else getattr(x, 'text', None))(c.find(i))
            for i in ['a', 'time', 'h5', 'p']]
           for c in soup(d.page_source, 'html.parser').find_all('div', {'class': 'item'})]

with open('lexisNexis.csv', 'w') as f:
    write = csv.writer(f)
    write.writerows([['source', 'timestamp', 'tags', 'headline'],
                     *[re.findall(r'(?<=//www\.)\w+(?=\.com)', a) + b
                       for a, *b in results if all([a, *b])]])
Output:
source,timestamp,tags,headline
law360,04 Sep 2018,Labor & Employment Law,11th Circ. Revives Claim In Ex-Aaron's Worker FMLA Suit
law360,04 Sep 2018,Workers' Compensation,Back To School: Widener's Rod Smolla Talks Free Speech
law360,04 Sep 2018,Tax Law,Ex-Sen. Kyl Chosen To Take Over McCain's Senate Seat
law360,04 Sep 2018,Energy,Mass. Top Court Says Emission Caps Apply To Electric Cos.
lexisnexis,04 Sep 2018,Immigration Law,Suspension of Premium Processing: Another Attack On the H-1B Program (Cyrus Mehta)
law360,04 Sep 2018,Real Estate Law,Privilege Waived For Some Emails In NJ Real Estate Row
law360,04 Sep 2018,Banking & Finance,Cos. Caught Between Iran Sanctions And EU Blocking Statute
law360,04 Sep 2018,Mergers & Acquisitions,Former Paper Co. Tax VP Sues For Severance Pay After Merger
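For what it's worth, the lookaround pattern in the answer is what turns each article URL into the bare site name in the `source` column: a lookbehind for `//www.` and a lookahead for `.com` capture only the characters between them. A small standalone check (the URL is made up):

```python
import re

# Lookbehind for "//www." and lookahead for ".com" match just the site name
url = "https://www.law360.com/articles/example"
print(re.findall(r"(?<=//www\.)\w+(?=\.com)", url))  # ['law360']
```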