I'm trying to scrape LexisNexis. I want to retrieve the headline, source, and date of each news story. Below is the code I wrote after running the search with Selenium. I can't save the data to a CSV file: I keep getting encoding errors, and when I don't get an encoding error, the data comes through padded with lots of whitespace and strange characters like `\t\t\t\t` and `\n`.
Here is an example of what I retrieved:
["\n\t\t\t\tNetworks Continue Hammering Indiana for Sparking a 'Firestorm' Over Religious Freedom Law\n\t\t\t", "\n\t\t\t\tAll Three Networks Pile on Indiana's 'Controversial' Law\n\t\t\t", "\n\t\t\t\tABC Continues Obsessively Bashing 'Controversial' 'Religious Freedom' Bill\n\t\t\t", "\n\t\t\t\tABC, NBC Rush to Paint Trump as a 'Moderate,' 'Trump 2.0'\n\t\t\t", "\n\t\t\t\tCBS Hits the Panic Button, Rails Against Religious Freedom Bills in Georgia, North Carolina\n\t\t\t", "\n\t\t\t\tJihad Report - October 7, 2016\n\t\t\t", "\n\t\t\t\tEducation News Roundup: May 2, 2016\n\t\t\t", "\n\t\t\t\tNBC CBS Keep Up Attack on Religious Freedom Laws\n\t\t\t", "\n\t\t\t\tNBC Slams Indiana Religious Freedom Law...Then Starts Week-Long Series on Faith\n\t\t\t", "\n\t\t\t\tNetworks Again Bash Indiana for Causing 'National Outcry' and 'Uproar' Over Religious Freedom Law\n\t\t\t"]
This happens with the headlines, dates, and sources alike. I'm not sure what I'm doing wrong.
scd = browser.page_source
soup = BeautifulSoup(scd, "lxml")

headlines = []
for headline in soup.findAll('a', attrs={"data-action": "title"}):
    head_line = headline.get_text()
    #head_line.strip('a>, <a data-action="title" href="#">')
    #head_line.encode('utf-8')
    Headlines = head_line.encode()
    headlines.append(head_line)
sources = []
for sources in soup.findAll('a', attrs={"class": "rightpanefiltercontent notranslate", "href": "#"}):
    source_only = sources.get_text()
    source_only.encode('utf-8')
    sources.append(source_only)
Sources = sources.encode()
dates = []
for dates in soup.findAll('a', attrs={"class": "rightpanefiltercontent"}):
    date_only = dates.get_text()
    date_only.strip('<a class="rightpanefiltercontent" href="#">')
    date_only.encode()
    dates.append(date_only)
Dates = dates.encode()
news = [Headlines, Sources, Dates]

result = "/Users/danashaat/Desktop/Tornadoes/IV Search News Results/data.csv"
with open(result, 'w') as result:
    newswriter = csv.writer(result, dialect='excel')
    newswriter.writerow(News)
Also, here is the result when I find the headlines:
[<a data-action="title" href="#">
Networks Continue Hammering Indiana for Sparking a 'Firestorm' Over Religious Freedom Law
</a>, <a data-action="title" href="#">
All Three Networks Pile on Indiana's 'Controversial' Law
</a>, <a data-action="title" href="#">
ABC Continues Obsessively Bashing 'Controversial' 'Religious Freedom' Bill
</a>, <a data-action="title" href="#">
ABC, NBC Rush to Paint Trump as a 'Moderate,' 'Trump 2.0'
</a>, <a data-action="title" href="#">
CBS Hits the Panic Button, Rails Against Religious Freedom Bills in Georgia, North Carolina
</a>, <a data-action="title" href="#">
Jihad Report - October 7, 2016
</a>, <a data-action="title" href="#">
Education News Roundup: May 2, 2016
</a>, <a data-action="title" href="#">
NBC CBS Keep Up Attack on Religious Freedom Laws
</a>, <a data-action="title" href="#">
NBC Slams Indiana Religious Freedom Law...Then Starts Week-Long Series on Faith
</a>, <a data-action="title" href="#">
Networks Again Bash Indiana for Causing 'National Outcry' and 'Uproar' Over Religious Freedom Law
</a>]
I've been struggling with this for a while, so any help would be greatly appreciated.
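As a side note on the whitespace itself: the `\n\t\t\t\t` padding comes straight from the page markup, and plain `str.strip()` (or BeautifulSoup's `get_text(strip=True)`) removes it. A minimal standalone sketch, using two of the headlines above as hypothetical input and an in-memory buffer in place of the real CSV file:

```python
import csv
import io

# Two raw strings shaped like what get_text() returns for each headline
raw_headlines = [
    "\n\t\t\t\tJihad Report - October 7, 2016\n\t\t\t",
    "\n\t\t\t\tEducation News Roundup: May 2, 2016\n\t\t\t",
]

# str.strip() with no arguments removes all leading/trailing whitespace,
# including the \n and \t runs wrapped around each title
headlines = [h.strip() for h in raw_headlines]

# Write one cleaned headline per row; io.StringIO stands in for the real file
buf = io.StringIO()
writer = csv.writer(buf, dialect="excel")
writer.writerows([[h] for h in headlines])
print(buf.getvalue())
```

On Python 3, opening the real file as `open(path, 'w', encoding='utf-8', newline='')` sidesteps the encoding errors and the extra blank rows the `csv` module can otherwise produce on Windows.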
Answer 0 (score: 1)
You can anchor the element search to the `div` with class `"item"`:
from selenium import webdriver
from bs4 import BeautifulSoup as soup
import csv
import re  # needed for re.findall below

d = webdriver.Chrome()
d.get('https://www.lexisnexis.com/en-us/home.page')

# For each class="item" div, collect the link href plus the text of its
# <time>, <h5>, and <p> children
results = [[(lambda x: x['href'] if i == 'a' else getattr(x, 'text', None))(c.find(i))
            for i in ['a', 'time', 'h5', 'p']]
           for c in soup(d.page_source, 'html.parser').find_all('div', {'class': 'item'})]

with open('lexisNexis.csv', 'w') as f:
    write = csv.writer(f)
    write.writerows([['source', 'timestamp', 'tags', 'headline'],
                     *[re.findall(r'(?<=//www\.)\w+(?=\.com)', a) + b
                       for a, *b in results if all([a, *b])]])
Output:
source,timestamp,tags,headline
law360,04 Sep 2018,Labor & Employment Law,11th Circ. Revives Claim In Ex-Aaron's Worker FMLA Suit
law360,04 Sep 2018,Workers' Compensation,Back To School: Widener's Rod Smolla Talks Free Speech
law360,04 Sep 2018,Tax Law,Ex-Sen. Kyl Chosen To Take Over McCain's Senate Seat
law360,04 Sep 2018,Energy,Mass. Top Court Says Emission Caps Apply To Electric Cos.
lexisnexis,04 Sep 2018,Immigration Law,Suspension of Premium Processing: Another Attack On the H-1B Program (Cyrus Mehta)
law360,04 Sep 2018,Real Estate Law,Privilege Waived For Some Emails In NJ Real Estate Row
law360,04 Sep 2018,Banking & Finance,Cos. Caught Between Iran Sanctions And EU Blocking Statute
law360,04 Sep 2018,Mergers & Acquisitions,Former Paper Co. Tax VP Sues For Severance Pay After Merger
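For what it's worth, the lookaround pattern in the answer is what turns each article URL into the bare site name in the `source` column: a lookbehind for `//www.` and a lookahead for `.com` capture only the characters between them. A small standalone check (the URL is made up):

```python
import re

# Lookbehind for "//www." and lookahead for ".com" match just the site name
url = "https://www.law360.com/articles/example"
print(re.findall(r"(?<=//www\.)\w+(?=\.com)", url))  # ['law360']
```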