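I have a CSV file containing a large number of URLs with different domain extensions (.com, .eu, .org, and so on), but in Python 2.7 I only want to scrape the domains with the .nl extension. I tried to select them with
if '.nl' in row:
in the script below: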
from selenium import webdriver
import csv

fieldnames = ['Website', '@media', 'googleadservices.com/pagead/conversion']

def csv_writerheader(path):
    with open(path, 'w') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames, lineterminator='\n')
        writer.writeheader()

def csv_writer(dictdata, path):
    with open(path, 'a') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames, lineterminator='\n')
        writer.writerow(dictdata)

csv_output_file = 'output!.csv'
driver = webdriver.Chrome(executable_path=r'C:\Users\Jacob\PycharmProjects\Testing\chromedriver_win32\chromedriver.exe')
keywords = ['@media', 'googleadservices.com/pagead/conversion']

csv_writerheader(csv_output_file)

with open('top1m-edited.csv') as example_file:
    example_reader = csv.reader(example_file)
    for row in example_reader:
        # INITIALIZE DICT
        data = {'Website': row}
        if '.nl' in row:  # MAKING THE DOMAIN DISTINCTION HERE
            try:
                driver.get(row[0])
                html = driver.page_source
                for searchstring in keywords:
                    if searchstring.lower() in html.lower():
                        print (row, searchstring, 'FOUND!')
                        data[searchstring] = 'FOUND!'
                    else:
                        print (row, searchstring, 'not found')
                        data[searchstring] = 'not found'
                csv_writer(data, csv_output_file)
            except:
                pass
Printed output:
C:\Python27\python.exe "C:/Users/Jacob/PycharmProjects/Testing/fooling around 2.py"
Process finished with exit code 0
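So in this state my script basically does nothing except export a CSV file with almost no results in it.
However, when I simply leave out the line
if '.nl' in row:
the script works perfectly.
What should I change so that the script only imports and scrapes URLs with the .nl domain?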
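Answer 0 (score: 1)
In
for row in example_reader:
row is a list, so the test '.nl' in row looks for a list item that is exactly '.nl' rather than a URL that contains it. You have a few options. If the CSV file has only a single column, holding the URL, you can change: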
if '.nl' in row:
to this:
if '.nl' in row[0]:
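The difference between the two tests can be seen in isolation. The following is a minimal sketch; the sample row is hypothetical and only assumes that each CSV row holds the URL in its first column:

```python
# A row produced by csv.reader is a list of column strings, not one string.
row = ['http://www.example.nl']  # hypothetical one-column row

# List membership: is any element of the list exactly equal to '.nl'?
list_match = '.nl' in row

# Substring test: does the first column contain the text '.nl'?
string_match = '.nl' in row[0]

print(list_match)    # False
print(string_match)  # True
```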
Edit: In addition, every place where you use row as the URL needs to change to row[0], for example data = {'Website': row[0]}
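Note that a plain substring test also matches URLs that merely contain '.nl' somewhere, such as example.nl.evil.com. If that matters, a stricter check on the host name is one possible option. The sketch below is not part of the original answer: the helper name is_nl_domain and the sample lines are hypothetical, and it assumes the URL sits in the first column, with or without a scheme such as http://.

```python
import csv

def is_nl_domain(url):
    """Return True if the host part of url ends with '.nl' (hypothetical helper)."""
    if '://' in url:
        url = url.split('://', 1)[1]  # drop a scheme such as 'http://'
    host = url.split('/')[0]          # drop any path
    host = host.split(':')[0]         # drop any port
    return host.lower().endswith('.nl')

# Hypothetical sample lines standing in for the contents of the CSV file.
sample_lines = [
    'google.com',
    'http://www.nu.nl/sport',
    'example.nl.evil.com',
    'Marktplaats.NL',
]

# csv.reader accepts any iterable of lines; each row is a list of columns.
nl_urls = [row[0] for row in csv.reader(sample_lines)
           if row and is_nl_domain(row[0])]

print(nl_urls)
```

In the question's script, the same helper could replace the condition, as in if is_nl_domain(row[0]):, leaving the rest of the loop unchanged.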