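How to exclude URLs that match keywords in a list - Web Scraping (Python)

Time: 2018-10-30 11:33:50

Tags: python selenium selenium-webdriver web-scraping

I'm stuck. I'm using Selenium with Python to scrape Google search results.

I have a list of keywords that I feed into Google search and extract data for (that is what the code below does).

I also have a second, negative list that contains certain keywords as well. I want to check whether any of those keywords appear in the extracted data and, if they do, not append that data to the new list. How can I do this?

Here is my code: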

from selenium import webdriver 
from selenium.webdriver.common.by import By 
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC 
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.options import Options
import csv
import time
from itertools import groupby,chain
from operator import itemgetter
import sqlite3

final_data = []
def getresults():
    global final_data
    conn = sqlite3.connect("Jobs_data.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS naukri(id INTEGER PRIMARY KEY, KEYWORD text, LINK text,
                            CONSTRAINT number_unique UNIQUE (KEYWORD,LINK))
                            """)
    cur = conn.cursor()
    #chrome_options = Options()
    #chrome_options.add_argument("--headless")
    #chrome_options.binary_location = '/Applications/Google Chrome   Canary.app/Contents/MacOS/Google Chrome Canary'
    driver = webdriver.Chrome("./chromedriver")
    with open("./"+"terms12.csv", "r") as csvfile:
        reader = csv.reader(csvfile)
        next(reader)
        for row in reader:
            keywords = row[0]
            url = "https://www.google.co.in/search?num=10&q=" + keywords
            driver.get(url)
            time.sleep(5)
            count = 0
            links = driver.find_elements_by_class_name("g")[:3]
            for i in links:
                data = i.find_elements_by_class_name("iUh30")
                dm = negativelist("junk.csv")
                print(dm)
                for news in data:     
                    sublist = []
                    data = news.text
                    if dm in data:
                        continue
                    print("I am in exception")
                    sublist.append(keywords)
                    sublist.append(data)
                    print(sublist)
                    final_data.append(sublist)
                    cur.execute("INSERT OR IGNORE INTO naukri VALUES (NULL,?,?)",(keywords,data))

    conn.commit()                    
    return final_data

def negativelist(file):
    sublist = []
    with open("./"+file,"r") as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            _data = row[0]
            sublist.append(_data)
    return sublist

def readfile(alldata, filename):
    with open ("./"+ filename, "w",encoding="utf-8") as csvfile:
        csvfile = csv.writer(csvfile, delimiter=",")
        csvfile.writerow("")
        for i in range(0, len(alldata)):
            csvfile.writerow(alldata[i])
def main():
    getresults()
    readfile([[k, *chain.from_iterable(r for _, *r in g)] for k, g in groupby(final_data, key=itemgetter(0))], "Naukri.csv")
main()

I get this error:

Traceback (most recent call last):
  File "C:\Users\prince.bhatia\Desktop\projects\google_Rank_Chcker1\Naukri-links.py", line 72, in <module>
    main()
  File "C:\Users\prince.bhatia\Desktop\projects\google_Rank_Chcker1\Naukri-links.py", line 70, in main
    getresults()
  File "C:\Users\prince.bhatia\Desktop\projects\google_Rank_Chcker1\Naukri-links.py", line 42, in getresults
    if dm in data:
TypeError: 'in <string>' requires string as left operand, not list

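1 answer:

Answer 0: (score: 1)

First, `dm` is a list and `data` is a string, so `dm in data` raises the TypeError you see. Note that checking whether data is present in the negative keywords is completely different from checking whether any negative keyword is present in data. The first check looks like this: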

if data in dm:
    continue
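
To make the difference concrete, here is a small sketch. The keyword list and the scraped string are made-up values for illustration, not the contents of the asker's files.

```python
dm = ["wikipedia", "youtube"]            # hypothetical negative keywords
data = "https://www.youtube.com/watch"   # hypothetical scraped text

# `data in dm` is True only when the whole string equals one list entry,
# so a URL that merely contains "youtube" is not caught.
exact_match = data in dm

# `dm in data` (list on the left, string on the right) is the expression
# that raises the TypeError shown in the question.
try:
    dm in data
    raised = False
except TypeError:
    raised = True
```

So `data in dm` runs without an error, but it only filters exact matches.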

What you probably want is:

# Create a function to check if the data contains any of the negative keywords
def dataContainsNegativeKeyword(data, dm):
    for word in dm:
        if word in data:
            return True
    return False

# In your code, call that function with your data and negative keywords
if dataContainsNegativeKeyword(data, dm):
    continue
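
The same substring check can be written more compactly with the built-in `any()`. This is only a sketch: the function name `contains_negative_keyword` is hypothetical, and the case-insensitive comparison and the skipping of empty entries are assumptions that go beyond what the question asks for.

```python
def contains_negative_keyword(data, negative_keywords):
    """Return True if any non-empty negative keyword is a substring of data."""
    text = data.lower()
    # Empty strings are skipped because "" is a substring of every string.
    return any(word.lower() in text for word in negative_keywords if word)


# Hypothetical values for illustration
blocked = contains_negative_keyword("https://www.YouTube.com/watch", ["youtube", ""])
allowed = contains_negative_keyword("https://www.naukri.com/python-jobs", ["youtube", ""])
```

Skipping empty entries matters if the CSV has blank cells, since an empty keyword would otherwise filter out every result.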

Then, oddly, you append both the keyword and the data to the sublist:

 sublist.append(keywords)
 sublist.append(data)

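Perhaps what you want here is to define sublist as a dictionary and then add keywords as the key and data as the value. (keywords may be a misnomer; keyword would probably be a better name, since as far as I can tell it is just one of the dictionary's keys.)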

sublist = {}
# Rest of the code here
sublist[keywords] = data
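
One caveat: with a plain dictionary, a second result for the same keyword overwrites the first. If several links per keyword should be kept, one possible variation is a dictionary of lists. The sketch below assumes that is the goal; the name `results` and the sample pairs are hypothetical.

```python
from collections import defaultdict

# Maps each search keyword to the list of links scraped for it
results = defaultdict(list)

# Hypothetical (keyword, link) pairs standing in for the scraped data
scraped = [
    ("python jobs", "https://example.com/a"),
    ("python jobs", "https://example.com/b"),
    ("java jobs", "https://example.com/c"),
]

for keyword, link in scraped:
    results[keyword].append(link)

# One CSV row per keyword: the keyword followed by all of its links
rows = [[keyword, *links] for keyword, links in results.items()]
```

Rows built this way have the same shape as the ones the `groupby` expression in `main()` produces, without relying on equal keywords being adjacent.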

Another thing you can improve in your code is that the negative keywords are loaded on every iteration:

dm = negativelist("junk.csv")

You don't actually need to do that on every iteration; just load it once at the beginning :)
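
A minimal sketch of that restructuring follows. It uses an in-memory `io.StringIO` as a stand-in for `junk.csv`, and the file contents, the helper name `load_negative_keywords`, and the scraped strings are all assumptions made for illustration.

```python
import csv
import io


def load_negative_keywords(csvfile):
    """Read the first column of every non-empty row."""
    return [row[0] for row in csv.reader(csvfile) if row]


# Stand-in for open("junk.csv"); the contents are hypothetical
junk_csv = io.StringIO("youtube\nwikipedia\n")

# Loaded once, before any loop starts
dm = load_negative_keywords(junk_csv)

# Hypothetical scraped strings standing in for the Selenium results
scraped = [
    "https://www.naukri.com/python-jobs",
    "https://www.youtube.com/watch",
    "https://en.wikipedia.org/wiki/Python",
]

kept = []
for text in scraped:
    # Skip any result that contains a negative keyword
    if any(word in text for word in dm):
        continue
    kept.append(text)
```

In the question's code, this corresponds to moving the `negativelist("junk.csv")` call above the `for row in reader:` loop.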