Removing duplicates from the output using SQL

Time: 2015-11-19 13:07:53

Tags: python mysql duplicates web-crawler

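In my first step I scraped certain data from a web page, such as the price, the title and the deep link, and that worked. After that I managed to store the data in a database.

The problem is that the output also contains many duplicates, which I would like to remove from the database.

I seem to run into trouble when I try to delete the duplicates. Can somebody help me? What kind of SQL query do I need to remove the duplicate rows?

Thanks for any feedback :)

Here is the code: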

import sys

import mysql.connector

# soup, partner_ID and location_ID come from earlier in the crawler (not shown).
hallo = soup.find_all("article", {"class": "activity-card activity-card-horizontal "})

try:
    connection = mysql.connector.connect(
        host="localhost", user="root", passwd="", db="output")
except mysql.connector.Error:
    print("No connection to Server")
    sys.exit(1)

cursor = connection.cursor()

# Remove the rows from the previous crawl of this location and partner.
# Placeholders let the driver quote the values.
cursor.execute("DELETE FROM prices_crawled WHERE LocationID = %s AND PartnerID = %s",
               (location_ID, partner_ID))
connection.commit()

Language = "Englisch"

for item in hallo:
    # Reset per item so values from the previous article are never reused.
    header_final = price_final = deeplink_final = None

    for header in item.find_all("h3", {"class": "activity-"}):
        header_final = header.text.strip()

    for price in item.find_all("span", {"class": "price"}):
        # Drop the first two characters (presumably the currency prefix).
        price_final = price.text.strip()[2:]

    for href in set(a.get("href") for a in item.find_all("a", {"class": "title"})):
        deeplink_final = href

    print("Header: {} | Price: {} | Deeplink: {} | PartnerID: {} | LocationID: {} | Language: {}".format(
        header_final, price_final, deeplink_final, partner_ID, location_ID, Language))

    # price_id is passed as None (SQL NULL), assuming the column is AUTO_INCREMENT.
    cursor.execute(
        '''INSERT INTO prices_crawled (price_id, Header, Price, Deeplink, PartnerID, LocationID, Language)
           VALUES (%s, %s, %s, %s, %s, %s, %s)''',
        (None, header_final, price_final, deeplink_final, partner_ID, location_ID, Language))

    connection.commit()


cursor.close()
connection.close()
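
For illustration, here is a minimal sketch of one query shape that could remove the duplicates, keeping the row with the lowest price_id in each group of identical rows. Several things in it are assumptions rather than facts from the code above: that two rows count as duplicates when Header, Price, Deeplink, PartnerID, LocationID and Language all match, that price_id is a unique integer key, and the names KEY_COLUMNS, DEDUPE_SQL, remove_duplicates and demo, which are made up for this sketch. The demo runs against an in-memory SQLite database from the standard library only so that it is self-contained; the statement is written with MySQL in mind but has not been run against MySQL here.

```python
import sqlite3

# Columns that together define "the same row" (an assumption for this sketch).
KEY_COLUMNS = "Header, Price, Deeplink, PartnerID, LocationID, Language"

# Keep the lowest price_id of every group and delete all other rows.
# The extra derived table (kept_ids) is there because MySQL rejects a DELETE
# whose subquery reads directly from the table being deleted from.
DEDUPE_SQL = """
DELETE FROM prices_crawled
WHERE price_id NOT IN (
    SELECT min_id FROM (
        SELECT MIN(price_id) AS min_id
        FROM prices_crawled
        GROUP BY {cols}
    ) AS kept_ids
)
""".format(cols=KEY_COLUMNS)


def remove_duplicates(connection):
    """Run the dedupe statement and return the number of deleted rows."""
    cursor = connection.cursor()
    cursor.execute(DEDUPE_SQL)
    deleted = cursor.rowcount
    connection.commit()
    cursor.close()
    return deleted


def demo():
    """Build a throwaway SQLite table with duplicates and clean it up."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE prices_crawled (
               price_id INTEGER PRIMARY KEY AUTOINCREMENT,
               Header TEXT, Price TEXT, Deeplink TEXT,
               PartnerID INTEGER, LocationID INTEGER, Language TEXT)""")
    rows = [
        ("Tour A", "10.00", "/a", 1, 2, "Englisch"),
        ("Tour A", "10.00", "/a", 1, 2, "Englisch"),  # duplicate of row 1
        ("Tour B", "20.00", "/b", 1, 2, "Englisch"),
        ("Tour A", "10.00", "/a", 1, 2, "Englisch"),  # duplicate of row 1
    ]
    conn.executemany(
        """INSERT INTO prices_crawled
               (Header, Price, Deeplink, PartnerID, LocationID, Language)
           VALUES (?, ?, ?, ?, ?, ?)""", rows)
    conn.commit()

    deleted = remove_duplicates(conn)
    remaining = conn.execute(
        "SELECT price_id, Header FROM prices_crawled ORDER BY price_id").fetchall()
    conn.close()
    return deleted, remaining


if __name__ == "__main__":
    print(demo())
```

With the crawler above, remove_duplicates(connection) would be called once after the insert loop, before the connection is closed. If duplicates should not be stored in the first place, a UNIQUE index on the columns that identify a row, combined with INSERT IGNORE, might be an alternative worth trying in MySQL.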

0 answers:

No answers