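As a first step, I scraped some data from a web page, such as the price, the title, and the deep link, and that part worked. After that, I managed to store it in a database.
The problem is that the result also contains many duplicates, which I would like to remove from the database.
When I try to delete the duplicates, I run into problems. Can anyone help me? What kind of SQL query do I need to delete the duplicate rows?
Thanks for any feedback :)
Here is the code: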
import sys
import mysql.connector

# soup, partner_ID and location_ID are defined earlier in the script (not shown).
hallo = soup.find_all("article", {"class": "activity-card activity-card-horizontal "})

try:
    connection = mysql.connector.connect(
        host="localhost", user="root", passwd="", db="output")
except mysql.connector.Error:
    print("No connection to Server")
    sys.exit(1)

cursor = connection.cursor()

# Remove the rows left over from the previous crawl of this location and partner.
cursor.execute("DELETE FROM prices_crawled WHERE LocationID = %s AND PartnerID = %s",
               (location_ID, partner_ID))
connection.commit()

for item in hallo:
    # Each inner loop keeps only the last match found in the card. If a card has
    # no match, the value from the previous card is silently reused.
    headers = item.find_all("h3", {"class": "activity-"})
    for header in headers:
        header_final = header.text.strip()

    prices = item.find_all("span", {"class": "price"})
    for price in prices:
        price_final = price.text.strip()[2:]  # drop the two leading characters

    deeplinks = item.find_all("a", {"class": "title"})
    for t in set(t.get("href") for t in deeplinks):
        deeplink_final = t

    Language = "Englisch"

    print("Header: " + header_final + " | " + "Price: " + str(price_final) + " | " + "Deeplink: " + deeplink_final + " | " + "PartnerID: " + str(partner_ID) + " | " + "LocationID: " + str(location_ID) + " | " + "Language: " + Language)

    # None (not the string 'None') lets MySQL fill price_id, assuming it is AUTO_INCREMENT.
    cursor.execute('''INSERT INTO prices_crawled (price_id, Header, Price, Deeplink, PartnerID, LocationID, Language)
                      VALUES (%s, %s, %s, %s, %s, %s, %s)''',
                   [None, header_final, price_final, deeplink_final, partner_ID, location_ID, Language])
    connection.commit()

cursor.close()
connection.close()
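
To make the question concrete, here is a minimal sketch of two ways the duplicates could be handled. It is not a tested fix for the code above. It assumes that a "duplicate" means a row whose Header, Price, Deeplink, PartnerID, LocationID and Language all match an earlier row, and that price_id is an auto-incrementing key. To stay self-contained it uses Python's built-in sqlite3 with an in-memory table as a stand-in for the MySQL table. The helper names dedupe_rows and delete_duplicates and the sample rows are made up for illustration.

```python
import sqlite3

# Columns that together define a "duplicate" (an assumption, see above).
COLUMNS = ("Header", "Price", "Deeplink", "PartnerID", "LocationID", "Language")


def dedupe_rows(rows):
    """Drop repeated rows before inserting, keeping the first occurrence in order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def delete_duplicates(conn):
    """Delete rows that repeat an earlier row, keeping the lowest price_id of each group."""
    cols = ", ".join(COLUMNS)
    cur = conn.execute(
        "DELETE FROM prices_crawled WHERE price_id NOT IN "
        f"(SELECT MIN(price_id) FROM prices_crawled GROUP BY {cols})"
    )
    conn.commit()
    return cur.rowcount  # number of rows removed


# Demo on an in-memory SQLite table with made-up sample rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE prices_crawled ("
    "price_id INTEGER PRIMARY KEY AUTOINCREMENT, Header TEXT, Price TEXT, "
    "Deeplink TEXT, PartnerID INTEGER, LocationID INTEGER, Language TEXT)"
)
rows = [
    ("City Tour", "25.00", "/tour/1", 1, 7, "Englisch"),
    ("City Tour", "25.00", "/tour/1", 1, 7, "Englisch"),  # exact duplicate
    ("Boat Trip", "40.00", "/tour/2", 1, 7, "Englisch"),
]
conn.executemany(
    "INSERT INTO prices_crawled "
    "(Header, Price, Deeplink, PartnerID, LocationID, Language) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    rows,
)
conn.commit()

removed = delete_duplicates(conn)
remaining = conn.execute("SELECT COUNT(*) FROM prices_crawled").fetchone()[0]
print(removed, remaining)
```

Note that MySQL rejects a DELETE whose subquery selects directly from the table being deleted from, so the NOT IN form above would need the subquery wrapped in a derived table there. A common MySQL alternative is a self-join such as DELETE p1 FROM prices_crawled p1 JOIN prices_crawled p2 ON p1.Header = p2.Header AND p1.Deeplink = p2.Deeplink AND ... WHERE p1.price_id > p2.price_id, with the remaining columns added to the ON clause. Filtering the scraped rows with something like dedupe_rows before the INSERT avoids writing the duplicates in the first place.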