Question

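I have a simple project: scraping reviews from a travel website and storing them in an Excel file. The reviews can be in Spanish, Japanese, or any other language, and they sometimes contain special symbols such as "❤❤".

I need to store all of the data (special symbols can be dropped if they cannot be written).

I can scrape the data I want and print it to the console (Japanese text included), but writing it to a CSV file fails with the error message shown below.

I tried opening the file with utf-8 encoding (as suggested in the comments below), but then the data is saved as strange symbols that make no sense, and I could not find an answer to the problem. Any suggestions?

I am using Python 3.5.3.

My Python code: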

from selenium import webdriver
from bs4 import BeautifulSoup
import time
import re

file = "TajMahalSpanish.csv"
f = open(file, "w")
headers = "rating, title, review\n"
f.write(headers)

pages = 119
pageNumber = 2
option = webdriver.ChromeOptions()
option.add_argument("--incognito")

browser = webdriver.Chrome(executable_path='C:\Program Files\JetBrains\PyCharm Community Edition 2017.1.5\chrome webdriver\chromedriver', chrome_options=option)

browser.get("https://www.tripadvisor.in/Attraction_Review-g297683-d317329-Reviews-Taj_Mahal-Agra_Agra_District_Uttar_Pradesh.html")
time.sleep(10)
browser.find_element_by_xpath('//*[@id="taplc_location_review_filter_controls_0_form"]/div[4]/ul/li[5]/a').click()
time.sleep(5)
browser.find_element_by_xpath('//*[@id="BODY_BLOCK_JQUERY_REFLOW"]/span/div[1]/div/form/ul/li[2]/label').click()
time.sleep(5)

while (pages):
    html = browser.page_source
    soup = BeautifulSoup(html, "html.parser")
    containers = soup.find_all("div",{"class":"innerBubble"})

    showMore = soup.find("span", {"onclick": "widgetEvCall('handlers.clickExpand',event,this);"})
    if showMore:
        browser.find_element_by_xpath("//span[@onclick=\"widgetEvCall('handlers.clickExpand',event,this);\"]").click()
        time.sleep(3)
        html = browser.page_source
        soup = BeautifulSoup(html, "html.parser")
        containers = soup.find_all("div", {"class": "innerBubble"})
        showMore = False

    for container in containers:
        bubble = container.div.div.span["class"][1]
        title = container.div.find("div", {"class": "quote"}).a.span.text
        review = container.find("p", {"class": "partial_entry"}).text
        f.write(bubble + "," + title.replace(",", "|").replace("\n", "...") + "," + review.replace(",", "|").replace("\n", "...") + "\n")
        print(bubble)
        print(title)
        print(review)
    browser.find_element_by_xpath("//div[@class='ppr_rup ppr_priv_location_reviews_list']//div[@class='pageNumbers']/span[@data-page-number='" + str(pageNumber) + "']").click()
    time.sleep(5)
    pages -= 1
    pageNumber += 1

f.close()

I get the following error:

Traceback (most recent call last):
  File "C:/Users/Akshit/Documents/pycharmProjects/spanish.py", line 45, in <module>
    f.write(bubble + "," + title.replace(",", "|").replace("\n", "...") + "," + review.replace(",", "|").replace("\n", "...") + "\n")
  File "C:\Users\Akshit\AppData\Local\Programs\Python\Python35\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 10-18: character maps to <undefined>

Process finished with exit code 1

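Update

I am still trying to work around this problem. Eventually I also need to translate the Japanese reviews into English for my research, so maybe I can use one of the Google APIs to convert the strings in the code itself and then write them to the CSV file.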

Answer 1

UPDATE

Found the solution in

Is it possible to force Excel recognize UTF-8 CSV files automatically?

as suggested by @MaartenFabré in the comments.

From what I understood, the problem is that Excel has trouble reading CSV files encoded in UTF-8, so when I open the CSV file (created by Python) directly in Excel, all the data appears corrupted.
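To illustrate both symptoms, here is a minimal sketch. It assumes the legacy code page involved is cp1252 (the one named in the traceback) and that Excel falls back to it when nothing marks the file as UTF-8; the sample strings are made up for the demonstration.

```python
# Sample text standing in for a scraped review (hypothetical data).
text = "España ❤"

# Symptom 1: the UnicodeEncodeError. cp1252 has no byte for the heart
# symbol, so encoding with the default Windows code page fails.
try:
    text.encode("cp1252")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True
print("cp1252 encode failed:", encode_failed)

# Symptom 2: the "strange symbols". The file holds valid UTF-8 bytes,
# but a reader that assumes cp1252 turns each multi-byte character
# into several unrelated ones.
utf8_bytes = "España".encode("utf-8")
garbled = utf8_bytes.decode("cp1252")
print(garbled)
```

So the data written with utf-8 is not actually damaged; it is only being decoded with the wrong encoding when the file is opened.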

The solution is:

1. Save the data from Python to a text file instead of a CSV file.
2. Open Excel.
3. Go to "Import External Data" and import from the text file.
4. Select "Delimited" as the file type and "65001 : Unicode (UTF-8)" as the file origin.
5. Select "," as the delimiter (or whichever delimiter you chose) and import.

The data is then shown correctly in Excel, in proper rows and columns, for every language: Japanese, Spanish, French, etc.
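As an alternative to the manual import, the file can be written with a UTF-8 byte order mark, which is the approach discussed in the linked question. The sketch below is not what I originally did: the helper name write_reviews and the sample rows are hypothetical, and whether Excel picks up the BOM on double-click may depend on the Excel version.

```python
import csv
import os
import tempfile


def write_reviews(path, rows):
    """Write (rating, title, review) rows as a UTF-8 CSV with a BOM."""
    # "utf-8-sig" prepends the bytes EF BB BF, which Excel can use to
    # detect UTF-8; newline="" lets the csv module control line endings.
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["rating", "title", "review"])
        # csv.writer quotes fields containing commas or newlines, so the
        # manual replace(",", "|") in the question is not needed.
        writer.writerows(rows)


# Hypothetical sample rows standing in for scraped reviews.
sample_rows = [
    ("bubble_50", "Increíble", "Un lugar precioso, sin duda ❤❤"),
    ("bubble_40", "素晴らしい", "とても美しい場所でした"),
]

out_path = os.path.join(tempfile.mkdtemp(), "reviews.csv")
write_reviews(out_path, sample_rows)

# Check that the file really starts with the UTF-8 BOM.
with open(out_path, "rb") as f:
    raw = f.read()
print(raw[:3])  # b'\xef\xbb\xbf'
```

In the scraper from the question, this would mean opening the file once with encoding="utf-8-sig" and calling writer.writerow([bubble, title, review]) inside the loop instead of building the line by hand.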

Thanks again to @MaartenFabré for the help!
