How do I store non-English strings in an Excel file with Python 3?

Asked: 2017-08-03 13:05:11

Tags: python python-3.x selenium web-scraping beautifulsoup

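I have a simple project: scrape reviews from a travel website and store them in an Excel file. The reviews can be in Spanish, Japanese, or any other language, and they sometimes contain special symbols such as "❤❤".

I need to store all of the data (the special symbols can be dropped if they cannot be written).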

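I can scrape the data I want and print it to the console (Japanese text, for example), but writing it to a CSV file fails with the error message shown below.

I tried opening the file with utf-8 encoding (as suggested in the comments below), but then the data is saved as strange symbols that make no sense, and I could not find an answer to this problem. Any suggestions?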

I am using Python 3.5.3.

My Python code:

from selenium import webdriver
from bs4 import BeautifulSoup
import time
import re

file = "TajMahalSpanish.csv"
f = open(file, "w")
headers = "rating, title, review\n"
f.write(headers)

pages = 119
pageNumber = 2
option = webdriver.ChromeOptions()
option.add_argument("--incognito")

browser = webdriver.Chrome(executable_path='C:\Program Files\JetBrains\PyCharm Community Edition 2017.1.5\chrome webdriver\chromedriver', chrome_options=option)

browser.get("https://www.tripadvisor.in/Attraction_Review-g297683-d317329-Reviews-Taj_Mahal-Agra_Agra_District_Uttar_Pradesh.html")
time.sleep(10)
browser.find_element_by_xpath('//*[@id="taplc_location_review_filter_controls_0_form"]/div[4]/ul/li[5]/a').click()
time.sleep(5)
browser.find_element_by_xpath('//*[@id="BODY_BLOCK_JQUERY_REFLOW"]/span/div[1]/div/form/ul/li[2]/label').click()
time.sleep(5)

while (pages):
    html = browser.page_source
    soup = BeautifulSoup(html, "html.parser")
    containers = soup.find_all("div",{"class":"innerBubble"})

    showMore = soup.find("span", {"onclick": "widgetEvCall('handlers.clickExpand',event,this);"})
    if showMore:
        browser.find_element_by_xpath("//span[@onclick=\"widgetEvCall('handlers.clickExpand',event,this);\"]").click()
        time.sleep(3)
        html = browser.page_source
        soup = BeautifulSoup(html, "html.parser")
        containers = soup.find_all("div", {"class": "innerBubble"})
        showMore = False

    for container in containers:
        bubble = container.div.div.span["class"][1]
        title = container.div.find("div", {"class": "quote"}).a.span.text
        review = container.find("p", {"class": "partial_entry"}).text
        f.write(bubble + "," + title.replace(",", "|").replace("\n", "...") + "," + review.replace(",", "|").replace("\n", "...") + "\n")
        print(bubble)
        print(title)
        print(review)
    browser.find_element_by_xpath("//div[@class='ppr_rup ppr_priv_location_reviews_list']//div[@class='pageNumbers']/span[@data-page-number='" + str(pageNumber) + "']").click()
    time.sleep(5)
    pages -= 1
    pageNumber += 1

f.close()

I get the following error:

Traceback (most recent call last):
  File "C:/Users/Akshit/Documents/pycharmProjects/spanish.py", line 45, in <module>
    f.write(bubble + "," + title.replace(",", "|").replace("\n", "...") + "," + review.replace(",", "|").replace("\n", "...") + "\n")
  File "C:\Users\Akshit\AppData\Local\Programs\Python\Python35\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 10-18: character maps to <undefined>

Process finished with exit code 1
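
For context, the traceback points at `cp1252.py`, which suggests the file was opened with the Windows default encoding rather than UTF-8. The following is a minimal sketch that reproduces the same kind of failure without Selenium; the sample string and the helper name `can_encode` are made up for illustration.

```python
# Minimal reproduction, assuming the platform default file encoding is cp1252
# (as the traceback suggests). The sample string is illustrative only.
sample = "タージ・マハル ❤❤"


def can_encode(text, encoding):
    """Return True if every character of text can be encoded with encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(can_encode(sample, "cp1252"))  # False: cp1252 has no Japanese characters or hearts
print(can_encode(sample, "utf-8"))   # True: UTF-8 can represent any Unicode text
```

Passing `encoding="utf-8"` to `open()` removes the dependence on the platform default, which is why the write itself succeeds once the encoding is given explicitly.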

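Update

I am trying some workarounds for this problem. In the end I also need to translate the Japanese reviews into English for my research, so I could perhaps use one of the Google APIs to translate each string in the code itself and then write the result to the CSV file.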

1 answer:

Answer 0 (score: 0)

UPDATE

Found the solution in

Is it possible to force Excel recognize UTF-8 CSV files automatically?

as suggested by @MaartenFabré in the comments.

From what I understood, the problem is that Excel has trouble reading a CSV file encoded as UTF-8, so when I open the CSV file (created by Python) directly in Excel, all the data looks corrupted.
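
One approach often suggested for this situation is to write the CSV with a UTF-8 byte order mark (BOM), which many Excel versions use as a hint that the file is UTF-8. The sketch below is not the method used in this answer; the function name, file name, and sample rows are hypothetical, and whether Excel honours the BOM depends on the version and platform.

```python
import csv
import os
import tempfile

# Hypothetical sample rows standing in for scraped reviews.
rows = [
    ("bubble_50", "素晴らしい", "とても美しい場所でした ❤❤"),
    ("bubble_40", "Impresionante", "Un lugar increíble, muy recomendable"),
]


def write_reviews_csv(path, rows):
    """Write rows to a CSV file encoded as UTF-8 with a leading BOM."""
    # "utf-8-sig" prepends the BOM bytes EF BB BF. newline="" lets the csv
    # module control line endings, as its documentation recommends.
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["rating", "title", "review"])
        writer.writerows(rows)


path = os.path.join(tempfile.gettempdir(), "reviews_utf8_bom.csv")
write_reviews_csv(path, rows)
```

Using `csv.writer` also quotes fields that contain commas or newlines, so the manual `replace(",", "|")` calls from the question would no longer be needed.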

The solution was:

  1. In Python, save the data to a text file instead of a CSV file.
  2. Open Excel.
  3. Go to "import external data" and import from the text file.
  4. Select "delimited" as the file type and "65001: Unicode (UTF-8)" as the file origin.
  5. Select "," (or whichever delimiter you used) as the delimiter and import.
  6. The data is then shown correctly in Excel, in the proper rows and columns, for every language: Japanese, Spanish, French, etc.
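
Step 1 above could look roughly like the sketch below. It assumes the rows have already been scraped; `save_reviews_txt`, the file name, and the sample data are hypothetical, and the delimiter passed here has to match the one chosen in step 5.

```python
import csv
import os
import tempfile


def save_reviews_txt(path, rows, delimiter=","):
    """Save rows as delimited UTF-8 text for Excel's text import."""
    # An explicit encoding avoids the cp1252 default seen in the traceback.
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter=delimiter)
        writer.writerow(["rating", "title", "review"])
        writer.writerows(rows)


# Hypothetical sample data; the real rows would come from the scraping loop.
sample_rows = [
    ("bubble_50", "Magnifique", "Un monument à voir absolument"),
    ("bubble_50", "最高", "朝早く行くのがおすすめです"),
]
txt_path = os.path.join(tempfile.gettempdir(), "TajMahalReviews.txt")
save_reviews_txt(txt_path, sample_rows, delimiter="\t")
```

A tab delimiter is used in the example call because review text rarely contains tabs; a comma works as well, since `csv.writer` quotes any field that contains the delimiter.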

Thanks again to @MaartenFabré for the help!