Question

Consider:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import json

import unicodecsv as csv
import pandas as pd
tweets_data = []
tweets_file = open('tweets.txt', "r")

for line in tweets_file:
    try:
        tweet = json.loads(line)

        tweets_data.append(tweet)
    except:
        continue
tweets_file1 = open('tweets.csv', "wb")
tweets_file_writer = csv.writer(tweets_file1, encoding='utf-8')
tweets_file_writer.writerow(['location', 'time', 'user_id', 'text', 'hashtags', 'user_mentions'])
for i in tweets_data:
    location = unicode(i[u'user'][u'location']).encode('utf-8')
    time = unicode(i[u'created_at']).encode('utf-8')
    user_id = unicode(i[u'user'][u'id']).encode('utf-8')
    text = unicode(i[u'text']).encode('utf-8')
    hashtag = i[u'entities'][u'hashtags']
    hashtags = []
    for j in hashtag:
        print j[u'text']
        hashtags.append(u''.join(j[u'text']).encode('utf-8'))


    mention = i[u'entities'][u'user_mentions']
    mentions = []
    for j in mention:
        mentions.append(unicode(j[u'screen_name']).encode('utf-8'))

    tweets_file_writer.writerow([location, time, user_id, text,  hashtags, mentions])
tweets_file1.close()

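I wrote this code to process some Arabic data that I scraped with tweepy.

My problem is with the line tweets_file_writer.writerow([location, time, user_id, text, hashtags, mentions]). When the list of hashtags is written, it does not appear in Arabic, although all the other data displays correctly.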

Example:

In the CSV file, I need to write a list of hashtags such as:

['مجلة_النجوم2', 'سهيله_بن_لشهب', 'souhilabenlachhab']

but it comes out looking like this:

['\xd9\x85\xd8\xac\xd9\x84\xd8\xa9_\xd8\xa7\xd9\x84\xd9\x86\xd8\xac\xd9\x88\xd9\x852', '\xd8\xb3\xd9\x87\xd9\x8a\xd9\x84\xd9\x87_\xd8\xa8\xd9\x86_\xd9\x84\xd8\xb4\xd9\x87\xd8\xa8', 'souhilabenlachhab']

Answer 1

You need to open the output file as a UTF-8 encoded file before you try to write Arabic to it, so:

tweets_file1 = open("tweets.csv", "wb")

should be:

import codecs
tweets_file1 = codecs.open("tweets.csv", "wb", "utf-8")
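
Separately from how the file is opened, the escaped bytes in the hashtags column may come from the list itself: a csv writer turns any cell that is not already a string into text with str(), and str() of a list uses the repr() of each element, which shows encoded byte strings as \x escapes. The sketch below reproduces that effect and shows one possible workaround, joining the hashtags into a single text cell before calling writerow. The helper name `join_tags` and the space separator are assumptions made for illustration, not part of the original code.

```python
# -*- coding: utf-8 -*-
hashtags = [u'مجلة_النجوم2', u'souhilabenlachhab']

# Encoding each element and then stringifying the whole list reproduces
# the symptom: the list's text form contains \x escapes, not Arabic.
encoded = [tag.encode('utf-8') for tag in hashtags]
as_cell = str(encoded)


def join_tags(tags, sep=u' '):
    """Hypothetical helper: collapse a list of text tags into one text cell."""
    return sep.join(tags)


# A single text value can be encoded by the writer like any other column.
cell = join_tags(hashtags)
```

Passing `cell` (rather than the list) as the hashtags column means the writer receives plain text, the same as for the location and text columns.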

Also, as others have mentioned, once you move off Python 2, working with Arabic text is much easier in Python 3!
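
As a rough illustration of the Python 3 route, the standard csv module accepts text directly and the encoding is chosen once, when the file is opened. This is only a sketch under assumptions: it expects the same tweets.txt layout as in the question (one JSON object per line), the function name `tweets_to_csv` is hypothetical, and joining the hashtags and mentions with a space is one choice among several.

```python
import csv
import json


def tweets_to_csv(src_path, dst_path):
    """Hypothetical helper: convert one-JSON-object-per-line tweets to a UTF-8 CSV."""
    # newline='' is what the csv module documentation recommends for files it writes.
    with open(src_path, encoding='utf-8') as src, \
            open(dst_path, 'w', encoding='utf-8', newline='') as dst:
        writer = csv.writer(dst)
        writer.writerow(['location', 'time', 'user_id', 'text',
                         'hashtags', 'user_mentions'])
        for line in src:
            try:
                tweet = json.loads(line)
            except ValueError:
                continue  # skip blank or malformed lines
            entities = tweet['entities']
            writer.writerow([
                tweet['user']['location'],
                tweet['created_at'],
                tweet['user']['id'],
                tweet['text'],
                # Join each list into one text cell (the separator is an assumption).
                ' '.join(h['text'] for h in entities['hashtags']),
                ' '.join(m['screen_name'] for m in entities['user_mentions']),
            ])
```

A call such as `tweets_to_csv('tweets.txt', 'tweets.csv')` would then replace the whole script, with no per-field encode calls.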

Writing Arabic in a CSV file using Python 2.7 {CSV}
