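Parsing tweets into tab-separated CSV when the tweet text contains tabs: how do I keep the text in a single column?

Time: 2014-05-12 14:21:38

Tags: python json csv twitter

I wrote a simple Python script to parse Twitter data. However, I have run into a problem: some users put what appears to be a tab character in their tweets, and my script's output then treats the rest of the text as a new column. What is the best way in Python to force the tweet text to stay entirely within one column?

**Example**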

465853965351927808  AhmedAlKhalifa_ RT @Milanello: Another photo of how Casa Milan looks now after the color treatment is done:
#ForzaMilan http://t.co/p8YaBXpgj1
465853965142597633  AlySnodgrass    RT @LJSanders88: Who's ready for the new reality tv show: "Late Night Shenanigans!" Starring- Ozark Seniors and co-starring- the Law Enfo…
465853965289422849  amandafaang oh i see! we should meet up w all the bx-ians soon haha — yess! http://t.co/Isdg7hjYbV
465853964786089985  isla_galloway_x RT @fuxkchan: Tomorrowland is defo on the bucket list
465853965515493376  usptz   7 o'clock in the morning
465853965385482240  Orapinploy  RT @FolkFunFine: I want to see the blue sky
465853965297790976  Khansheeren My answer to What on the internet made you smile today? http://t.co/TQKBJeOx4b
465853965150998528  khenDict    Ah almost left the house without seeing khaya...ah guys warn me next time!!!!
#YOUTVLIVE
#YOUTVLIVE
#YOUTVLIVE
465853965310382080  1987Lukyanova   Мое новое достижение `Больш...`. Попробуй превзойти меня в The Tribez для #Android! http://t.co/HWEQQloFWB #androidgames, #gameinsight

Code:

import json
import sys

def main():

    for line in sys.stdin:
        line = line.strip()

        data = []

        try:
            data.append(json.loads(line))
        except ValueError as detail:
            continue

        for tweet in data:

            ## deletes any rate limited data
            if tweet.has_key('limit'):
                pass

            else:
                print "\t".join([
                tweet['id_str'],
                tweet['user']['screen_name'],
                tweet['text']
                ]).encode('utf8')

if __name__ == '__main__':
    main()
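
To make the failure concrete, here is a minimal reproduction. It is only a sketch: it assumes Python 3, and the tweet (ID, screen name, and text) is made up rather than taken from the data above. Joining the fields with "\t" by hand and then splitting the line again yields four columns instead of three, because the tab inside the text cannot be told apart from a delimiter.

```python
# Hypothetical tweet whose text contains a literal tab character.
tweet = {
    "id_str": "1",
    "user": {"screen_name": "example_user"},
    "text": "first part\tsecond part",
}

# Same approach as the script above: join the fields with tabs by hand.
line = "\t".join([tweet["id_str"], tweet["user"]["screen_name"], tweet["text"]])

# Any reader that splits on tabs now sees an extra column.
columns = line.split("\t")
print(len(columns))  # 4, not the intended 3
```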

1 answer:

Answer 0: (score: 0)

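Instead of generating the TSV file by hand, use the csv module, which quotes any field that contains a literal tab (or newline) for you. Python 2's csv module does not support Unicode input, so encode each field as UTF-8 before handing it to the writer.

import csv
import json
import sys

def main():

    # csv.writer quotes fields that contain the delimiter or a newline
    writer = csv.writer(sys.stdout, delimiter="\t")
    for line in sys.stdin:
        line = line.strip()

        try:
            tweet = json.loads(line)
        except ValueError:
            # skip blank or malformed lines
            continue

        # skip rate-limit notices, which carry no tweet data
        if 'limit' in tweet:
            continue

        # Python 2's csv module expects byte strings, so encode explicitly
        writer.writerow([
            tweet['id_str'].encode('utf8'),
            tweet['user']['screen_name'].encode('utf8'),
            tweet['text'].encode('utf8')
        ])

if __name__ == '__main__':
    main()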
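
For readers on Python 3, the same idea might be sketched as follows. This is an illustration under stated assumptions, not the answerer's code: the helper name `write_tweets` and the sample records are hypothetical. In Python 3 the csv module works on text directly, so no explicit encoding step is needed. The sketch also reads the output back with `csv.reader` to show that a tweet containing both a tab and a newline still comes back as a single field.

```python
import csv
import io
import json


def write_tweets(lines, out):
    """Write id, screen name, and text of each JSON tweet in `lines` as TSV.

    `write_tweets` is a hypothetical helper, not part of any library.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for line in lines:
        line = line.strip()
        try:
            tweet = json.loads(line)
        except ValueError:
            continue  # skip blank or malformed lines
        if not isinstance(tweet, dict) or "limit" in tweet:
            continue  # skip rate-limit notices and non-object lines
        writer.writerow([
            tweet["id_str"],
            tweet["user"]["screen_name"],
            tweet["text"],
        ])


# Made-up input: one tweet with a tab and a newline, one rate-limit notice.
sample = [
    json.dumps({"id_str": "1", "user": {"screen_name": "example_user"},
                "text": "first\tsecond\nthird"}),
    json.dumps({"limit": {"track": 5}}),
]

buffer = io.StringIO()
write_tweets(sample, buffer)

# Reading the TSV back with the same delimiter restores exactly three columns.
rows = list(csv.reader(io.StringIO(buffer.getvalue()), delimiter="\t"))
print(rows)
```

As a script, this would be called as `write_tweets(sys.stdin, sys.stdout)`. Whatever consumes the file has to read it with a CSV-aware parser using the same delimiter; plain splitting on tabs would still miscount the columns, since the quoting only protects the field for readers that understand it.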