Python: Quoting strings in multiple CSV files and merging the files together

Time: 2019-03-27 03:31:41

Tags: python python-2.7 csv merge nlp

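I have a directory of about 600 CSV files containing Twitter data, with multiple fields of various types (integers, floats and strings). I have a script that merges the files together, but the string fields can themselves contain commas and are not quoted, which causes the string fields to split apart and pushes text onto new lines. Is it possible to quote the strings in each file and then merge them into a single file? Below are the script I use to merge the files and some sample data.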

Merge script:

%%time
import csv
import glob
from tqdm import tqdm

with open('C:\Python\Scripts\Test_tweets\Test_output.csv', 'wb') as f_output:
    csv_output = csv.writer(f_output, quoting=csv.QUOTE_NONNUMERIC)
    write_header = True

    for filename in tqdm(glob.glob(r'C:\Python\Scripts\Test_tweets\*.csv')):
        with open(filename, 'rb') as f_input:
            csv_input = csv.reader(f_input)
            header = next(csv_input)

            if write_header:
                csv_output.writerow(header)
                write_header = False

            for row in tqdm(csv_input):
                row = row[:7] + [','.join(row[7:])]

                # Skip rows with insufficient values                
                if len(row) > 7:
                    row[1] = float(row[1])
                    row[5] = float(row[5])
                    row[6] = float(row[6])
                    csv_output.writerow(row)

Sample data:

2014-02-07T00:25:40Z,431584511542198272,FalseAlarm_xox,en,-,-81.4994315,35.3268904,is still get hair done,Is Still Getting Hair Done
2014-02-07T00:25:40Z,431584511525003265,enabrkovic,en,-,-85.40364208,40.19369368,i had no class todai why did i wait 630 to start do everyth,I had no classes today why did I wait  630 to start doing EVERYTHING
2014-02-07T00:25:41Z,431584515757457408,_beacl,pt,-,-48.05338676,-16.02483911,passei o dia com o meu amor comemo demai <3 @guugaraujo,passei o dia com o meu amor, comemos demais ❤️ @guugaraujo
2014-02-07T00:25:42Z,431584519930396672,aprihasanah,in,-,106.9224971,-6.2441371,4 hari ngga ada kepsek rasanya nyaman bgt kerjaan juga lebih teratur tp skalinya doi masuk administrasi kacau balau lg yanasib,4 hari ngga ada kepsek rasanya nyaman bgt. kerjaan juga lebih teratur. tp skalinya doi masuk, administrasi kacau balau lg. yanasib &gt;_&lt;"
2014-02-07T00:25:42Z,431584519951749120,MLEFFin_awesome,en,-,-77.20315866,39.08811105,never a dull moment with emma <3 /MLEFFin_awesome/status/431584519951749120/photo/1,Never a dull moment with Emma  /0Wfs5VqfVz
2014-02-07T00:25:43Z,431584524120510464,mimiey_natasya,en,-,103.3596089,3.9210196,good morn,Good morning...
2014-02-07T00:25:43Z,431584524124684288,louykins,en,-,-86.06823257,41.74938946,that Oikos commerci with @johnstamos @bobsaget and @davecoulier is better than my whole life #takesmeback #youcankissmeanytimejohn,That Oikos commercial with @JohnStamos, @bobsaget, and @DaveCoulier is better than my whole life. #takesmeback #youcankissmeanytimejohn
2014-02-07T00:25:44Z,431584528306421760,savannachristy4,en,-,-79.99920285,39.65367864,rememb when we would go to club zoo :D,Remember when we would go to club zoo??
2014-02-07T00:25:44Z,431584528302231553,janiya_monet,en,-,-83.62028684,39.20591822,@itscourtney_365 thei call,@ItsCourtney_365 they. Called.
2014-02-07T00:25:44Z,431584528302223360,norastanky,en,-,-118.09849064,33.79394737,when you see your hometown in your english book /norastanky/status/431584528302223360/photo/1,When you see your hometown in your english book&gt;&gt; /XHRFymLFp4
2014-02-07T00:25:46Z,431584536703799296,Ericb1980,en,-,-82.32639648,27.92373599,i'm at longhorn steakhouse brandon fl .com/1bzZsrp,I'm at LongHorn Steakhouse (Brandon, FL) /YdCJKXmSmN
2014-02-07T00:25:46Z,431584536695410688,repokempt,en,-,37.40298473,55.96248794,@tonichopchop moron drive me nut,@tonichopchop Morons. Drives me nuts!
2014-02-07T00:25:47Z,431584540889317377,BeeNiabee6,en,-,-82.494139,27.4908062,my god sister got drink,My God sister got drinking
2014-02-08T00:00:01Z,4.3194E+17,NewarkWeather,in,-,-75.68444444,39.695,02 07 @19 00 temp 31.0 f wc 31.0 f wind 0.0 mph gust 0.0 mph bar 30.358 in rise rain 0.00 in hum 68 uv 0.0 solarrad 0,02/07@19:00 - Temp 31.0F, WC 31.0F. Wind 0.0mph ---, Gust 0.0mph. Bar 30.358in, Rising. Rain 0.00in. Hum 68%. UV 0.0. SolarRad 0.,,,,,,,,,,,,,,
2014-02-08T00:00:02Z,4.3194E+17,bastianwr,in,-,106.11073,-2.1198,happi weekend at sman 1 pangkalpinang https://path.com/p/1zjYtB,Happy Weekend! (at SMAN 1 Pangkalpinang) — /9U86N1BmD6,,,,,,,,,,,,,,,,,
2014-02-08T00:00:03Z,4.3194E+17,izaklast,en,-,-109.9176369,31.40244847,dihydrogen monoxid is good for you Watermill express .com/1bxHT81,Dihydrogen monoxide is good for you (@ Watermill Express) /IvfiuNHigM,,,,,,,,,,,,,,,,,
2014-02-08T00:00:03Z,4.3194E+17,blackbestpeople,tr,-,29.21950004,40.91441821,okulda özlediyim sadec kantindeki kakayolu süd,Okulda özlediyim sadece kantindeki kakayolu süd,,,,,,,,,,,,,,,,,
2014-02-08T00:00:03Z,4.3194E+17,Hakooo03,tr,-,3.72651687,51.06650946,gta v oynar katliam cikartirim bend,Gta v oynar katliam cikartirim bende !,,,,,,,,,,,,,,,,,
2014-02-08T00:00:03Z,4.3194E+17,piaras_14,en,-,-6.21720811,54.11456545,@blainmcg17 wee hornbal #taughtyouwell /piaras_14/status/431940452770934784/photo/1,@blainmcg17 wee hornball #taughtyouwell /C6yGymDoyl,,,,,,,,,,,,,,,,,
2014-02-08T00:00:04Z,4.3194E+17,PPompita,es,-,9.3215546,40.315019,@enrique305 esto es perfecto uauh yo y mi hermano v a ny al concierto lo enamorado 15feb desd italia solo para ti /PPompita/status/431940456973619200/photo/1,@enrique305 Esto es Perfecto uauh yo y mi hermano V a NY al concierto Los Enamorados 15Feb desde Italia solo para ti. /OrYYE2zN80,,,,,,,,,,,,,,,,,
2014-02-08T00:00:05Z,4.3194E+17,NickMontesdeoca,und,-,-71.34854858,42.63122899,<3,,,,,,,,,,,,,,,,,,
2014-02-08T00:00:05Z,4.3194E+17,Askin28Furkan,tr,-,28.6281946,41.0166627,birakma beni insanlar kötü bırakma beni korkuyorumm,Birakma beni insanlar kötü, bırakma beni korkuyorumm,,,,,,,,,,,,,,,,
2014-02-08T00:00:05Z,4.3194E+17,mumfy98,en,-,-75.59400911,43.08187836,i just want a horse,I just want a horse!!,,,,,,,,,,,,,,,,,
2014-02-08T00:00:05Z,4.3194E+17,Pitmedden_Weath,en,-,-2.18416667,57.33888889,wind 7.2 mph s Barometer 979.9 hpa fall temperature 2.6 c rain todai 0.0 mm forecast stormi much precipitation,Wind 7.2mph S. Barometer 979.9hPa, Falling. Temperature 2.6°C. Rain today 0.0mm. Forecast Stormy, much precipitation,,,,,,,,,,,,,,,
2014-02-08T00:00:06Z,4.3194E+17,BoeBaFett,en,-,-79.0129325,33.794075,2 whole hour still no repli,2 whole hours... still no reply,,,,,,,,,,,,,,,,,

2 Answers:

Answer 0 (score: 2)

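If it is acceptable to combine the last two fields into a single string, you could use the following approach:

  1. Use a variable to determine whether the header needs to be written. Always read the header first (using next()). If the variable is True, write the header; otherwise discard it.
  2. Strip each line and split it on , at most seven times. This keeps the last two string fields together as a single value.
  3. Use a function to try converting each field to an integer or a float.
  4. Use the csv quoting=csv.QUOTE_NONNUMERIC option to force quotes around all remaining string values.

This could be done as follows: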

import csv


def get_number(value):
    "Convert numeric strings into ints and floats"

    try:
        value = int(value)
    except ValueError:
        try:
            value = float(value)
        except ValueError:
            pass

    return value


with open('output.csv', 'wb') as f_output:
    csv_output = csv.writer(f_output, quoting=csv.QUOTE_NONNUMERIC)
    write_header = True

    with open('sample.csv') as f_input:
        header = next(f_input).strip().split(',')

        if write_header:
            csv_output.writerow(header)
            write_header = False

        for row in f_input:
            row = [get_number(value) for value in row.strip().split(',', 7)]
            csv_output.writerow(row)

This would give you output starting:

"1/1/1",1,"username1","en","-",-39.0,162,"Dreamlike. Semi-sensical. Sort of terrifying. The site is less a Twitter toy than a disturbing peer into my subconscious.,Dreamlike. Semi-sensical. Sort of terrifying. The site is less a Twitter toy than a disturbing peer into my subconscious."
"1/1/2",2,"username2","en","-",84.0,147,"The results are, predictably, hilarious. I couldn't have said it better myself,The results are, predictably, hilarious. I couldn't have said it better myself"
"1/1/3",3,"username3","en","-",-22.0,-180,"This site is providing some good laughs this morning here at the Twitter office.,This site is providing some good laughs this morning here at the Twitter office."
"1/1/4",4,"username4","en","-",-28.0,-49,"You can image what something like this might look like five, ten or twenty years from now, as our technical capabilities improve,You can image what something like this might look like five, ten or twenty years from now, as our technical capabilities improve"

This approach can then be extended to work on multiple input files.
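As a rough illustration of that extension, here is a minimal sketch of the same split-based approach applied to every file matching a pattern. It assumes Python 3 (text mode with newline='' instead of the 'wb' used above) and UTF-8 input; the function name merge_unquoted_csvs and its parameters are made up for this example and are not part of the original script.

```python
import csv
import glob
import os


def get_number(value):
    """Convert numeric strings into ints or floats; leave other text alone."""
    for convert in (int, float):
        try:
            return convert(value)
        except ValueError:
            pass
    return value


def merge_unquoted_csvs(pattern, output_path):
    """Merge every file matching `pattern` into `output_path`, quoting text fields."""
    # Python 3: newline='' lets the csv module control the line endings.
    with open(output_path, 'w', newline='', encoding='utf-8') as f_output:
        csv_output = csv.writer(f_output, quoting=csv.QUOTE_NONNUMERIC)
        write_header = True

        for filename in sorted(glob.glob(pattern)):
            # Skip the output file itself if it sits in the folder being merged.
            if os.path.abspath(filename) == os.path.abspath(output_path):
                continue

            with open(filename, newline='', encoding='utf-8') as f_input:
                header_line = next(f_input, None)
                if header_line is None:
                    continue  # empty file, nothing to copy

                if write_header:
                    csv_output.writerow(header_line.strip().split(','))
                    write_header = False

                for line in f_input:
                    # Split on the first seven commas only, so the text stays in one field.
                    fields = line.strip().split(',', 7)
                    csv_output.writerow([get_number(v) for v in fields])
```

It could be called as merge_unquoted_csvs(r'C:\Python\Scripts\Test_tweets\*.csv', 'output.csv'). Skipping the output path matters when the merged file is written into the same folder that the pattern matches, as in the question's script.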


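If some of your data is already quoted, and the ints and floats are in known columns, a different approach is needed. The sample data only shows unquoted data.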

import csv

with open('output.csv', 'wb') as f_output:
    csv_output = csv.writer(f_output, quoting=csv.QUOTE_NONNUMERIC)
    write_header = True

    with open('sample.csv', 'rb') as f_input:
        csv_input = csv.reader(f_input)
        header = next(csv_input)

        if write_header:
            csv_output.writerow(header)
            write_header = False

        for row in csv_input:
            row = row[:7] + [','.join(row[7:])]

            # Skip rows with insufficient values                
            if len(row) > 7:
                row[1] = int(row[1])
                row[5] = float(row[5])
                row[6] = float(row[6])
                csv_output.writerow(row)
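One thing to watch for with the sample data in the question: some rows store the ID as 4.3194E+17, which int() rejects with a ValueError, and those rows also end in a run of empty fields. The following is a minimal sketch of a more tolerant row conversion, written for Python 3. The helper names to_number and normalise_row are hypothetical, and stripping trailing commas assumes they are padding rather than part of the tweet.

```python
import csv
import io


def to_number(value):
    """Return an int where possible, otherwise a float (may raise ValueError)."""
    try:
        return int(value)
    except ValueError:
        return float(value)  # accepts forms such as '4.3194E+17'


def normalise_row(row, text_start=7):
    """Return the row with numeric columns converted, or None if it is unusable."""
    if len(row) <= text_start:
        return None  # too few fields to contain any text

    # Rejoin the pieces of text that the reader split on commas.
    # Assumption: a trailing run of commas is padding, not tweet content.
    text = ','.join(row[text_start:]).rstrip(',')

    fixed = row[:text_start] + [text]
    try:
        fixed[1] = to_number(fixed[1])
        fixed[5] = float(fixed[5])
        fixed[6] = float(fixed[6])
    except ValueError:
        return None  # a numeric column held something else
    return fixed


# Small demonstration using one row from the sample data.
sample = io.StringIO(
    '2014-02-08T00:00:05Z,4.3194E+17,mumfy98,en,-,-75.59400911,43.08187836,'
    'i just want a horse,I just want a horse!!,,,,,,\n'
)
for row in csv.reader(sample):
    print(normalise_row(row))
```

Rows for which normalise_row returns None could be skipped or logged rather than stopping the whole merge. Note that an ID stored as 4.3194E+17 has already lost its low digits in the source file, so no conversion can recover the original value.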

To work with multiple files, you need to add a loop that reads each CSV filename, for example:

import csv
import glob

with open('output.csv', 'wb') as f_output:
    csv_output = csv.writer(f_output, quoting=csv.QUOTE_NONNUMERIC)
    write_header = True

    for filename in glob.glob(r'C:\Python\Scripts\Test_tweets\*.csv'):
        with open(filename, 'rb') as f_input:
            csv_input = csv.reader(f_input)
            header = next(csv_input)

            if write_header:
                csv_output.writerow(header)
                write_header = False

            for row in csv_input:
                row = row[:7] + [','.join(row[7:])]

                # Skip rows with insufficient values                
                if len(row) > 7:
                    row[1] = int(row[1])
                    row[5] = float(row[5])
                    row[6] = float(row[6])
                    csv_output.writerow(row)

Note: Don't forget to add an r prefix to the folder string, to stop Python from trying to interpret the \ characters as escapes.
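A short sketch of why the prefix matters, using a made-up path (not the one from the question) that happens to contain \t and \n:

```python
# A hypothetical Windows path containing \t and \n, which are escape sequences.
plain = 'C:\temp\new_tweets'      # \t becomes a tab, \n becomes a newline
raw = r'C:\temp\new_tweets'       # backslashes are kept literally
escaped = 'C:\\temp\\new_tweets'  # doubling the backslashes also works

print(repr(plain))
print(repr(raw))
print(raw == escaped)
```

The path in the question happens to survive without the prefix because \P, \S and \T are not recognized escape sequences, but relying on that is fragile, and recent Python 3 releases emit a warning for unrecognized escapes.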

Answer 1 (score: -1)

The sample data is malformed. Correct data:

1,2,3,"Value with separator (,) must be in quotes",Value without comma
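As a quick check of how such a line behaves, here is a minimal Python 3 sketch that parses it with the csv module and writes it back out. It uses in-memory buffers rather than files, and the variable names are only for illustration.

```python
import csv
import io

line = '1,2,3,"Value with separator (,) must be in quotes",Value without comma\n'

# Reading: the quoted comma stays inside the fourth field.
fields = next(csv.reader(io.StringIO(line)))
print(len(fields))
print(fields[3])

# Writing: the default QUOTE_MINIMAL quotes only the fields that need it.
buffer = io.StringIO()
csv.writer(buffer, lineterminator='\n').writerow(fields)
print(buffer.getvalue() == line)
```

With data quoted this way, csv.reader returns the expected number of fields without any manual rejoining.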

See https://tools.ietf.org/html/rfc4180

  
      
  1. Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double quotes. For example:

    "aaa","b CRLF
    bb","ccc" CRLF
    zzz,yyy,xxx

  2.