Convert a column of a CSV file into a word histogram in Python

Time: 2018-02-15 12:17:09

Tags: python csv

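I have a CSV file with two columns and no header row. I want to ignore the first column and use the second column (row[1]) to build a histogram of word frequencies.

However, each entry contains several words, and the other code answers I found here treat each entry as a single string, so every bar in my histogram ends up with the same value because each string occurs only once in the file. I also tried appending every row[1] to a list, but that did not work either and gave the same result: all bars have the same height. I want to build the histogram from data such as: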

positive     This dress is great
negative     This coat is terrible
neutral      That dress was ok

I would like the histogram bars to have the values

This:2 is:2 dress:2 great:1 etc

2 answers:

Answer 0 (score: 2)

Option 1
Use a combination of the csv module and a collections.Counter object:

import csv
from collections import Counter

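# Flatten the second column of every row into one list of lowercased words.
# Rows with fewer than two fields (e.g. blank lines) are skipped.
with open('data.csv', newline='') as f:
    data = [word for row in csv.reader(f) if len(row) > 1
            for word in row[1].lower().split()]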

counts = Counter(data)
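
To go from the counts to an actual histogram with the question's sample rows, a minimal sketch follows. It assumes the file is comma-separated (the sample in the question is shown with spaces), uses an in-memory io.StringIO in place of data.csv, and prints a text bar chart rather than a plot. The names sample and count_second_column are made up for illustration.

```python
import csv
import io
from collections import Counter

# Assumption: comma-separated, no header row; stands in for data.csv.
sample = (
    "positive,This dress is great\n"
    "negative,This coat is terrible\n"
    "neutral,That dress was ok\n"
)

def count_second_column(f):
    """Count lowercased words in the second column of a CSV stream."""
    counts = Counter()
    for row in csv.reader(f):
        if len(row) > 1:  # skip blank or short rows
            counts.update(row[1].lower().split())
    return counts

counts = count_second_column(io.StringIO(sample))

# Text histogram: one '#' per occurrence, most frequent words first.
for word, n in counts.most_common():
    print(f"{word:<10}{'#' * n} {n}")
```

If matplotlib is installed, the same counts could probably be drawn as bars with plt.bar(list(counts), list(counts.values())).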

Option 2
Use pandas. Load your data as a Series, indicate that there is no header row with header=None, and skip the first column (column 0) with usecols=[1].

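import pandas as pd
# header=None: the file has no header row; usecols=[1]: keep only the second column.
# .squeeze('columns') turns the one-column DataFrame into a Series
# (the squeeze=True argument of read_csv was removed in pandas 2.0).
s = pd.read_csv('data.csv', header=None, usecols=[1]).squeeze('columns')
s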

0      This dress is great
1    This coat is terrible
2        That dress was ok
Name: 1, dtype: object

Next, call str.split on whitespace (with expand=True), then stack, and finally value_counts:

s.str.lower().str.split(None, expand=True).stack().value_counts()

this        2
dress       2
is          2
coat        1
ok          1
was         1
great       1
terrible    1
that        1
dtype: int64
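
An alternative sketch of the same count, assuming pandas 0.25 or later (where Series.explode is available): split each entry into a list of words and explode the lists into one word per row, which avoids the wide intermediate frame that expand=True builds. The Series s here is constructed by hand from the sample rows rather than read from data.csv.

```python
import pandas as pd

# Hand-built stand-in for the Series loaded from data.csv (an assumption).
s = pd.Series([
    "This dress is great",
    "This coat is terrible",
    "That dress was ok",
])

# split() with no pattern splits on runs of whitespace and yields lists;
# explode() turns each list element into its own row.
word_counts = s.str.lower().str.split().explode().value_counts()
print(word_counts)
```

If matplotlib is installed, word_counts.plot(kind='bar') should draw the histogram from this result.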

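Answer 1 (score: 0)

Just to add: if the file is large (really large), you can read it line by line. Try this code. When splitting, it treats a run of whitespace as a single separator, and it lets you ignore case.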

from collections import defaultdict
import re


def read_string_column(csv_file,delimiter=','):
    """
    Read the file line by line
    """
    line = '1'
    while line:
        line = csv_file.readline()
        if len(line.split(delimiter))<2:
            continue
        yield line.split(delimiter)[1]


def add_word_counts(word_dict,new_words):
    """
    Add word counts to a defaultdict
    """
    for word in new_words:
        if word:
            word_dict[word]+=1

def count_words(csv_file,delimiter=',',ignore_case=True):
    """
    Count words.
    delimiter - csv file delimiter
    ignore_case - if True, the text will be read in a lower case
    """
    result_dict = defaultdict(int)
    # Raw strings (r'\s+') avoid the invalid-escape warning for '\s'.
    if ignore_case:
        for words in read_string_column(csv_file,delimiter):
            add_word_counts(result_dict,re.split(r'\s+',words.lower().strip()))
    else:
        for words in read_string_column(csv_file,delimiter):
            add_word_counts(result_dict,re.split(r'\s+',words.strip()))
    return result_dict

if __name__=='__main__':
    # Use a context manager so the file is closed once counting finishes.
    with open('1.csv','r') as f:
        print(count_words(f,ignore_case=True))
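
A more compact variant of the same idea, as a sketch: csv.reader also consumes the file one row at a time, so memory use stays low, and it handles quoted fields that contain the delimiter, which a plain line.split(delimiter) would cut apart. The function name count_words_streaming and the in-memory sample are illustrative assumptions, not part of the answer above.

```python
import csv
import io
from collections import Counter

def count_words_streaming(csv_file, delimiter=',', ignore_case=True):
    """Count words in the second column, reading one row at a time."""
    counts = Counter()
    for row in csv.reader(csv_file, delimiter=delimiter):
        if len(row) < 2:  # blank or short row: nothing to count
            continue
        text = row[1].lower() if ignore_case else row[1]
        # str.split() with no argument treats runs of whitespace as one separator.
        counts.update(text.split())
    return counts

# Assumed sample; the second row has a quoted field containing commas.
sample = io.StringIO(
    'positive,This dress is great\n'
    'negative,"This coat, frankly, is terrible"\n'
    '\n'
    'neutral,That   dress was ok\n'
)
result = count_words_streaming(sample)
print(result.most_common(3))
```

Punctuation stays attached to the words here (for example "coat," is counted as its own token), just as in the code above; stripping it would be a separate step.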