Python dataset loading error

Time: 2018-05-14 14:58:08

Tags: python python-3.x pandas dataframe

I need some advice on how to fix a mistake I made while collecting data. The collected data has the following structure:

["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words, it's basically creating a mini tornado."]

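When writing data to the .txt file, one record per line, I normally would not add the "[" and "]". However, the mistake was made, so when the file is loaded, pandas splits it in the following way: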

[Image: Pandas Data]

Is there a way to load the data properly into pandas?

5 answers:

Answer 0 (score: 1)

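On the snippet I could cut and paste from the question (which I named test.txt), I could successfully read a dataframe by:

  1. Cleaning up the square brackets (using sed on the Linux command line, but this can also be done in a text editor, or in Python if needed)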

    sed -i 's/^\[//g' test.txt # remove left square brackets assuming they are at the beginning of the line
    sed -i 's/\]$//g' test.txt # remove right square brackets assuming they are at the end of the line
    
  2. Loading the dataframe (in a Python console)

    import pandas as pd
    pd.read_csv("test.txt", skipinitialspace = True, quotechar='"')
    
  (I am not sure whether this will work for the entire file.)
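
The bracket cleanup in step 1 can also be done in Python. The following is a minimal sketch, not the answerer's code: the helper name `strip_brackets` is hypothetical, and the standard-library `csv` reader is used only to check that the cleaned lines parse into two fields with the same `skipinitialspace` and `quotechar` settings. The cleaned text could presumably be handed to `pd.read_csv` through `io.StringIO` in the same way.

```python
import csv
import io
import re


def strip_brackets(text):
    """Remove one leading '[' and one trailing ']' from every non-empty line.

    Hypothetical helper mirroring the two sed commands above.
    """
    cleaned = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        line = re.sub(r'^\[', '', line)   # same as sed 's/^\[//'
        line = re.sub(r'\]$', '', line)   # same as sed 's/\]$//'
        cleaned.append(line)
    return "\n".join(cleaned) + "\n"


sample = '''["Author", "Message"]
["Kruse", "In other words, it's basically creating a mini tornado."]
'''

cleaned = strip_brackets(sample)

# Parse the cleaned text with the same quoting options as in step 2.
rows = list(csv.reader(io.StringIO(cleaned), skipinitialspace=True, quotechar='"'))
print(rows)
```

Because the fields are quoted, the comma inside the second message stays inside a single field.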

Answer 1 (score: 0)

Consider the code below, which reads the text in myfile.txt that looks like this:

["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words ,it's basically creating a mini tornado."]

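The code below removes [ and ] from text, then splits each line on , (skipping the first line, which is the header). Some Message values contain a , which would otherwise produce an extra column (NaN in the other rows), so the code joins those pieces back into one string. Code: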

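import pandas as pd

with open('myfile.txt', 'r') as my_file:
    text = my_file.read()
    text = text.replace("[", "")
    text = text.replace("]", "")

# skip the header line and any blank lines (e.g. from a trailing newline)
rows = [line for line in text.split('\n')[1:] if line.strip()]

df = pd.DataFrame({
    'Author': [i.split(',')[0].replace('"', '') for i in rows],
    # note: ''.join drops the commas that were inside the message
    'Message': [''.join(i.split(',')[1:]).replace('"', '') for i in rows]
})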

Output:

    Author                             Message
0   littleblackcat    There's a lot of redditors here that live in the area  maybe/hopefully someone saw something. 
1   Kruse             In other words it's basically creating a mini tornado.
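
Joining the pieces with an empty string removes the commas that were part of the message. A possible variation, sketched here under the assumption that the author field never contains a comma, is to split only on the first comma. The helper name `split_row` is hypothetical and not part of the answer.

```python
def split_row(line):
    """Split one bracketed line into (author, message) on the first comma only."""
    line = line.strip().lstrip('[').rstrip(']')
    # maxsplit=1 keeps any later commas inside the message
    author, message = line.split(',', 1)
    # drop surrounding whitespace first, then the surrounding double quotes
    return author.strip().strip('"'), message.strip().strip('"')


row = '["Kruse", "In other words, it\'s basically creating a mini tornado."]'
print(split_row(row))
```

The resulting tuples could then be collected into the 'Author' and 'Message' lists used to build the dataframe.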

Answer 2 (score: 0)

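A pure pandas option is to change the separator from , to ", " so that there are only 2 columns, and then strip the unwanted characters, which as I understand it are [, ], " and spaces: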

import pandas as pd
import io
string = '''
["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words, it's basically creating a mini tornado."]
'''

df = pd.read_csv(io.StringIO(string),sep='\", \"', engine='python').apply(lambda x: x.str.strip('[\"] '))
# \" is written instead of a plain " to be safe; inside a single-quoted string the escape is optional
df.columns = [df.columns[0][2:],df.columns[1][:-2]]

print(df)
# Output (note that the space before "There's" is also gone)
#            Author                                            Message
# 0  littleblackcat  There's a lot of redditors here that live in t...
# 1           Kruse  In other words, it's basically creating a mini...
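
To see why the column names need the `[2:]` and `[:-2]` slices, the splitting can be reproduced on single lines with the standard library. This is only an illustration that uses `re.split` as a stand-in for what the separator does; it is not how pandas is implemented internally.

```python
import re

sep = '", "'
header = '["Author", "Message"]'
row = '["Kruse", "In other words, it\'s basically creating a mini tornado."]'

# Splitting the header leaves '["' on the left piece and '"]' on the right piece.
left, right = re.split(sep, header)
names = [left[2:], right[:-2]]

# For data rows, str.strip removes any leading or trailing [, ], " or space.
cells = [part.strip('["] ') for part in re.split(sep, row)]

print(left, right)
print(names)
print(cells)
```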

Answer 3 (score: 0)

Here are a few more options to add to the mix:

  1. You could parse the lines yourself with ast.literal_eval, then load them directly into a pd.DataFrame using an iterator:

    import pandas as pd
    import ast
    with open('data', 'r') as f:
        lines = (ast.literal_eval(line) for line in f)
        header = next(lines)
        df = pd.DataFrame(lines, columns=header)
        print(df)
    

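    Note, however, that calling ast.literal_eval once per line may not be very fast, especially if your data file has many lines. If the data file is not too big, though, this may be an acceptable, simple solution.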

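  2. Another option is to wrap an arbitrary iterator (one that yields bytes) in an IterStream. This very general tool (thanks to Mechanical snail) lets you manipulate the contents of any file and then repackage it as a file-like object. You can therefore fix the contents of the file and still pass it to any function that expects a file-like object, such as pd.read_csv. (Note: I have used the same tool to answer a similar question.)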

    import io
    import pandas as pd
    
    def iterstream(iterable, buffer_size=io.DEFAULT_BUFFER_SIZE):
        """
        http://stackoverflow.com/a/20260030/190597 (Mechanical snail)
        Lets you use an iterable (e.g. a generator) that yields bytestrings as a
        read-only input stream.
    
        The stream implements Python 3's newer I/O API (available in Python 2's io
        module).
    
        For efficiency, the stream is buffered.
        """
        class IterStream(io.RawIOBase):
            def __init__(self):
                self.leftover = None
            def readable(self):
                return True
            def readinto(self, b):
                try:
                    l = len(b)  # We're supposed to return at most this much
                    chunk = self.leftover or next(iterable)
                    output, self.leftover = chunk[:l], chunk[l:]
                    b[:len(output)] = output
                    return len(output)
                except StopIteration:
                    return 0    # indicate EOF
        return io.BufferedReader(IterStream(), buffer_size=buffer_size)
    
    def clean(f):
        for line in f:
            yield line.strip()[1:-1]+b'\n'
    
    with open('data', 'rb') as f:
        # https://stackoverflow.com/a/50334183/190597 (Davide Fiocco)
        df = pd.read_csv(iterstream(clean(f)), skipinitialspace=True, quotechar='"')
        print(df)
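
As a variation on option 1 above: each line in the sample happens to be a valid JSON array, so `json.loads` could stand in for `ast.literal_eval`. This is a sketch under the assumption that every line in the file is valid JSON (double-quoted strings only); the helper name `parse_lines` is hypothetical, and whether it is actually faster on a given file would need to be measured.

```python
import json


def parse_lines(lines):
    """Parse each non-empty line as a JSON array and return (header, rows)."""
    parsed = (json.loads(line) for line in lines if line.strip())
    header = next(parsed)          # first line holds the column names
    return header, list(parsed)    # remaining lines are the data rows


sample = [
    '["Author", "Message"]\n',
    '["Kruse", "In other words, it\'s basically creating a mini tornado."]\n',
]

header, rows = parse_lines(sample)
print(header)
print(rows)
```

The header and rows could then be passed to pd.DataFrame(rows, columns=header), as in option 1. An open file object can be passed in place of the sample list.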
    

Answer 4 (score: -1)

For now I have found the following solution:

sep = '[|"|]'

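Using a multi-character separator lets the brackets be stored in separate columns of the pandas dataframe, which are then dropped. This avoids having to strip the words line by line.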
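
Read as a regular expression, '[|"|]' is a character class that matches either | or ", so the line is effectively split on the double quotes and the brackets fall into the first and last fields. The sketch below illustrates this on one line with `re.split`, used here only as a stand-in; how pandas applies the separator may differ in detail.

```python
import re

sep = '[|"|]'  # character class: matches '|' or '"'
line = '["Kruse", "In other words, it\'s basically creating a mini tornado."]'

fields = re.split(sep, line)
for index, field in enumerate(fields):
    print(index, repr(field))

# Fields 1 and 3 hold the author and the message; the others are leftovers.
author, message = fields[1], fields[3]
```

The leftover fields (the brackets and the ", " between the two quoted strings) correspond to the columns that would be dropped afterwards.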