Question

I need some advice on how to fix a mistake I made while collecting data. The collected data has the following structure:

["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words, it's basically creating a mini tornado."]

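When writing the data, I normally do not add "[" or "]" to the .txt file, and each entry goes on its own line. However, a mistake was made, so when the file is loaded it gets split up in the following way:

Is there a way to load the data correctly into pandas?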

Answer 1

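On the snippet that I could cut and paste from the question (which I saved as test.txt), I was able to read a dataframe successfully with the following two steps.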

Clean up the square brackets (here with sed on a Linux command line, but this can also be done in a text editor, or in Python if needed):

sed -i 's/^\[//g' test.txt # remove left square brackets assuming they are at the beginning of the line
sed -i 's/\]$//g' test.txt # remove right square brackets assuming they are at the end of the line

Load the dataframe (in a Python console):

import pandas as pd
pd.read_csv("test.txt", skipinitialspace = True, quotechar='"')

(I am not sure this will work for the whole file.)
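If sed is not available, the same two steps can be sketched in Python alone. This is only a minimal sketch: the helper name `strip_brackets` is made up for illustration, the sample text is copied from the question, and it assumes every line starts with "[" and ends with "]" as in that sample.

```python
import io
import pandas as pd

# Sample lines copied from the question; a real run would read them from test.txt.
raw = '''["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words, it's basically creating a mini tornado."]
'''

def strip_brackets(line):
    """Hypothetical helper: drop one leading '[' and one trailing ']' if present."""
    line = line.strip()
    if line.startswith("["):
        line = line[1:]
    if line.endswith("]"):
        line = line[:-1]
    return line

# Clean each line in memory, then hand the result to read_csv as a file-like object.
cleaned = "\n".join(strip_brackets(line) for line in raw.splitlines())
df = pd.read_csv(io.StringIO(cleaned), skipinitialspace=True, quotechar='"')
print(df)
```

Because the messages stay inside double quotes, the comma in the second message should not start a new column.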

Answer 2

Consider the code below, which reads the text in myfile.txt. The file looks like this:

["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words ,it's basically creating a mini tornado."]

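The code below removes [ and ] from the text, then splits each line on , and leaves out the first line, which is the header. Some Message values contain a , of their own, which would produce an extra column (NaN for the other rows), so the code joins those pieces back into one string. Code: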

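import pandas as pd

with open('myfile.txt', 'r') as my_file:
    text = my_file.read()

# Remove every square bracket from the text.
text = text.replace("[", "")
text = text.replace("]", "")

# Skip the header line and any blank lines (e.g. after a trailing newline).
lines = [i for i in text.split('\n')[1:] if i]

# Note: DataFrame.applymap is deprecated since pandas 2.1 in favor of DataFrame.map.
df = pd.DataFrame({
    'Author': [i.split(',')[0] for i in lines],
    # Rejoin a message that was split on its own commas (the commas are dropped).
    'Message': [''.join(i.split(',')[1:]) for i in lines]
}).applymap(lambda x: x.replace('"', ''))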

Output:

    Author                             Message
0   littleblackcat    There's a lot of redditors here that live in the area  maybe/hopefully someone saw something. 
1   Kruse             In other words it's basically creating a mini tornado.
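If the commas inside messages should be kept rather than dropped, one possible variation is to split each line at the first comma only. This is a sketch under assumptions, not part of the answer above: `parse_line` is a made-up helper, the sample text comes from the question, and it assumes author names never contain a comma.

```python
import pandas as pd

# Sample lines copied from the question.
text = '''["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words, it's basically creating a mini tornado."]
'''

def parse_line(line):
    """Hypothetical helper: turn one bracketed line into [author, message]."""
    body = line.strip().strip("[]")       # drop the surrounding brackets
    author, message = body.split(",", 1)  # split at the first comma only
    # Remove the whitespace around each field, then the enclosing double quotes.
    return [author.strip().strip('"'), message.strip().strip('"')]

rows = [parse_line(line) for line in text.splitlines() if line.strip()]
df = pd.DataFrame(rows[1:], columns=rows[0])  # first row is the header
print(df)
```

Whitespace inside the quotes is left alone, so the first message keeps its leading space.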

Answer 3

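A pure pandas option is to change the separator from , to ", " so that there are only 2 columns, and then strip the unwanted characters, which as far as I understand are [, ], " and spaces: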

import pandas as pd
import io
string = '''
["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words, it's basically creating a mini tornado."]
'''

df = pd.read_csv(io.StringIO(string),sep='\", \"', engine='python').apply(lambda x: x.str.strip('[\"] '))
# the \" instead of simply " is to make sure Python does not interpret it as an end-of-string character
df.columns = [df.columns[0][2:],df.columns[1][:-2]]

print(df)
# Output (note that the space before "There's" is also gone)
#            Author                                            Message
# 0  littleblackcat  There's a lot of redditors here that live in t...
# 1           Kruse  In other words, it's basically creating a mini...
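For anyone unsure what the `.str.strip('["] ')` call does: the argument is treated as a set of characters to remove from both ends of each string, not as a literal substring. A small sketch with made-up input values:

```python
import pandas as pd

# Two raw field values of the kind the ", " separator leaves behind.
raw_fields = pd.Series(['["Kruse', 'tornado."]'])

# Strip any run of '[', '"', ']' or ' ' from both ends of each string.
stripped = raw_fields.str.strip('["] ')
print(stripped.tolist())
```

Characters outside that set, such as the final period, stop the stripping, which is why only the brackets, quotes and outer spaces disappear.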

Answer 4

Here are a couple more options to add to the mix:

You could parse the lines yourself using ast.literal_eval, and then load them directly into a pd.DataFrame using an iterator:
```
import pandas as pd
import ast
with open('data', 'r') as f:
    lines = (ast.literal_eval(line) for line in f)
    header = next(lines)
    df = pd.DataFrame(lines, columns=header)
    print(df)
```
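Note, however, that calling ast.literal_eval once per line may not be very fast, especially if your data file has a lot of lines. If the data file is not too big, though, this may be an acceptable, simple solution.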
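As a possible variation on the same idea (not part of the original suggestion), the sample lines also happen to be valid JSON arrays, so json.loads could stand in for ast.literal_eval. This sketch assumes every line in the real file is valid JSON, which would not hold for single-quoted strings or unescaped double quotes inside a message; whether it is actually faster on your data is something to measure.

```python
import io
import json
import pandas as pd

# Sample lines copied from the question; a real run would open the data file instead.
raw = '''["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words, it's basically creating a mini tornado."]
'''

with io.StringIO(raw) as f:
    # Parse each non-blank line as a JSON array of two strings.
    rows = [json.loads(line) for line in f if line.strip()]

header, records = rows[0], rows[1:]
df = pd.DataFrame(records, columns=header)
print(df)
```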

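Another option is to wrap an arbitrary iterator (one that yields bytes) in an IterStream. This very general tool (thanks to Mechanical snail) lets you manipulate the contents of any file and then repackage them into a file-like object. That way you can fix the contents of the file and still pass it to any function that expects a file-like object, such as pd.read_csv. (Note: I have answered a similar question using the same tool.)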

import io
import pandas as pd

def iterstream(iterable, buffer_size=io.DEFAULT_BUFFER_SIZE):
    """
    http://stackoverflow.com/a/20260030/190597 (Mechanical snail)
    Lets you use an iterable (e.g. a generator) that yields bytestrings as a
    read-only input stream.

    The stream implements Python 3's newer I/O API (available in Python 2's io
    module).

    For efficiency, the stream is buffered.
    """
    class IterStream(io.RawIOBase):
        def __init__(self):
            self.leftover = None
        def readable(self):
            return True
        def readinto(self, b):
            try:
                l = len(b)  # We're supposed to return at most this much
                chunk = self.leftover or next(iterable)
                output, self.leftover = chunk[:l], chunk[l:]
                b[:len(output)] = output
                return len(output)
            except StopIteration:
                return 0    # indicate EOF
    return io.BufferedReader(IterStream(), buffer_size=buffer_size)

def clean(f):
    for line in f:
        yield line.strip()[1:-1]+b'\n'

with open('data', 'rb') as f:
    # https://stackoverflow.com/a/50334183/190597 (Davide Fiocco)
    df = pd.read_csv(iterstream(clean(f)), skipinitialspace=True, quotechar='"')
    print(df)
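To see what pd.read_csv actually receives in this approach, it may help to run the `clean` generator on its own. The sketch below repeats that helper and feeds it two sample byte lines taken from the question:

```python
def clean(lines):
    # Same helper as in the answer: strip whitespace, drop the first and last
    # byte (the brackets), and put the newline back.
    for line in lines:
        yield line.strip()[1:-1] + b'\n'

sample = [
    b'["Author", "Message"]\n',
    b'["Kruse", "In other words, it\'s basically creating a mini tornado."]\n',
]

cleaned = list(clean(sample))
for line in cleaned:
    print(line)
```

Note that the slice removes the first and last byte unconditionally, so this relies on every line really being wrapped in brackets.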

Answer 5

I have found the following solution for now:

sep = '[|"|]'

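Using a multi-character separator lets the brackets be stored in separate columns of the pandas dataframe, and those columns are then dropped. This avoids having to strip the words line by line.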
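Here is a rough sketch of how that separator might be used; the surrounding read_csv call and the column selection are my assumptions, since the answer only gives the `sep` value. As a regular expression, `[|"|]` is a character class that matches `|` or `"`, so each line is split at its double quotes, and the brackets and the `, ` between the fields end up in columns of their own.

```python
import io
import pandas as pd

# Sample lines copied from the question.
raw = '''["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words, it's basically creating a mini tornado."]
'''

# A separator longer than one character is treated as a regex (python engine).
wide = pd.read_csv(io.StringIO(raw), sep='[|"|]', engine='python')

# Keep only the two real columns; the bracket and comma columns are dropped.
df = wide[["Author", "Message"]]
print(df)
```

This would break on any message that itself contains a double quote or a `|`, so it is worth checking the column count on the full file.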

Python dataset loading error

5 answers: