I need some advice on how to fix a mistake I made while collecting data. The collected data has the following structure:
["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words, it's basically creating a mini tornado."]
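When writing the data to the .txt file, one record per line, I normally don't add "[" or "]". This time a mistake was made, so when the file is loaded the rows are split up in a way I did not intend.
Is there a way to load the data into pandas correctly?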
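Answer 0 (score: 1)
On the snippet I could cut and paste from the question (which I saved as test.txt), I was able to read a dataframe by first removing the square brackets (using sed on the Linux command line, but this could also be done in a text editor or, if needed, in Python):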
sed -i 's/^\[//g' test.txt # remove left square brackets assuming they are at the beginning of the line
sed -i 's/\]$//g' test.txt # remove right square brackets assuming they are at the end of the line
and then loading the dataframe (in a Python console):
import pandas as pd
pd.read_csv("test.txt", skipinitialspace = True, quotechar='"')
(I'm not sure this works for the whole file.)
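The same bracket cleanup can be done in Python, as the answer suggests. The following is only a minimal sketch: `strip_brackets` is a hypothetical helper name, and it assumes, like the sed commands, that each bracket sits at the very start or very end of its line.

```python
import re

def strip_brackets(text):
    """Remove a leading '[' and a trailing ']' from every line of text."""
    # re.MULTILINE makes ^ and $ match at each line boundary, like sed does.
    text = re.sub(r'^\[', '', text, flags=re.MULTILINE)
    text = re.sub(r'\]$', '', text, flags=re.MULTILINE)
    return text

# Two lines taken from the question, standing in for the real file.
sample = (
    '["Author", "Message"]\n'
    '["Kruse", "In other words, it\'s basically creating a mini tornado."]\n'
)
cleaned = strip_brackets(sample)
print(cleaned)
```

The cleaned text could then be wrapped in io.StringIO and passed to pd.read_csv with the same skipinitialspace and quotechar arguments, instead of editing the file in place.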
Answer 1 (score: 0)
Consider the code below, which reads the text in myfile.txt, which looks like this:
["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words ,it's basically creating a mini tornado."]
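The code below removes [ and ] from text, then splits each string in the list of lines on , and excludes the first line, which is the header. Some Message values contain a , which would create an extra column (NaN for the other rows), so the code joins those pieces back into a single string.
Code:
import pandas as pd

with open('myfile.txt', 'r') as my_file:
    text = my_file.read()

# Drop every square bracket, then skip the header line and any blank lines.
text = text.replace("[", "").replace("]", "")
rows = [line for line in text.split('\n')[1:] if line.strip()]

df = pd.DataFrame({
    'Author': [row.split(',')[0] for row in rows],
    # Re-join the pieces of a Message that itself contained commas
    # (note that this drops those commas).
    'Message': [''.join(row.split(',')[1:]) for row in rows]
}).apply(lambda col: col.str.replace('"', '', regex=False))  # remove the quotes
Output: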
Author Message
0 littleblackcat There's a lot of redditors here that live in the area maybe/hopefully someone saw something.
1 Kruse In other words it's basically creating a mini tornado.
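If the commas inside a Message should be kept rather than dropped, one possible variation is to split each line only at the first comma. This is a minimal sketch, not part of the answer above: `parse_line` is a hypothetical helper, and it assumes that the Author field never contains a comma.

```python
def parse_line(line):
    """Split one bracketed line into (author, message), keeping inner commas."""
    body = line.strip()
    # Drop the enclosing square brackets if they are present.
    if body.startswith('['):
        body = body[1:]
    if body.endswith(']'):
        body = body[:-1]
    # maxsplit=1: only the first comma separates Author from Message.
    author, message = body.split(',', 1)
    return author.strip().strip('"'), message.strip().strip('"')

line = '["Kruse", "In other words, it\'s basically creating a mini tornado."]'
author, message = parse_line(line)
print(author)
print(message)
```

A list of such tuples, built from every line after the header, could be passed to pd.DataFrame with columns=['Author', 'Message'].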
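Answer 2 (score: 0)
A pure pandas option is to change the separator from , to ", " so that there are only 2 columns, and then strip the unwanted characters, which as I understand it are [, ], " and spaces: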
import pandas as pd
import io
string = '''
["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words, it's basically creating a mini tornado."]
'''
df = pd.read_csv(io.StringIO(string),sep='\", \"', engine='python').apply(lambda x: x.str.strip('[\"] '))
# the \" instead of simply " is to make sure Python does not interpret it as an end-of-string character
df.columns = [df.columns[0][2:],df.columns[1][:-2]]
print(df)
# Output (note that the space before "There's" is also gone):
# Author Message
# 0 littleblackcat There's a lot of redditors here that live in t...
# 1 Kruse In other words, it's basically creating a mini...
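Answer 3 (score: 0)
Here are a few more options to add to the mix:
You could parse the lines yourself using ast.literal_eval, and then load them directly into a pd.DataFrame using an iterator: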
import pandas as pd
import ast
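with open('data', 'r') as f:
    # Parse each line lazily as a Python list literal.
    lines = (ast.literal_eval(line) for line in f)
    header = next(lines)
    # Build the DataFrame while the file is still open, because `lines` is lazy.
    df = pd.DataFrame(lines, columns=header)
print(df)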
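Note, however, that calling ast.literal_eval once per line may not be very fast, especially if your data file has a lot of lines. If the data file is not too big, though, this may be an acceptable, simple solution.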
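If the per-line cost of ast.literal_eval is a concern, one alternative worth trying is json.loads, because the sample lines also happen to be valid JSON arrays. This is a sketch under that assumption (every line uses double-quoted strings and nothing Python-specific); `load_rows` is a hypothetical helper, and no timings were measured here.

```python
import io
import json

def load_rows(lines):
    """Parse one JSON array per line; the first non-blank line is the header."""
    rows = (json.loads(line) for line in lines if line.strip())
    header = next(rows)
    return header, list(rows)

# Stand-in for the real file, using lines from the question.
sample = io.StringIO(
    '["Author", "Message"]\n'
    '["Kruse", "In other words, it\'s basically creating a mini tornado."]\n'
)
header, rows = load_rows(sample)
print(header)
print(rows)
```

The result can be handed to pandas in the same way as above, with pd.DataFrame(rows, columns=header).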
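Another option is to wrap an arbitrary iterator (one that yields bytes) in an IterStream. This very general tool (thanks to Mechanical snail) lets you manipulate the contents of any file and then repackage it as a file-like object. So you can fix the contents of the file and still pass it to any function that expects a file-like object, such as pd.read_csv. (Note: I used the same tool to answer a similar question.)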
import io
import pandas as pd

def iterstream(iterable, buffer_size=io.DEFAULT_BUFFER_SIZE):
    """
    http://stackoverflow.com/a/20260030/190597 (Mechanical snail)
    Lets you use an iterable (e.g. a generator) that yields bytestrings as a
    read-only input stream.

    The stream implements Python 3's newer I/O API (available in Python 2's io
    module).
    For efficiency, the stream is buffered.
    """
    class IterStream(io.RawIOBase):
        def __init__(self):
            self.leftover = None
        def readable(self):
            return True
        def readinto(self, b):
            try:
                l = len(b)  # We're supposed to return at most this much
                chunk = self.leftover or next(iterable)
                output, self.leftover = chunk[:l], chunk[l:]
                b[:len(output)] = output
                return len(output)
            except StopIteration:
                return 0    # indicate EOF
    return io.BufferedReader(IterStream(), buffer_size=buffer_size)

def clean(f):
    # Strip surrounding whitespace, then drop the enclosing [ and ].
    for line in f:
        yield line.strip()[1:-1]+b'\n'

with open('data', 'rb') as f:
    # https://stackoverflow.com/a/50334183/190597 (Davide Fiocco)
    df = pd.read_csv(iterstream(clean(f)), skipinitialspace=True, quotechar='"')
print(df)
Answer 4 (score: -1)
For now I have found the following solution:
sep = '[|"|]'
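Using a multi-character separator allows the brackets to be stored in separate columns of the pandas dataframe, which can then be dropped. This avoids having to strip the characters line by line.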
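To see roughly why this works: pandas interprets a separator longer than one character as a regular expression, and as a regular expression [|"|] is a character class that matches | or ", not the brackets themselves. The sketch below uses only re.split to show how such a pattern cuts up one line. It is an illustration, not a reproduction of what pd.read_csv does internally, since pandas applies its own quote handling on top.

```python
import re

# The separator from the answer, treated as a regular expression:
# a character class matching either '|' or '"'.
sep = '[|"|]'

line = '["Kruse", "In other words, it\'s basically creating a mini tornado."]'
fields = re.split(sep, line)
print(fields)

# The brackets and the ", " between the quotes land in their own fields,
# so the useful values sit at positions 1 and 3.
author, message = fields[1], fields[3]
print(author)
print(message)
```

Under this reading, the unwanted columns are the first, middle and last ones, which matches the answer's description of dropping the bracket columns afterwards.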