Question

I'm having trouble reading a dataset with this layout:

#    title
#    description
#    link (may no longer be active)
#    id
#    date
#    source (nyt|us|reuters)
#    category

Example:

court agrees to expedite n.f.l.'s appeal\n
the decision means a ruling could be made nearly two months before the regular season begins, time for the sides to work out a deal without delaying the season.\n
http://feeds1.nytimes.com/~r/nyt/rss/sports/~3/nbjo7ygxwpc/04nfl.html\n
0\n
04 May 2011 07:39:03\n
nyt\n
sport\n

I tried:

columns = ['title', 'description', 'link', 'id', 'date', 'source', 'category']
df = pd.read_csv('news', delimiter = "\n", names = columns,error_bad_lines=False)

But it puts all the information into the title column.

Does anyone know a way to handle this?

Thanks!

Answer 1

You cannot use \n as the delimiter for a CSV. What you can do is set the index equal to the column names and then transpose, i.e.

df = pd.read_csv('news', index=columns).transpose()
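
As far as I know, read_csv has no index keyword, so the line above will most likely raise a TypeError. Below is a minimal sketch of the same reshape idea done by hand. It assumes every field sits on exactly one physical line (a description wrapped over two lines would break it) and that each line may end with a literal \n marker. The helper names clean and read_records, the in-memory sample, and the second record are all made up for illustration:

```python
import io
import pandas as pd

columns = ['title', 'description', 'link', 'id', 'date', 'source', 'category']

def clean(line):
    # Drop the real line ending, then a trailing literal backslash-n marker if present.
    line = line.rstrip("\r\n")
    return line[:-2] if line.endswith("\\n") else line

def read_records(handle, columns):
    # Assumption: one field per physical line; blank lines carry no data.
    lines = [clean(ln) for ln in handle if ln.strip()]
    width = len(columns)
    if len(lines) % width != 0:
        raise ValueError(f"expected a multiple of {width} lines, got {len(lines)}")
    # Group every `width` consecutive lines into one row.
    rows = [lines[i:i + width] for i in range(0, len(lines), width)]
    return pd.DataFrame(rows, columns=columns)

# In-memory stand-in for the 'news' file; the second record is invented.
sample = io.StringIO(
    "court agrees to expedite n.f.l.'s appeal\\n\n"
    "the decision means a ruling could be made before the season begins.\\n\n"
    "http://feeds1.nytimes.com/~r/nyt/rss/sports/~3/nbjo7ygxwpc/04nfl.html\\n\n"
    "0\\n\n"
    "04 May 2011 07:39:03\\n\n"
    "nyt\\n\n"
    "sport\\n\n"
    "placeholder title\\n\n"
    "placeholder description, with a comma\\n\n"
    "http://example.com/placeholder\\n\n"
    "1\\n\n"
    "05 May 2011 08:00:00\\n\n"
    "reuters\\n\n"
    "business\\n\n"
)

df = read_records(sample, columns)
print(df[['id', 'source', 'category']])
```

With the real file, the same function would be called as read_records(open('news'), columns). All values come back as strings, so id and date would still need converting afterwards.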

Answer 2

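Here are a few things to be aware of:

1) Pandas interprets any delimiter longer than 1 character as a regular expression.

2) Because the 'c' engine does not support regular expressions, I explicitly set the engine to 'python' to avoid a warning.

3) I had to add a dummy column because there is a '\n' at the end of the file, which produces an extra empty field; I later removed that column with drop.

So these lines should hopefully give you the result you want.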

columns = ['title', 'description', 'link', 'id', 'date', 'source', 'category','dummy']
df = pd.read_csv('news', names=columns, delimiter="\\\\n", engine='python').drop('dummy',axis=1)
df
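
To try the regex-delimiter approach without the real file, here is a small self-contained sketch. It assumes each record sits on one physical line, with the fields separated by a literal backslash followed by n and one trailing separator at the end. The in-memory text stands in for the 'news' file, and the second record is invented:

```python
import io
import pandas as pd

columns = ['title', 'description', 'link', 'id', 'date', 'source', 'category', 'dummy']

# First record uses values from the question; the second one is made up.
records = [
    ["court agrees to expedite n.f.l.'s appeal",
     "the decision means a ruling could be made before the season begins.",
     "http://feeds1.nytimes.com/~r/nyt/rss/sports/~3/nbjo7ygxwpc/04nfl.html",
     "0", "04 May 2011 07:39:03", "nyt", "sport"],
    ["placeholder title",
     "placeholder description, with a comma",
     "http://example.com/placeholder",
     "1", "05 May 2011 08:00:00", "reuters", "business"],
]

# Put a literal backslash-n after every field (including the last), one record per line.
text = "\n".join("".join(field + "\\n" for field in rec) for rec in records) + "\n"

df = pd.read_csv(
    io.StringIO(text),
    names=columns,
    sep=r"\\n",        # regex: a literal backslash followed by "n"
    engine="python",   # regex separators need the Python engine
).drop("dummy", axis=1)

print(df[['id', 'date', 'source', 'category']])
```

The raw string r"\\n" is the same pattern as "\\\\n" above, just easier to read. The trailing separator yields an empty eighth field on every row, which is why the dummy column is declared and then dropped.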

I hope this helps :)

Read csv with pandas with this kind of dataset
