Question

I'm practicing with the Kaggle news headlines dataset: https://www.kaggle.com/aaron7sun/stocknews#Combined_News_DJIA.csv

import pandas as pd

df = pd.read_csv('./data/Combined_News_DJIA.csv')

When I read the news headlines into a DataFrame, each column comes back as a Series that looks like this:

0       b"Georgia 'downs two Russian warplanes' as cou...
1       b'Why wont America &amp; Nato help us? If they w...
2       b'Remember that adorable 9-year-old who sang a...
3       b' U.S. refuses Israel weapons to attack Iran:...
4       b'All the experts admit that we should legalis...

I tried the following:

df['Series'].str.decode("utf-8")

But the output is a Series of NaN values. Any ideas? A solution that works on the whole DataFrame (not just a single Series) would be ideal.

Answer 1

You can't decode it from UTF-8 because it is already a string, not a byte sequence.
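To make the distinction concrete, here is a minimal sketch with made-up sample values (not taken from the dataset). It shows that `.str.decode` works on a Series of real `bytes` objects, while a `str` that merely looks like a bytes literal has no `decode` method to call:

```python
import pandas as pd

# A Series holding real bytes objects: .str.decode works as expected.
raw = pd.Series([b"caf\xc3\xa9", b"plain"])
decoded = raw.str.decode("utf-8")

# A str that only *looks* like a bytes literal: it is text, so it has
# no .decode method and there is nothing to decode.
text = "b'plain'"
has_decode = hasattr(text, "decode")

print(decoded.tolist())
print(type(text).__name__, has_decode)
```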

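The file's contents are genuinely confusing: it contains strings that begin with "b'...", which misleads the reader into thinking they are bytes, but they are not.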

If you run df.Top1[0], you will see that it contains:

'b"Georgia \'downs two Russian warplanes\' as countries move to brink of war"'

type(df.Top1[0]) is simply str, so there is nothing to decode from UTF-8.
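One possible way to clean this up across the whole DataFrame is to parse each cell as a Python bytes literal with `ast.literal_eval` and then decode the result. The sketch below is only an illustration: the helper name `strip_bytes_literal` is hypothetical, the small frame stands in for the real CSV, and it assumes the headline columns are the ones whose names start with `Top`. Cells that are missing, are not wrapped in `b'...'`, or cannot be parsed are left unchanged.

```python
import ast

import pandas as pd


def strip_bytes_literal(value):
    """Return the text wrapped in a "b'...'" string; leave anything else as-is."""
    if not isinstance(value, str):
        return value  # e.g. NaN for a missing headline
    text = value.strip()
    if not (text.startswith("b'") or text.startswith('b"')):
        return value  # already a plain headline
    try:
        parsed = ast.literal_eval(text)  # "b'abc'" -> b'abc'
    except (ValueError, SyntaxError):
        return value  # not a valid literal; keep the original text
    return parsed.decode("utf-8") if isinstance(parsed, bytes) else value


# Stand-in for pd.read_csv('./data/Combined_News_DJIA.csv')
df = pd.DataFrame({
    "Date": ["2008-08-08"],
    "Top1": ['b"Georgia \'downs two Russian warplanes\' as countries move to brink of war"'],
    "Top2": ["A headline without the prefix"],
})

# Assumption: headline columns are named Top1, Top2, ...
headline_cols = [c for c in df.columns if c.startswith("Top")]
for col in headline_cols:
    df[col] = df[col].map(strip_bytes_literal)

print(df["Top1"][0])
```

Using `ast.literal_eval` rather than slicing off the first two and last characters also resolves escape sequences such as `\'` inside the literal, at the cost of being slower on large frames.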
