Question

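I have to remove hashtags and handles (e.g. @) as well as HTML links from a large amount of Twitter data in a CSV file. I am using the following code, but it raises an error. Any suggestions would be appreciated. Thanks.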

import re
import pandas as pd
corpus = pd.read_table('electionday.csv', delimiter=',', header=0, names=['text'])
' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ", corpus).split())

TypeError: expected string or bytes-like object

Here are some sample tweets:

If Joseph Gordon-Levitt or Joe Maganiello need to bail on the US after Trump wins- I've got a spare bedroom. Just sayin'. #Election2016

@millberry80 makes my head hurt, I'm angrier with Democrat establishment than Trump voters. Missed the chance to change USA for the better.

What scares me more than Trump is the Republican majority in congress They are going to undo the progress this country has made during Obama

Will the Peasants manage to stop Hillary destroying their jobs &amp; the US economy with TPP? html t.co/ImxVGYboE3

Answer 1

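re.sub works on strings or bytes, but you are passing it a DataFrame (which is what pd.read_table returns). You should iterate over the DataFrame (corpus) and call re.sub, along with the rest of your processing, on each cell.

Like this: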

# Load the CSV into a DataFrame
import re
import pandas as pd
corpus = pd.read_table('electionday.csv', delimiter=',', header=0, names=['text'])

# Walk through each data row:
for index, row in corpus.iterrows():
    # The tweet text itself:
    tweet_text = row['text']
    # Apply the substitutions to the text:
    # (You may want to store the result somewhere rather than just printing it; that's up to you.)
    print(' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet_text).split()))
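
If you want to keep the cleaned text rather than print it, one option is to wrap the substitution in a small function. The following is only a sketch under some assumptions: the names clean_tweet, PATTERN and samples are made up for illustration, and the sample strings are shortened stand-ins for real tweets. It also guards against non-string values, since an empty CSV cell is read by pandas as a float NaN, and passing that to re.sub raises the same TypeError.

```python
import re

# Same pattern as in the loop above, compiled once as a raw string.
PATTERN = re.compile(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)")

def clean_tweet(text):
    """Replace handles, URLs and punctuation with spaces.

    Non-string input (for example a float NaN from an empty CSV cell)
    yields an empty string instead of raising TypeError.
    """
    if not isinstance(text, str):
        return ""
    # Replace every match with a space, then collapse runs of whitespace.
    return " ".join(PATTERN.sub(" ", text).split())

# Hypothetical sample values, including one missing cell.
samples = [
    "@millberry80 makes my head hurt. #Election2016",
    "Read this https://t.co/ImxVGYboE3 now",
    float("nan"),
]
cleaned = [clean_tweet(s) for s in samples]
print(cleaned)
```

With the DataFrame from the answer, something like corpus['text'].apply(clean_tweet) should give you a cleaned column in one step. Note that this pattern strips only the # character and leaves the hashtag word in place; if you need the whole hashtag gone, the pattern would have to be extended.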

TypeError: expected string or bytes-like object when removing # and @ from tweet data in a CSV file
