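I am currently preprocessing tweets that I pulled through the Twitter API and saved as a CSV file. In the CSV, each tweet starts with stray characters such as b', and escape sequences such as aren\xe2\x80\x99t appear where an apostrophe should be ("aren't"). I would like to remove these characters, but despite several attempts I cannot figure out how. Can anyone help? I read the file with pandas in Python 3. The column is called "text".
Here is what I mean: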
b'RT @username: some text some text C\xe2\x80\xa6' OR
"b'RT @username: some text some text .A\xe2\x80\xa6'
Input 1:
df = pd.read_csv('Data/test.csv', encoding= 'utf8')
df['text'] = df['text'].str.replace('b[\s]+', ' ')
df['text'] = df['text'].str.replace('[^\x00-\x7F]+',' ')
df['text'] = df['text'].str.replace('[^\u0000-\uD7FF\uE000-\uFFFF]',' ')
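Output 1: nothing happens; the text is unchanged.
In the next snippet I tried to address the UTF-8 encoding by inspecting the column types, since I understand the data sometimes needs further processing.
Input 2: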
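One possible reason the replacements above do nothing: the column presumably holds the printed form of a Python bytes object, so the file contains a literal backslash followed by "xe2" (all plain ASCII) rather than real non-ASCII characters, and the non-ASCII patterns have nothing to match. A minimal sketch of one way to undo that, assuming every affected cell looks like the examples above, is to parse the cell back into bytes and decode it as UTF-8. The helper name decode_bytes_repr is hypothetical, not part of pandas.

```python
import ast


def decode_bytes_repr(value):
    """Turn a string such as "b'aren\\xe2\\x80\\x99t'" back into readable text.

    Hypothetical helper: values that do not look like a printed bytes
    literal are returned unchanged.
    """
    if not isinstance(value, str):
        return value  # e.g. NaN for an empty CSV cell
    candidate = value.strip()
    if not candidate.startswith(("b'", 'b"')):
        return value
    try:
        # literal_eval only parses Python literals; it does not run code.
        parsed = ast.literal_eval(candidate)
    except (ValueError, SyntaxError):
        return value  # malformed literal, leave it as it was
    if isinstance(parsed, bytes):
        # Replace undecodable bytes instead of raising on truncated tweets.
        return parsed.decode("utf-8", errors="replace")
    return value


# The raw string mimics what is stored in the CSV: backslashes are literal.
raw_cell = r"b'RT @username: aren\xe2\x80\x99t we C\xe2\x80\xa6'"
cleaned = decode_bytes_repr(raw_cell)
print(cleaned)
```

With the DataFrame from the question, this could be applied as df['text'] = df['text'].apply(decode_bytes_repr). If the data can be re-exported, decoding the bytes before writing the CSV would avoid the problem at the source.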
df = pd.read_csv('Data/Result_w8_Pfizer_en_test.csv', encoding= 'utf8')
df.apply(lambda x: pd.lib.infer_dtype(x.values))
Output 2:
AttributeError Traceback (most recent call last)
<ipython-input-50-4c6bdb11d736> in <module>
25
26 df = pd.read_csv('Data/test.csv', encoding= 'utf8') # dtype=string
---> 27 df.apply(lambda x: pd.lib.infer_dtype(x.values))
28
29
~/conda/lib/python3.6/site-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, result_type, args, **kwds)
6485 args=args,
6486 kwds=kwds)
-> 6487 return op.get_result()
6488
6489 def applymap(self, func):
~/conda/lib/python3.6/site-packages/pandas/core/apply.py in get_result(self)
149 return self.apply_raw()
150
--> 151 return self.apply_standard()
152
153 def apply_empty_result(self):
~/conda/lib/python3.6/site-packages/pandas/core/apply.py in apply_standard(self)
255
256 # compute the result using the series generator
--> 257 self.apply_series_generator()
258
259 # wrap results
~/conda/lib/python3.6/site-packages/pandas/core/apply.py in apply_series_generator(self)
284 try:
285 for i, v in enumerate(series_gen):
--> 286 results[i] = self.f(v)
287 keys.append(v.name)
288 except Exception as e:
<ipython-input-50-4c6bdb11d736> in <lambda>(x)
25
26 df = pd.read_csv('Data/test.csv', encoding= 'utf8')
---> 27 df.apply(lambda x: pd.lib.infer_dtype(x.values))
28
29
AttributeError: ("module 'pandas' has no attribute 'lib'", 'occurred at index date')
I did some research on this site, but could not find either the cause or a solution.
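Regarding the AttributeError in the traceback: pd.lib was a private module and is not available in the installed pandas version. The same check is exposed publicly as pandas.api.types.infer_dtype. A minimal sketch, using a small made-up DataFrame (the column names and values here are assumptions for illustration, not the real file):

```python
import pandas as pd
from pandas.api.types import infer_dtype

# Made-up stand-in for the CSV; only the shape of the data matters here.
sample = pd.DataFrame(
    {
        "date": ["2019-01-01", "2019-01-02"],
        "text": ["b'some text'", "b'more text'"],
        "retweets": [3, 7],
    }
)

# infer_dtype inspects the values of each column and returns a label
# describing what they look like.
inferred = sample.apply(lambda column: infer_dtype(column, skipna=True))
print(inferred)
```

If the text column is reported as "string", pandas read the b'...' values as ordinary text, which would be consistent with the escape sequences being literal characters in the file.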