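I have a dataframe with 500 texts in a column called "Text" (one text per row), and I want to count the most common words across all texts.
So far I have tried the following (both approaches are from Stack Overflow):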
pd.Series(' '.join(df['Text']).lower().split()).value_counts()[:100]
and
Counter(" ".join(df["Text"]).split()).most_common(100)
Both gave me the following error:
TypeError: sequence item 0: expected str instance, list found
I have also tried using
df.Text.apply(Counter)
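which gave me the word counts for each text. I also changed the Counter approach so that it returned the most common words in each text,
but I want the most common words overall.
Here is a sample of the dataframe (the text has already been lowercased, stripped of punctuation, tokenized, and had stop words removed):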
Datum File File_type Text length len_cleaned_text
Datum
2000-01-27 2000-01-27 _04.txt _04 [business, date, jan, heineken, starts, integr... 396 220
Edit: the "unlist" code
for file in file_list:
    name = file[len(input_path):]
    date = name[11:17]
    type_1 = name[17:20]
    with open(file, "r", encoding="utf-8", errors="surrogateescape") as rfile:
        format
        text = rfile.read()
        text = text.encode('utf-8', 'ignore')
        text = text.decode('utf-8', 'ignore')
    a={"File": name, "Text": text,'the':count_the, 'Datum': date, 'File_type': type_1, 'length':length,}
    result_list.append(a)
New cell
df['Text']= df['Text'].str.lower()
p = re.compile(r'[^\w\s]+')
d = re.compile(r'\d+')
for index, row in df.iterrows():
    df['Text']=df['Text'].str.replace('\n',' ')
    df['Text']=df['Text'].str.replace('################################ end of story 1 ##############################','')
df['Text'] = [p.sub('', x) for x in df['Text'].tolist()]
df['Text'] = [d.sub('', x) for x in df['Text'].tolist()]
df['Text']=df['Text'].apply(word_tokenize)
Datum File File_type Text length the
Datum
2000-01-27 2000-01-27 0864820040_000127_04.txt _04 [business, date, jan, heineken, starts, integr... 396 0
2000-02-01 2000-02-01 0910068040_000201_04.txt _04 [group, english, cns, date, feb, bat, acquisit... 305 0
2000-05-03 2000-05-03 1070448040_000503_04.txt _04 [date, may, cobham, plc, cob, acquisitionsdisp... 701 0
2000-05-11 2000-05-11 0865985020_000511_04.txt _04 [business, date, may, swedish, match, complete... 439 0
2000-11-28 2000-11-28 1067252020_001128_04.txt _04 [date, nov, intec, telecom, sys, itl, doc, pla... 158 0
2000-12-18 2000-12-18 1963867040_001218_04.txt _04 [associated, press, apw, date, dec, volvo, div... 367 0
2000-12-19 2000-12-19 1065767020_001219_04.txt _04 [date, dec, spirent, plc, spt, acquisition, co... 414 0
2000-12-21 2000-12-21 1076829040_001221_04.txt _04 [bloomberg, news, bn, date, dec, eni, ceo, cfo... 271 0
2001-02-06 2001-02-06 1084749020_010206_04.txt _04 [date, feb, chemring, group, plc, chg, acquisi... 130 0
2001-02-15 2001-02-15 1063497040_010215_04.txt _04 [date, feb, electrolux, ab, elxb, acquisition,... 420 0
And a description of the dataframe:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 557 entries, 2000-01-27 to 2017-10-06
Data columns (total 13 columns):
Datum 557 non-null datetime64[ns]
File 557 non-null object
File_type 557 non-null object
Text 557 non-null object
customers 557 non-null int64
grwoth 557 non-null int64
human 557 non-null int64
intagibles 557 non-null int64
length 557 non-null int64
synergies 557 non-null int64
technology 557 non-null int64
the 557 non-null int64
len_cleaned_text 557 non-null int64
dtypes: datetime64[ns](1), int64(9), object(3)
memory usage: 60.9+ KB
Thanks in advance.
Answer 0 (score: 1)
OK, got it. Your df['Text'] consists of lists of words, not strings. So you can do this:
full_list = [] # list containing all words of all texts
for elmnt in df['Text']: # loop over lists in df
    full_list += elmnt # append elements of lists to full list
val_counts = pd.Series(full_list).value_counts() # make temporary Series to count
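This solution avoids using too many list comprehensions, which keeps the code easy to read and understand. It also needs no additional modules such as re or collections.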
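The TypeError in the question comes from ' '.join being handed lists where it expects strings, so any fix has to flatten the lists first. As an alternative to the explicit loop, here is a minimal sketch that assumes pandas 0.25 or later, where Series.explode is available. The small dataframe is a hypothetical stand-in for the real one.

```python
import pandas as pd

# Hypothetical stand-in for the question's dataframe:
# each row of 'Text' holds a list of tokens, not a string.
df = pd.DataFrame({
    "Text": [
        ["business", "date", "jan"],
        ["date", "feb", "acquisition"],
        ["date", "acquisition"],
    ]
})

# explode() gives every token its own row; value_counts() then tallies them,
# sorted from most to least frequent.
val_counts = df["Text"].explode().value_counts()

# Keep only the most frequent words (here at most 100).
top_words = val_counts.head(100)
```

On a few hundred rows either version should be fast enough; the choice is mostly a matter of readability.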
Answer 1 (score: 1)
Here is my version: I convert the column values to a list, build a flat list of words from it, clean it, and then you have your counter:
import ast
import collections
your_text_list = df['Text'].tolist()
your_text_list_nan_rm = [x for x in your_text_list if str(x) != 'nan']
# ast.literal_eval expects each item to be a string such as "['a', 'b']"
flat_list = [inner for item in your_text_list_nan_rm for inner in ast.literal_eval(item)]
counter = collections.Counter(flat_list)
top_words = counter.most_common(100)
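Note that ast.literal_eval only accepts strings, so the snippet above fits a column whose lists were stored as text (for example after a round trip through a CSV file). When the cells are already Python lists, as in the question, the parsing step can be skipped. A minimal sketch covering both cases follows; the rows and the as_tokens helper are hypothetical, introduced only for illustration.

```python
import ast
from collections import Counter
from itertools import chain

# Hypothetical rows: two real token lists and one list stored as a string.
rows = [["date", "jan", "heineken"], ["date", "feb"], "['date', 'may']"]

def as_tokens(cell):
    """Return the cell as a list of tokens, parsing it only if it is a string."""
    return ast.literal_eval(cell) if isinstance(cell, str) else cell

# chain.from_iterable flattens the per-row lists into one stream of tokens.
counter = Counter(chain.from_iterable(as_tokens(cell) for cell in rows))
top_words = counter.most_common(2)
```

With a real dataframe, rows would be df['Text'].tolist().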
Answer 2 (score: 0)
You can do this with the apply and Counter.update methods:
from collections import Counter
import pandas as pd
counter = Counter()
df = pd.DataFrame({'Text': values})  # values: a list of token lists
_ = df['Text'].apply(lambda x: counter.update(x))  # add each row's tokens to the running tally
counter.most_common(10)
Out:
[('Amy', 3), ('was', 3), ('hated', 2),
('Kamal', 2), ('her', 2), ('and', 2),
('she', 2), ('She', 2), ('sent', 2), ('text', 2)]
where df['Text'] is:
0 [Amy, normally, hated, Monday, mornings, but, ...
1 [Kamal, was, in, her, art, class, and, she, li...
2 [She, was, waiting, outside, the, classroom, w...
3 [Hi, Amy, Your, mum, sent, me, a, text]
4 [You, forgot, your, inhaler]
5 [Why, don’t, you, turn, your, phone, on, Amy, ...
6 [She, never, sent, text, messages, and, she, h...
Name: Text, dtype: object
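A plain loop does the same job as apply here without relying on a side effect inside a lambda. The sketch below also lowercases each token, which would merge entries such as 'she' and 'She' that are counted separately in the output above. The two token lists are copied from the sample and serve only as hypothetical input.

```python
from collections import Counter

# Hypothetical token lists standing in for the rows of df['Text'].
texts = [
    ["Hi", "Amy", "Your", "mum", "sent", "me", "a", "text"],
    ["You", "forgot", "your", "inhaler"],
]

counter = Counter()
for tokens in texts:
    # update() accepts any iterable and adds its items to the running tally.
    counter.update(token.lower() for token in tokens)

top_words = counter.most_common(10)
```

With a real dataframe, the loop would iterate over df['Text'] directly.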