Python: return word counts for a list of 2,000 texts

Time: 2016-12-15 01:50:46

Tags: python regex pandas

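I'm almost certain I'm overlooking something very obvious, so I ask this fully expecting to be embarrassed: I have a pandas DataFrame with more than 2,000 texts in one column. My original goal was, and still is, to count the words in each text and create a new column in the DataFrame holding that word count.

To simplify the problem, I pulled the text column out into a list of strings with the following: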

texts = data.text.tolist()

The type is list and its len is 2113, which is the number of rows in the DataFrame. My current attempt is:

word_counts = []
for text in texts:
    count = len(re.findall(r"[a-zA-Z_]+", text))
    word_counts.append(count)

I get: TypeError: expected string or buffer

If I evaluate a single text:

len(re.findall(r"[a-zA-Z_]+", texts[0]))

I get the expected result: 2176.

What am I not seeing?

Edited to add a sample:

texts[0].split()[:10]

['Thank', 'you', 'so', 'much', 'Chris.', 'And', 
"it's", 'truly', 'a', 'great']

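These are transcripts of talks, so there is some punctuation and perhaps some numbers.

2 answers:

Answer 0: (score: 1)

You can write a function that returns the word count of each string and apply it to the pd.Series that contains the strings: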

import re
import pandas as pd

data = pd.DataFrame(
    {'text': ["This is-four words.", "This is five whole words."]})
data
#   text
# 0 This is-four words.
# 1 This is five whole words.

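def count_words(cell):
    try:
        # Count runs of letters/underscores as words.
        return len(re.findall(r"[a-zA-Z_]+", cell))
    except TypeError:
        # re.findall raises TypeError for non-string cells (e.g. NaN);
        # return such cells unchanged.
        return cell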

data['word_count'] = data['text'].apply(count_words)
data

#   text                        word_count
# 0 This is-four words.         4
# 1 This is five whole words.   5
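A likely cause of the TypeError in the question is an entry that is not a string (for example NaN from a missing transcript), although the question does not confirm this. Here is a minimal sketch for finding such entries; the sample list is made up and only stands in for the real `texts`:

```python
import re

# Hypothetical sample standing in for data.text.tolist(); the real data is not shown.
texts = ["Thank you so much Chris.", float("nan"), "And it's truly a great", None]

# Positions and type names of entries that re.findall cannot handle.
bad = [(i, type(t).__name__) for i, t in enumerate(texts) if not isinstance(t, str)]
print(bad)

# Count words only for real strings; fall back to 0 for everything else.
word_counts = [
    len(re.findall(r"[a-zA-Z_]+", t)) if isinstance(t, str) else 0
    for t in texts
]
print(word_counts)
```

Note that this pattern splits "it's" into two tokens ("it" and "s"), so the regex count can be slightly higher than a whitespace-based count.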

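However, if you know that the words in each text are separated only by whitespace (that is, not by underscores or dashes), I would recommend this approach instead: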

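def count_words2(cell):
    try:
        # Split on any run of whitespace and count the pieces.
        return len(cell.split())
    except AttributeError:
        # Non-string cells (e.g. NaN floats) have no .split();
        # return such cells unchanged.
        return cell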

count_words3 = lambda x: len(str(x).split())

It is much faster than using a regular expression. In a Jupyter notebook:

test_str = "test " * 1000
%timeit count_words(test_str)
%timeit count_words2(test_str)
%timeit count_words3(test_str)
# 10000 loops, best of 3: 158 µs per loop
# 10000 loops, best of 3: 29.8 µs per loop
# 10000 loops, best of 3: 28.7 µs per loop
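Outside Jupyter, roughly the same comparison can be run with the standard-library `timeit` module. This is only a sketch: the function names `count_regex` and `count_split` are chosen here for illustration, and the timings will vary by machine, so none are shown.

```python
import re
import timeit

test_str = "test " * 1000

def count_regex(text):
    # Regex-based count: runs of letters/underscores.
    return len(re.findall(r"[a-zA-Z_]+", text))

def count_split(text):
    # Whitespace-based count.
    return len(text.split())

# Both approaches agree on this input.
print(count_regex(test_str), count_split(test_str))

# Time 1000 calls of each function and report the average per call.
for fn in (count_regex, count_split):
    seconds = timeit.timeit(lambda: fn(test_str), number=1000)
    print(f"{fn.__name__}: {seconds / 1000 * 1e6:.1f} µs per call")
```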

Answer 1: (score: 1)

I don't think you need regex, nor do you need to pull the values out into a list. You can try using a lambda function. Start with a sample DataFrame:

df = pd.DataFrame({'col1': ['Hello world', 'Hello, there world', 'Hello']})

#                  col1
# 0         Hello world
# 1  Hello, there world
# 2               Hello

Then you can apply the lambda function:

df['count'] = df['col1'].apply(lambda x: len(str(x).split()))

#                  col1  count
# 0         Hello world      2
# 1  Hello, there world      3
# 2               Hello      1

Or, if you want to use regex, you can still do so inside a lambda function.
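The code for the regex variant did not survive in this answer, so the following is only a sketch of what it might look like, not the answerer's original code. It reuses the pattern from the question; the column names `count_lambda` and `count_str` are made up for illustration. The second column uses pandas' built-in `Series.str.count`, which counts regex matches per row without a Python-level lambda.

```python
import re
import pandas as pd

df = pd.DataFrame({'col1': ['Hello world', 'Hello, there world', 'Hello']})

pattern = r"[a-zA-Z_]+"

# Regex inside a lambda; str(x) guards against non-string cells.
df['count_lambda'] = df['col1'].apply(lambda x: len(re.findall(pattern, str(x))))

# Vectorized alternative: count regex matches in each string.
df['count_str'] = df['col1'].str.count(pattern)

print(df)
```

One design difference to be aware of: `str(x)` turns a missing value into the text 'nan', which is then counted as one word, whereas `Series.str.count` leaves missing values as NaN.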