Question

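I'm almost certain I'm overlooking something very obvious, so I ask this fully expecting to be embarrassed: I have a pandas dataframe with over 2000 texts in one column. My goal was, and still is, to count the words in each text and create a new column in the dataframe holding that word count.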

To simplify the problem, I pulled the text column out into a list of strings with:

texts = data.text.tolist()

Its type is list and its len is 2113, which is the number of rows in the dataframe. My current attempt is:

word_counts = []
for text in texts:
    count = len(re.findall(r"[a-zA-Z_]+", text))
    word_counts.append(count)

I get: TypeError: expected string or buffer.

If I evaluate a single text:

len(re.findall(r"[a-zA-Z_]+", texts[0]))

I get the expected result: 2176.

What am I not seeing?

Edited to add a sample:

texts[0].split()[:10]

['Thank', 'you', 'so', 'much', 'Chris.', 'And', 
"it's", 'truly', 'a', 'great']

These are transcripts of talks, so there is some punctuation and perhaps a few numbers.

Answer 1

You can write a function that returns the word count of a string and apply it to the pd.Series that holds the strings.

import re
import pandas as pd

data = pd.DataFrame(
    {'text': ["This is-four words.", "This is five whole words."]})
data
#   text
# 0 This is-four words.
# 1 This is five whole words.

def count_words(cell):
    # Count runs of letters/underscores in the cell.
    try:
        return len(re.findall(r"[a-zA-Z_]+", cell))
    except TypeError:
        # re.findall raises TypeError for non-string cells (e.g. NaN);
        # pass those through unchanged.
        return cell

data['word_count'] = data['text'].apply(count_words)
data

#   text                        word_count
# 0 This is-four words.         4
# 1 This is five whole words.   5
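The TypeError in the question suggests that at least one cell is not a string. A common cause, though only an assumption here, is a missing transcript stored as NaN (a float). The sketch below uses a made-up three-row frame to show one way to find such rows; the sample data and the names `not_str`, `bad_rows` and `bad_types` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the middle row has no transcript, stored as NaN.
data = pd.DataFrame(
    {"text": ["Thank you so much Chris.", np.nan, "It's truly a great honor."]})

# Boolean mask marking the cells that are not Python strings.
not_str = ~data["text"].apply(lambda cell: isinstance(cell, str))

# The offending rows, plus the type actually stored in each one.
bad_rows = data.loc[not_str, "text"]
bad_types = bad_rows.apply(lambda cell: type(cell).__name__)
print(bad_types)
```

Once the non-string rows are known, you can decide whether to drop them, fill them with an empty string, or let a function like count_words pass them through.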

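However, if you know that the words in each text are separated only by whitespace (i.e. not by dashes or other punctuation), I would recommend this approach instead: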

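def count_words2(cell):
    # str.split() with no arguments splits on any run of whitespace.
    try:
        return len(cell.split())
    except AttributeError:
        # Non-string cells (e.g. NaN floats) have no .split(); pass them through.
        return cell

# Alternative: coerce to str first (note that NaN then counts as the one word "nan").
count_words3 = lambda x: len(str(x).split())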

It is much faster than the regex version, roughly five times in the timings below. In a Jupyter notebook:

test_str = "test " * 1000
%timeit count_words(test_str)
%timeit count_words2(test_str)
%timeit count_words3(test_str)
# 10000 loops, best of 3: 158 µs per loop
# 10000 loops, best of 3: 29.8 µs per loop
# 10000 loops, best of 3: 28.7 µs per loop
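As an alternative that is not part of the answer above, pandas' own string accessor can produce both counts without a helper function. This is a minimal sketch on made-up data; the column names `ws_count` and `re_count` are hypothetical, and no claim is made here about how its speed compares with the timings above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample, including one missing text.
data = pd.DataFrame(
    {"text": ["This is-four words.", "This is five whole words.", np.nan]})

# Whitespace-based count: split each string, then take the length of each list.
data["ws_count"] = data["text"].str.split().str.len()

# Regex-based count: number of non-overlapping pattern matches per string.
data["re_count"] = data["text"].str.count(r"[a-zA-Z_]+")

print(data)
```

Missing values stay missing in both new columns instead of raising an error. The two counts differ on the first row because "is-four" is one whitespace-separated token but two regex matches.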

Answer 2

I don't think you need a regular expression, and you don't need to pull the values out into a list either. You can use a lambda function. Given this dataframe:

df = pd.DataFrame({'col1': ['Hello world', 'Hello, there world', 'Hello']})
df
#                  col1
# 0         Hello world
# 1  Hello, there world
# 2               Hello

you can apply a lambda to the column:

df['count'] = df['col1'].apply(lambda x: len(str(x).split()))
df
#                  col1  count
# 0         Hello world      2
# 1  Hello, there world      3
# 2               Hello      1

If you would rather use a regex, you can still do that inside the lambda.
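The regex variant of the lambda is not shown in this answer, so what follows is only a guess at what it might look like: a sketch that reuses the pattern from the question and the same `str(x)` coercion as the whitespace version.

```python
import re
import pandas as pd

df = pd.DataFrame({"col1": ["Hello world", "Hello, there world", "Hello"]})

# Count runs of letters/underscores; str(x) keeps re.findall from raising
# TypeError on non-string cells (a NaN cell becomes "nan" and counts as 1).
df["count"] = df["col1"].apply(
    lambda x: len(re.findall(r"[a-zA-Z_]+", str(x))))

print(df)
```

On this sample the regex and whitespace counts agree; they would diverge on text such as "is-four", where the regex sees two words and split() sees one.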

Python: return word counts for a list of 2000 texts