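I'm almost certain I'm overlooking something very obvious, so I'm asking this half expecting to be embarrassed: I have a pandas DataFrame with over 2,000 texts in one column. My goal was, and still is, to count the words in each text and create a new column in the DataFrame holding that word count.
To simplify the problem, I pulled the text column out into a list of strings with: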
texts = data.text.tolist()
The type is list and the len of the list is 2113, which is the number of rows in the DataFrame. My current attempt is:
import re

word_counts = []
for text in texts:
    count = len(re.findall(r"[a-zA-Z_]+", text))
    word_counts.append(count)
I get: TypeError: expected string or buffer.
If I evaluate a single text:
len(re.findall(r"[a-zA-Z_]+", texts[0]))
I get the expected result: 2176.
What am I missing?
Edited to add a sample:
texts[0].split()[:10]
['Thank', 'you', 'so', 'much', 'Chris.', 'And',
"it's", 'truly', 'a', 'great']
These are transcripts of talks, so there is some punctuation and perhaps some numbers.
Answer 0 (score: 1)
You can write a function that returns the word count of each string, and apply that function to the pd.Series containing the strings.
import re
import pandas as pd

data = pd.DataFrame(
    {'text': ["This is-four words.", "This is five whole words."]})
data
#                         text
# 0        This is-four words.
# 1  This is five whole words.

def count_words(cell):
    # re.findall raises TypeError for non-string cells (e.g. NaN),
    # so those cells are returned unchanged.
    try:
        return len(re.findall(r"[a-zA-Z_]+", cell))
    except TypeError:
        return cell

data['word_count'] = data['text'].apply(count_words)
data
#                         text  word_count
# 0        This is-four words.           4
# 1  This is five whole words.           5
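The TypeError in the question is what re.findall raises when it is handed something that is not a string. The following is a minimal sketch, assuming (the thread does not confirm this) that the column contains missing values such as NaN; the sample data is hypothetical. It shows how to locate the non-string rows and one way to count around them.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical data: one missing value mixed in with real strings.
data = pd.DataFrame({"text": ["Thank you so much", np.nan, "a great day"]})

# Locate the rows whose cell is not a string.
not_str = ~data["text"].apply(lambda cell: isinstance(cell, str))
bad_rows = data.index[not_str].tolist()

# Passing the float NaN to re.findall raises TypeError,
# which reproduces the failure seen in the loop.
try:
    re.findall(r"[a-zA-Z_]+", data["text"].iloc[1])
    raised = False
except TypeError:
    raised = True

# One option: treat missing texts as empty strings before counting.
counts = (
    data["text"]
    .fillna("")
    .apply(lambda s: len(re.findall(r"[a-zA-Z_]+", s)))
    .tolist()
)
```

Whether a missing text should count as zero words or stay missing depends on what the new column is used for; count_words above keeps the original cell instead.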
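However, if you know that the words in each text are separated only by whitespace (that is, not by underscores or dashes), I would recommend this approach instead: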
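def count_words2(cell):
    # Non-string cells (e.g. NaN) have no .split method and raise
    # AttributeError, so those cells are returned unchanged.
    try:
        return len(cell.split())
    except AttributeError:
        return cell

# Alternative: coerce every cell to str first, so nothing raises.
count_words3 = lambda x: len(str(x).split())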
It is considerably faster than the regex version (roughly 5x in the timings below). In a Jupyter notebook:
test_str = "test " * 1000
%timeit count_words(test_str)
%timeit count_words2(test_str)
%timeit count_words3(test_str)
# 10000 loops, best of 3: 158 µs per loop
# 10000 loops, best of 3: 29.8 µs per loop
# 10000 loops, best of 3: 28.7 µs per loop
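Both counting styles can also be written with pandas' vectorized string methods instead of apply. This is a sketch, not part of the original answer, using the same sample data; the column names count_split and count_regex are made up for illustration. No timing claim is made here, so it is worth measuring on your own data.

```python
import pandas as pd

data = pd.DataFrame(
    {"text": ["This is-four words.", "This is five whole words."]})

# Whitespace-delimited count: split each string, then take the list length.
data["count_split"] = data["text"].str.split().str.len()

# Regex count: number of non-overlapping pattern matches per cell.
data["count_regex"] = data["text"].str.count(r"[a-zA-Z_]+")
```

Note that the two columns disagree on the first row: splitting on whitespace treats "is-four" as one word, while the regex counts it as two.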
Answer 1 (score: 1)
I don't think you need a regex, and you don't need to pull the values out into a list either. You can use apply with a lambda function.
For example, given this DataFrame:
df = pd.DataFrame({'col1': ['Hello world', 'Hello, there world', 'Hello']})
df
#                  col1
# 0         Hello world
# 1  Hello, there world
# 2               Hello
you can add a word-count column with a lambda function:
df['count'] = df['col1'].apply(lambda x: len(str(x).split()))
df
#                  col1  count
# 0         Hello world      2
# 1  Hello, there world      3
# 2               Hello      1
Or, if you want to use a regex, you can still do so with a lambda function.
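The code for the regex variant did not survive in this copy of the answer, so the following is only a sketch of what it might look like, reusing the pattern from the question. The column name count_re is hypothetical.

```python
import re

import pandas as pd

df = pd.DataFrame({"col1": ["Hello world", "Hello, there world", "Hello"]})

# Count regex matches per cell. str(x) avoids a TypeError on non-string
# values, but note that NaN becomes the string "nan" and counts as one word.
df["count_re"] = df["col1"].apply(
    lambda x: len(re.findall(r"[a-zA-Z_]+", str(x))))
```

On this sample the regex and the whitespace split give the same counts, because the only punctuation is a comma attached to a word.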