Question

Below I created a dummy dataset with ID and text columns, where the text column holds strings containing the names of some companies.

    # create a dummy data frame with ID and text columns
    import pandas as pd

    x = [1, 2, 3, 4, 5]
    y = ['apple google microsoft spotify alibaba', 'google microsoft', 'spotify google microsoft amazon', 'amazon google apple', 'amazon google spotify amazon']
    df = pd.DataFrame({'ID': x, 'text': y})
    df

I also have a separate list that contains company names:

    # create list of companies
    listtry = ['apple', 'google', 'microsoft', 'spotify', 'alibaba', 'amazon', 'structo']

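What I want is to count the number of rows of the main data frame's text column in which each company appears, not the total number of times it occurs across all the strings in that column.

The code below gives the total number of occurrences: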

    # search and count
    df2 = list()
    for company in listtry:
        df2.append(df.text.str.count(company).sum())
    df3 = pd.DataFrame({'company': listtry, 'count': df2})
    df4 = df3.sort_values('count', ascending=False)
    df4

# gives results

     company  count
1     google      5
5     amazon      4
2  microsoft      3
3    spotify      3
0      apple      2
4    alibaba      1
6    structo      0

The expected count for amazon is 3, because it appears in only 3 rows. It occurs twice in the last string, which is why the total above is 4.

Answer 1

Another approach: change count to contains and take the length of the filtered df:

    df2 = list()
    for company in listtry:
        df2.append(len(df[df.text.str.contains(company)]))  # only changes here
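
For reference, here is a minimal self-contained sketch of the same idea, using the data from the question. Summing the boolean mask returned by str.contains is equivalent to taking the length of the filtered frame. The name row_counts and the regex=False argument are my own additions, not part of the answer above; regex=False just makes the match a literal substring test.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "text": [
        "apple google microsoft spotify alibaba",
        "google microsoft",
        "spotify google microsoft amazon",
        "amazon google apple",
        "amazon google spotify amazon",
    ],
})
listtry = ["apple", "google", "microsoft", "spotify", "alibaba", "amazon", "structo"]

# For each company, build a True/False mask per row and count the True values.
# A row counts once no matter how many times the name occurs in it.
row_counts = {
    company: int(df["text"].str.contains(company, regex=False).sum())
    for company in listtry
}

result = pd.DataFrame(
    {"company": list(row_counts), "count": list(row_counts.values())}
).sort_values("count", ascending=False)
print(result)
```

Note that this is substring matching, so a name like "apple" would also match a row containing "pineapple". With the data in the question that does not matter.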

Answer 2

Why don't you use set to remove duplicates? (See the third line.)
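
A rough sketch of how the set idea could look, assuming the strings are whitespace-separated company names as in the question. Here texts stands in for the values of the text column (for example df['text'].tolist()), and counter and row_counts are illustrative names only.

```python
from collections import Counter

texts = [
    "apple google microsoft spotify alibaba",
    "google microsoft",
    "spotify google microsoft amazon",
    "amazon google apple",
    "amazon google spotify amazon",
]
listtry = ["apple", "google", "microsoft", "spotify", "alibaba", "amazon", "structo"]

counter = Counter()
for text in texts:
    # set() drops repeated words within one row, so each row adds at most 1 per company.
    counter.update(set(text.split()))

# Counter returns 0 for names that never appear, such as 'structo'.
row_counts = {company: counter[company] for company in listtry}
print(row_counts)
```

Unlike str.contains, this compares whole words, so it will not count a company name that only appears as part of a longer word.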

Count of words: unique occurrences per row in a string column
