I am reading a CSV file in Python that has many disease names in one column and the associated researchers in another. The file looks like this:
[Table 1]
Terms Researcher
1.Asthma Dr. Roberts
2.Brochial cancer Dr. Lee
3.HIV Dr.Roberts
4.HIV Dr. Lee
5.Influenzae Dr. Wang
6.Bronchial Cancer Dr. Wang
7.Influenzae Dr. Roberts
8.dengue prof. christopher
9.Arthritis prof. swaminathan
10.Arthritis prof. christopher
11.Asthma Dr. Roberts
12.HIV Dr. Lee
13.Bronchial Cancer Dr. Wang
14.dengue prof. christopher
15.HIV prof. christopher
16.HIV Dr. Lee
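I want my code to go through every row and increment a frequency count of the terms associated with each researcher, so that when a user enters the term they are looking for, they get an output table like this: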
Term you are looking for : HIV
Names of the researchers Frequency
Dr. Roberts 1
Dr. Lee 3
prof. christopher 1
Now here is what I am doing:
In[1]:
import pandas as pd
import numpy as np
data = pd.read_csv("Researchers Title Terms.csv")
data.head()
This gives me [Table 1]. Then I do this:
In[2]:
term = input("Enter the term you are looking for:")
term = term.lower()
list_of_terms = []
for row in data:
    if row[data.Terms] == term
        researcher1 += 1
    elif data.Terms == term
        researcher2 += 1
    elif data.Terms == term
        researcher3 += 1
    else
        print("Sorry!", term, "not found in the database!")
print("Term you are looking for : ", term)
print("Dr. Roberts:", researcher1)
print("Dr. Lee:", researcher2)
print("prof. christopher:", researcher3)
All I get is:
File "<ipython-input-9-b85d0d187059>", line 5
if row[data.Terms] == term
^
SyntaxError: invalid syntax
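I am a beginner in Python programming, so I am not sure whether my logic is completely wrong or whether there really is just a syntax error. Any help would be greatly appreciated. I am posting this to the community after trying a few things and getting no output. Thanks in advance!
Answer 0 (score: 5)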
groupby and value_counts
Simple and intuitive:
df.Terms = df.Terms.str.replace(r'\d+\.\s*', '', regex=True).str.upper()
df.Researcher = df.Researcher.str.title()
s = df.groupby('Terms').Researcher.value_counts()
s
Terms Researcher
ARTHRITIS Prof. Christopher 1
Prof. Swaminathan 1
ASTHMA Dr. Roberts 2
BROCHIAL CANCER Dr. Lee 1
BRONCHIAL CANCER Dr. Wang 2
DENGUE Prof. Christopher 2
HIV Dr. Lee 3
Dr.Roberts 1
Prof. Christopher 1
INFLUENZAE Dr. Roberts 1
Dr. Wang 1
Name: Researcher, dtype: int64
You can then use loc or xs to look up a single term:
s.loc['HIV']
Researcher
Dr. Lee 3
Dr.Roberts 1
Prof. Christopher 1
Name: Researcher, dtype: int64
s.xs('HIV')
Researcher
Dr. Lee 3
Dr.Roberts 1
Prof. Christopher 1
Name: Researcher, dtype: int64
pd.factorize and np.bincount
The result can be accessed in the same way as above.
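The code for this variant did not survive in the thread. The following is only a rough sketch of how pd.factorize and np.bincount could be combined for this kind of count, not the answerer's original code. The sample frame and the names t_codes, r_codes and table are made up for illustration.

```python
import numpy as np
import pandas as pd

# Small made-up sample in the same shape as the question's data.
df = pd.DataFrame({
    "Terms": ["HIV", "HIV", "ASTHMA", "HIV", "ASTHMA"],
    "Researcher": ["Dr. Lee", "Dr. Roberts", "Dr. Roberts", "Dr. Lee", "Dr. Roberts"],
})

# Turn each column into integer codes plus the list of distinct values.
t_codes, t_uniques = pd.factorize(df["Terms"])
r_codes, r_uniques = pd.factorize(df["Researcher"])

# Combine both codes into one flat code per row and count occurrences.
n_terms, n_researchers = len(t_uniques), len(r_uniques)
flat = np.bincount(t_codes * n_researchers + r_codes,
                   minlength=n_terms * n_researchers)

# Reshape the flat counts into a terms x researchers table.
table = pd.DataFrame(flat.reshape(n_terms, n_researchers),
                     index=t_uniques, columns=r_uniques)

print(table.loc["HIV"])
```

Unlike the groupby result, this table also holds the zero counts, so table.loc["ASTHMA"] lists Dr. Lee with 0.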
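Answer 1 (score: 1)
In Python, when you write if, elif, else, for loops and so on, the correct syntax is to end the opening line with a colon. So in your code, you need to update it to the following: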
for row in data:
    if row[data.Terms] == term:
        researcher1 += 1
    elif data.Terms == term:
        researcher2 += 1
    elif data.Terms == term:
        researcher3 += 1
    else:
        print("Sorry!", term, "not found in the database!")
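Also, once you correct this, it looks like your code will still have a bug. You convert the user input to lowercase, but you do not do the same to the data read from the CSV file, so none of the terms will equal the user input.
Answer 2 (score: 1)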
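To illustrate the case issue, here is a minimal sketch of a plain loop that lowercases both sides before comparing and keeps one counter per researcher in a dictionary. The rows list and the function name count_researchers are hypothetical stand-ins for the CSV contents.

```python
from collections import defaultdict

# Made-up rows standing in for the CSV contents: (term, researcher).
rows = [
    ("HIV", "Dr. Roberts"),
    ("HIV", "Dr. Lee"),
    ("Asthma", "Dr. Roberts"),
    ("hiv", "Dr. Lee"),
]


def count_researchers(rows, term):
    """Count matching rows per researcher, ignoring case and outer spaces."""
    wanted = term.strip().lower()
    counts = defaultdict(int)
    for row_term, researcher in rows:
        # Normalize the stored term the same way as the user input.
        if row_term.strip().lower() == wanted:
            counts[researcher] += 1
    return dict(counts)


result = count_researchers(rows, "HIV")
if result:
    for name, freq in result.items():
        print(name, freq)
else:
    print("Sorry! Term not found in the database!")
```

With a DataFrame, the same function could presumably be fed with zip(data.Terms, data.Researcher) instead of the hand-written list.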
Read the data into pandas. Take the input, then filter, groupby and size give the desired result:
term = input("Enter the term you are looking for:")
data[data.Terms.str.lower() == term.lower()].groupby('Researcher').size()
# Output with term = 'HIV'
Dr. Lee 3
Dr.Roberts 1
prof. christopher 1
dtype: int64
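With this approach, researchers who are not associated with the term (i.e. size == 0) are not shown.
To also show the researchers with a zero count for the term, first build a dataframe of researchers, then join the result dataframe onto it.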
researchers = pd.DataFrame({'Researcher': data.Researcher.unique()})
out = data[data.Terms.str.lower() == term.lower()].groupby('Researcher').agg({'Terms': 'size'})
# reset_index() turns the Researcher index back into a column so merge has a shared key
pd.merge(researchers, out.reset_index(), how='outer').fillna(0).sort_values('Terms', ascending=False)
# outputs:
Researcher Terms
1 Dr. Lee 3.0
2 Dr.Roberts 1.0
4 prof. christopher 1.0
0 Dr. Roberts 0.0
3 Dr. Wang 0.0
5 prof. swaminathan 0.0
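As a possible alternative to the merge, the zero counts can also be filled in with reindex. This is a different technique from the one in the answer, shown here only as a sketch on made-up data. It keeps the counts as integers rather than the floats that fillna produces above.

```python
import pandas as pd

# Made-up sample in the same shape as the question's data.
data = pd.DataFrame({
    "Terms": ["HIV", "HIV", "Asthma", "HIV", "dengue"],
    "Researcher": ["Dr. Lee", "Dr. Roberts", "Dr. Roberts", "Dr. Lee", "Dr. Wang"],
})
term = "hiv"

# Keep only the rows for the requested term, ignoring case.
matches = data[data.Terms.str.lower() == term.lower()]

# Count per researcher, then add the missing researchers with a count of 0.
counts = (matches.groupby("Researcher").size()
          .reindex(data.Researcher.unique(), fill_value=0)
          .sort_values(ascending=False))

print(counts)
```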
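Answer 3 (score: 1)
You can loop over the dataframe in a way similar to what you are doing, but since you are using pandas, it is probably worth taking advantage of pandas functions. They are usually much faster than iterating, and the code looks cleaner.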
term_of_interest = 'HIV'
(df.groupby('Researcher')
.apply(lambda x: x.Terms.str.contains(term_of_interest)
.sum())
.rename('Frequency').to_frame())
Frequency
Researcher
Dr. Lee 3
Dr. Roberts 0
Dr. Wang 0
Dr.Roberts 1
prof. christopher 1
prof. swaminathan 0
Answer 4 (score: 0)
from collections import Counter
from pprint import pprint
if __name__ == '__main__':
    docs = ["Dr.Roberts",
            "Dr.Lee",
            "Dr.Roberts",
            "Dr.Lee",
            "Dr.Wang",
            "Dr.Wang",
            "Dr.Roberts",
            "prof.christopher",
            "prof.swaminathan",
            "prof.christopher",
            "Dr.Roberts",
            "Dr.Lee",
            "Dr.Wang",
            "prof.christopher",
            "prof.christopher",
            "Dr.Lee"]
    pprint(Counter(docs).most_common(5))
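The snippet above counts every researcher regardless of the term. A hedged sketch of how the same Counter idea could be restricted to one term follows. The rows list and the function name researcher_counts are invented for illustration and mirror only the HIV and Asthma rows of the question's table.

```python
from collections import Counter

# Made-up (term, researcher) pairs mirroring part of the question's table.
rows = [
    ("Asthma", "Dr. Roberts"),
    ("HIV", "Dr. Roberts"),
    ("HIV", "Dr. Lee"),
    ("HIV", "Dr. Lee"),
    ("HIV", "prof. christopher"),
    ("HIV", "Dr. Lee"),
]


def researcher_counts(rows, term):
    """Count researchers over the rows whose term matches, ignoring case."""
    wanted = term.lower()
    return Counter(researcher for row_term, researcher in rows
                   if row_term.lower() == wanted)


counts = researcher_counts(rows, "hiv")
for name, freq in counts.most_common():
    print(f"{name}\t{freq}")
```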