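I have a dataframe created with pandas. One column of the dataframe holds URLs, and I want to match and count specific occurrences in it.
My logic is: if the match does not return None, then print('Match'). At this stage that does not seem to work. Below is a sample of my data and my current code. I would appreciate any tips on how to match values with pandas, as I have just come from using a lot of R and do not have much experience with pandas and dataframes in Python.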
Title,URL,Date,Unique Pageviews
Preparing and Starting DS career,http://www.datasciencecentral.com/forum/topic/show?id=6448529:Topic:242750,20-Jan-15,163
The Rogue Data Scientist,http://www.datasciencecentral.com/forum/topic/show?id=6448529:Topic:273425,4-May-15,1108
Is it safe to code after one bottle of wine?,http://www.datasciencecentral.com/forum/topic/show?id=6448529:Topic:349416,9-Nov-15,1736
Short-Term Forecasting of Electricity Demand,http://www.datasciencecentral.com/forum/topic/show?id=6448529:Topic:350421,12-Nov-15,1117
Visual directory of 339 tools. Wow!,http://www.datasciencecentral.com/forum/topic/show?id=6448529:Topic:373786,14-Jan-16,4228
8 Types of Data,http://www.datasciencecentral.com/forum/topic/show?id=6448529:Topic:377008,23-Jan-16,2829
Very funny video for people who write code,http://www.datasciencecentral.com/forum/topic/show?id=6448529:Topic:379578,30-Jan-16,2444
Code block (PEP 8 requires two blank lines between functions):
import re
import numpy as np
import pandas as pd


def count_set_words(as_pandas):
    reg_exp = re.match('\b/forum', as_pandas['URL']).any()
    if as_pandas['URL'].str.match(reg_exp, case=False, flags=0, na=np.NAN).any():
        print("Match")


def set_new_columns(as_pandas):
    titles_list = ['Year > 2014', 'Forum', 'Blog', 'Python', 'R',
                   'Machine_Learning', 'Data_Science', 'Data', 'Analytics']
    for number, word in enumerate(titles_list):
        as_pandas.insert(len(as_pandas.columns), titles_list[number], 0)


def open_as_dataframe(file_name_in):
    reader = pd.read_csv(file_name_in, encoding='windows-1251')
    return reader


def main():
    multi_sets = open_as_dataframe('HDT_data5.txt')
    set_new_columns
    count_set_words(multi_sets)


main()
Answer 0 (score: 1)
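In the first line of count_set_words, reg_exp is not a regular expression. It is meant to hold the result of checking whether the elements of the URL column match '\b/forum'. Note that re.match expects a single string rather than a whole column, so the check has to be applied to one value at a time. I think something like this: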
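import re
import pandas as pd

df = pd.read_csv(file_name_in, encoding='windows-1251')
for ix, row in df.iterrows():
    # re.search scans the whole URL; re.match would only test the start ('http://...')
    if re.search(r'\b/forum', row['URL']) is not None:
        print('this is a match')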
will solve your problem.
Or, more simply:
# Each value passed to the lambda is a single URL string from the column
df['is_a_match'] = df['URL'].apply(lambda url: re.search(r'\b/forum', url) is not None)
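Since the question also asks for a count, here is a minimal sketch of a vectorized alternative using pandas' own string methods. It assumes the column is named URL as in the sample data. The two-row dataframe is a hypothetical stand-in for the real file, and the second URL is made up purely to provide a non-matching row.

```python
import pandas as pd

# Hypothetical stand-in for the real CSV; the second URL is invented for contrast
df = pd.DataFrame({
    "Title": ["The Rogue Data Scientist", "Some blog post"],
    "URL": [
        "http://www.datasciencecentral.com/forum/topic/show?id=6448529:Topic:273425",
        "http://www.datasciencecentral.com/profiles/blogs/example",
    ],
})

# str.contains searches anywhere in each string (like re.search), row by row;
# na=False treats missing URLs as non-matches
is_forum = df["URL"].str.contains(r"\b/forum", case=False, na=False, regex=True)

# Store the flag as 0/1, matching the zero-initialised 'Forum' column in the question
df["Forum"] = is_forum.astype(int)

# Summing a boolean Series counts the True values
forum_count = int(is_forum.sum())
print(forum_count)
```

This avoids the explicit Python loop, and the same pattern could be repeated for the other flag columns (for example 'Blog') by changing the pattern string.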