Pandas DataFrame: matching words in a URL

Time: 2019-07-07 13:04:47

Tags: python pandas

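I have a data frame created with pandas. One of its columns contains URLs, and I want to match specific words in those URLs and count how often they occur.

My logic is: if the match does not return None, then print('Match'). At this stage that does not seem to work. Below are a sample of the data and my current code. Any tips on how to match values with pandas would be much appreciated, as I am just coming back from using a lot of R and do not have much experience with pandas and data frames in Python.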

Title,URL,Date,Unique Pageviews
Preparing and Starting DS career,http://www.datasciencecentral.com/forum/topic/show?id=6448529:Topic:242750,20-Jan-15,163
The Rogue Data Scientist,http://www.datasciencecentral.com/forum/topic/show?id=6448529:Topic:273425,4-May-15,1108
Is it safe to code after one bottle of wine?,http://www.datasciencecentral.com/forum/topic/show?id=6448529:Topic:349416,9-Nov-15,1736
Short-Term Forecasting of Electricity Demand,http://www.datasciencecentral.com/forum/topic/show?id=6448529:Topic:350421,12-Nov-15,1117
Visual directory of 339 tools. Wow!,http://www.datasciencecentral.com/forum/topic/show?id=6448529:Topic:373786,14-Jan-16,4228
8 Types of Data,http://www.datasciencecentral.com/forum/topic/show?id=6448529:Topic:377008,23-Jan-16,2829
Very funny video for people who write code,http://www.datasciencecentral.com/forum/topic/show?id=6448529:Topic:379578,30-Jan-16,2444

Code block (PEP 8 requires two blank lines between functions)

def count_set_words(as_pandas):
    reg_exp = re.match('\b/forum', as_pandas['URL']).any()
        if as_pandas['URL'].str.match(reg_exp, case=False, flags=0, na=np.NAN).any():
            print("Match")


def set_new_columns(as_pandas):
   titles_list = ['Year > 2014', 'Forum', 'Blog', 'Python', 'R',
               'Machine_Learning', 'Data_Science', 'Data', 'Analytics']
   for number, word in enumerate(titles_list):
       as_pandas.insert(len(as_pandas.columns), titles_list[number], 0)


def open_as_dataframe(file_name_in):
    reader = pd.read_csv(file_name_in, encoding='windows-1251')
    return reader


def main():
    multi_sets = open_as_dataframe('HDT_data5.txt')
    set_new_columns
    count_set_words(multi_sets)


main()

1 Answer:

Answer 0: (score: 1)

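reg_exp in the first line of count_set_words is not a regular expression; that line is itself checking whether the elements of the URL column match '\b/forum'. I think something like this: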

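import re
import pandas as pd

df = pd.read_csv('HDT_data5.txt', encoding='windows-1251')
for ix, row in df.iterrows():
    # re.search scans the whole URL; re.match would only test its start
    if re.search(r'\b/forum', row['URL']) is not None:
        print('this is a match')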

will solve your problem.

Or, more simply:

df['is_a_match'] = df['URL'].apply(lambda url: re.search(r'\b/forum', url) is not None)
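
Since the question also asks for a count, a vectorized variant may be more convenient. The following is only a minimal sketch, not tested against the original HDT_data5.txt: the three sample rows stand in for the CSV, and the third title and URL are made up for illustration. It also shows why the pattern needs care: in a plain (non-raw) string '\b' is a backspace character rather than a word boundary, and re.match only tests the start of the string, so '/forum' in the middle of a URL is found by re.search but not by re.match.

```python
import re

import pandas as pd

# Hypothetical sample rows in the shape of the question's CSV.
df = pd.DataFrame({
    'Title': ['The Rogue Data Scientist', '8 Types of Data', 'A blog post'],
    'URL': [
        'http://www.datasciencecentral.com/forum/topic/show?id=6448529:Topic:273425',
        'http://www.datasciencecentral.com/forum/topic/show?id=6448529:Topic:377008',
        'http://www.datasciencecentral.com/profiles/blogs/example',  # made-up URL
    ],
})

url = df.loc[0, 'URL']
# re.match anchors at the start of the string, so it misses '/forum' mid-URL;
# re.search scans the whole string.
match_at_start = re.match(r'/forum', url)
match_anywhere = re.search(r'/forum', url)

# Vectorized check: one boolean per row, without a Python-level loop.
# na=False makes rows with a missing URL count as "no match".
df['Forum'] = df['URL'].str.contains('/forum', case=False, na=False).astype(int)

# Summing the 0/1 flags gives the number of matching rows.
forum_count = int(df['Forum'].sum())
print(forum_count)
```

The same str.contains call could be repeated for the other keywords in titles_list from the question, assuming each keyword is meant to be searched for in the URL column.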