Search text for keywords and create a DataFrame column for each keyword found?

Time: 2017-07-27 12:07:21

Tags: python-3.x pandas web-scraping

I am searching some web pages for keywords. Thanks again to @Abdou for helping me with silent error handling! Here is an example:

# this is basically what I do
import pandas as pd
import requests

data = [{"URLs" : "https://www.mercedes-benz.de", "electric" : 1},
        {"URLs" : "https://www.audi.de", "electric" : 0},
        {"URLs" : "https://ww.audo.de", "electric" : 0},
        {"URLs" : "NaN", "electric" : 0}]

def contains_keywords(link, keywords):
    try:
        output = requests.get(link).text
        return int(any(x in output for x in keywords))
    except:
        return "Wrong/Missing URL"

df = pd.DataFrame(data)
mykeywords = ('car', 'vehicle', 'automobile')
df['extra_column'] = df.URLs.apply(lambda l: contains_keywords(l, mykeywords))

As you can see, I request the URLs stored in df.URLs, search each page for the keywords in mykeywords, and store the binary result in extra_column. The script basically produces the following result:

#                            URLs  electric       extra_column
# 0  https://www.mercedes-benz.de         1                  1
# 1           https://www.audi.de         0                  1
# 2            https://ww.audo.de         0                  0
# 3                           NaN         0  Wrong/Missing URL

So far I only know whether any keyword was found. But I would like to know which keywords were found, without running contains_keywords() separately for each keyword in mykeywords. Is there a way to create a new column for every keyword and store the result (1 = keyword found) in the DataFrame? In other words, I need an extra column in df for each keyword.

1 answer:

Answer 0: (score: 1)

import pandas as pd
import requests


data = [{"URLs" : "https://www.mercedes-benz.de", "electric" : 1},
        {"URLs" : "https://www.audi.de", "electric" : 0}, 
        {"URLs" : "https://ww.audo.de", "electric" : 0}, 
        {"URLs" : "NaN", "electric" : 0}]


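def contains_keywords(link, keyword):
    """Return 1 if keyword occurs in the page at link, 0 if not, or an error marker."""
    try:
        output = requests.get(link).text
        return int(keyword in output)
    except requests.exceptions.RequestException:
        # Covers malformed URLs such as "NaN" as well as connection failures.
        return "Wrong/Missing URL"


df = pd.DataFrame(data)
mykeywords = ('car', 'vehicle', 'automobile')
# One new column per keyword; note that every URL is requested once per keyword.
for keyword in mykeywords:
    df[keyword] = df.URLs.apply(lambda l: contains_keywords(l, keyword))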
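
The loop above downloads every page once per keyword. A possible variation, sketched here under assumptions rather than taken from the answer, fetches each URL a single time and derives all keyword columns from that one response. The helper names fetch_text and add_keyword_columns and the fetch parameter are hypothetical; the "Wrong/Missing URL" marker and the URLs column follow the question.

```python
import pandas as pd
import requests


def fetch_text(link, timeout=10):
    """Return the page body, or None if the request fails (hypothetical helper)."""
    try:
        return requests.get(link, timeout=timeout).text
    except requests.exceptions.RequestException:
        # Malformed URLs such as "NaN" and unreachable hosts both end up here.
        return None


def add_keyword_columns(df, keywords, fetch=fetch_text):
    """Fetch each URL once and append one 1/0 column per keyword."""
    rows = []
    for link in df["URLs"]:
        text = fetch(link)
        if text is None:
            # Same marker as in the question, repeated for every keyword column.
            rows.append({k: "Wrong/Missing URL" for k in keywords})
        else:
            # Plain substring test, exactly like `keyword in output` above.
            rows.append({k: int(k in text) for k in keywords})
    flags = pd.DataFrame(rows, index=df.index, columns=list(keywords))
    return df.join(flags)
```

With the data from the question, df = add_keyword_columns(pd.DataFrame(data), mykeywords) should yield the same columns as the loop while issuing one request per row. The fetch argument exists only so the function can be tried with a stand-in fetcher instead of the network. Because the check is a substring match, 'car' would also match inside longer words such as 'carbon'.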