Speeding up a loop that filters strings

Time: 2019-06-19 22:47:22

Tags: python pandas loops

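I want to filter a column of a pandas DataFrame that contains tweets (3+ million rows) by dropping the tweets that do not contain a keyword. To do this, I am running the following loop (sorry, I am new to Python):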

filter_word_indicators = []
for i in range(1, len(df)):
    if 'filter_word' in str(df.tweets[0:i]):
        indicator = 1 
    else:
        indicator = 0
    filter_word_indicators.append(indicator)

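The idea is to discard a tweet if its indicator equals 0. The problem is that this loop takes a very long time to run. I am sure there is a better way to drop the tweets that do not contain 'filter_word', but I do not know how to write it. Any help would be great.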

2 answers:

Answer 0 (score: 2)

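Check out pandas.Series.str.contains. You can use it as shown below; note that the leading ~ negates the mask, so this line keeps the rows that do not contain the keyword. Drop the ~ to keep only the tweets that contain it.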

df[~df.tweets.str.contains('filter_word')]

MWE

In [0]: df = pd.DataFrame(
            [[1, "abc"],
             [2, "bce"]],
            columns=["number", "string"]
        )    
In [1]: df
Out[1]: 
   number string
0       1    abc
1       2    bce

In [2]: df[~df.string.str.contains("ab")]
Out[2]: 
   number string
1       2    bce
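
For reference, here is a minimal sketch of both directions of the mask on a toy frame. The column name tweets comes from the question; the sample rows are made up for illustration. It assumes the keyword should be matched as a literal substring (regex=False) and that missing tweets should count as non-matching (na=False).

```python
import pandas as pd

# Toy data, made up for illustration; the real frame has about 3 million rows.
df = pd.DataFrame(
    {"tweets": ["I like filter_word a lot", "nothing here", None, "FILTER_WORD in caps"]}
)

# Boolean mask: True where the tweet contains the keyword.
# regex=False -> literal substring match; na=False -> missing values count as "no match".
mask = df["tweets"].str.contains("filter_word", regex=False, na=False)

kept = df[mask]      # tweets that contain the keyword (what the question asks for)
dropped = df[~mask]  # tweets that do not contain it

# Case-insensitive variant of the same mask.
mask_ci = df["tweets"].str.contains("filter_word", case=False, regex=False, na=False)
```

Building the mask once and indexing with mask or ~mask makes it explicit which side of the filter is being kept.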

Timing

I ran a small timing test on the synthetic DataFrame below, which holds 3 million random tweet-sized strings:

import random
import string

import pandas as pd

df = pd.DataFrame(
    [
        "".join(random.choices(string.ascii_lowercase, k=280))
        for _ in range(3000000)
    ],
    columns=["strings"],
)

With the keyword abc, the test compares the original solution, map + regex, and the solution proposed here (str.contains). The results are as follows.

original       99s
map + regex    21s
str.contains  2.8s
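
The exact benchmark script is not shown, so the following is only a rough sketch of how such a comparison could be set up. It makes several assumptions: the row count is scaled down (N_ROWS is an illustrative value), the "row loop" is a simplified stand-in for the question's loop that reads one string at a time via .iloc, and the function names are hypothetical. Absolute numbers will differ by machine and pandas version.

```python
import random
import re
import string
import time

import pandas as pd

random.seed(0)
N_ROWS = 20_000  # assumption: far fewer than the 3,000,000 rows used in the answer
KEYWORD = "abc"

df = pd.DataFrame(
    ["".join(random.choices(string.ascii_lowercase, k=280)) for _ in range(N_ROWS)],
    columns=["strings"],
)

def row_loop(frame):
    # Simplified stand-in for the question's loop: one string at a time via .iloc.
    col = frame["strings"]
    return [KEYWORD in col.iloc[i] for i in range(len(col))]

def map_regex(frame):
    # Element-wise regex search applied through Series.map.
    pattern = re.compile(KEYWORD)
    return frame["strings"].map(lambda s: bool(pattern.search(s))).tolist()

def str_contains(frame):
    # pandas string method applied to the whole column at once.
    return frame["strings"].str.contains(KEYWORD).tolist()

results, timings = {}, {}
approaches = [("row loop", row_loop), ("map + regex", map_regex), ("str.contains", str_contains)]
for name, func in approaches:
    start = time.perf_counter()
    results[name] = func(df)
    timings[name] = time.perf_counter() - start

for name, seconds in timings.items():
    print(f"{name:<13}{seconds:.3f}s")
```

All three approaches should produce the same flags, which is worth checking before comparing their speed.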

Answer 1 (score: 0)

I created the following example:

df = pd.DataFrame("""Suggested order for Amazon Prime Doctor Who series
Why did pressing the joystick button spit out keypresses?
Why tighten down in a criss-cross pattern?
What exactly is the 'online' in OLAP and OLTP?
How is hair tissue mineral analysis performed?
Understanding the reasoning of the woman who agreed with King Solomon to "cut the baby in half"
Can Ogre clerics use Purify Food and Drink on humanoid characters?
Heavily limited premature compiler translates text into excecutable python code
How many children?
Why are < or > required to use /dev/tcp
Hot coffee brewing solutions for deep woods camping
Minor traveling without parents from USA to Sweden
Non-flat partitions of a set
Are springs compressed by energy, or by momentum?
What is "industrial ethernet"?
What does the hyphen "-" mean in "tar xzf -"?
How long would it take to cross the Channel in 1890's?
Why do all the teams that I have worked with always finish a sprint without completion of all the stories?
Is it illegal to withhold someone's passport and green card in California?
When to remove insignificant variables?
Why does Linux list NVMe drives as /dev/nvme0 instead of /dev/sda?
Cut the gold chain
Why do some professors with PhDs leave their professorships to teach high school?
"How can you guarantee that you won't change/quit job after just couple of months?" How to respond?""".split('\n'), columns = ['Sentence'])

You can write a simple function using a regular expression (case-insensitive, so it is more flexible about capitalization):

import re

def tweetsFilter(s, keyword):
    return bool(re.search(keyword, s, flags=re.IGNORECASE))

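This function can be called to obtain a boolean Series marking the rows that contain a given keyword. Using map may speed up the script (you need to test this!):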

keyword = 'Why'
sel = df.Sentence.map(lambda x: tweetsFilter(x, keyword))
df[sel]

This gives:

    Sentence
1   Why did pressing the joystick button spit out ...
2   Why tighten down in a criss-cross pattern?
9   Why are < or > required to use /dev/tcp
17  Why do all the teams that I have worked with a...
20  Why does Linux list NVMe drives as /dev/nvme0 ...
22  Why do some professors with PhDs leave their p...
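
One caveat: the function above inserts the keyword into the pattern unchanged, so a keyword containing regex metacharacters (such as ? or +) is interpreted as a pattern instead of literal text. A minimal sketch of one way around this, assuming literal matching is what is wanted: escape the keyword with re.escape and compile the pattern once. The helper name make_keyword_filter and the three sample sentences (taken from the example frame) are for illustration only.

```python
import re

import pandas as pd

def make_keyword_filter(keyword):
    # Hypothetical helper: escape the keyword so it is matched literally,
    # and compile the case-insensitive pattern once instead of once per row.
    pattern = re.compile(re.escape(keyword), flags=re.IGNORECASE)
    return lambda s: bool(pattern.search(s))

df = pd.DataFrame(
    {
        "Sentence": [
            "Why tighten down in a criss-cross pattern?",
            "How many children?",
            "Cut the gold chain",
        ]
    }
)

# "?" is a regex metacharacter; escaped, it matches a literal question mark.
sel = df["Sentence"].map(make_keyword_filter("?"))
matches = df[sel]

# The same literal, case-insensitive test with the vectorized pandas method.
sel_vectorized = df["Sentence"].str.contains("?", case=False, regex=False)
```

For plain substring checks, the str.contains form with regex=False avoids the escaping question altogether.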