Question

I have a dataframe that looks like this:

Sentence                                                           bin_class
"i wanna go to sleep. too late to take seroquel."                      1
"Adam and Juliana are leaving me for 43 days take me with youuuu!"     0

I also have a list of regex patterns that I want to use on these sentences. I want to run re.search with every pattern in my list against every sentence in the dataframe, and create a new column that holds a 1 if any pattern matches and a 0 otherwise. I have been able to run the patterns against the sentences to build a list of matches, but I am not sure how to turn that into a new column on the dataframe.

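import re

# Collect a (sentence, pattern) pair for every pattern that matches a sentence.
matches = []
for x in df['Sentence']:  # the column is named 'Sentence', with a capital S
    for i in regex:
        match = re.search(i, x)
        if match:
            matches.append((x, i))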

Answer 1

You can probably use the str.count string method. A small example:

In [25]: df
Out[25]:
                                            Sentence  bin_class
0    i wanna go to sleep. too late to take seroquel.          1
1  Adam and Juliana are leaving me for 43 days ta...          0

In [26]: df['Sentence'].str.count(pat='to')
Out[26]:
0    3
1    0
Name: Sentence, dtype: int64

This method also accepts a regex pattern. If you just want to know whether the pattern occurs, and not how many times, str.contains is probably enough:

In [27]: df['Sentence'].str.contains(pat='to')
Out[27]:
0     True
1    False
Name: Sentence, dtype: bool
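
Both methods treat the pattern as a regular expression, so the plain string 'to' above can be swapped for a real regex. Here is a minimal sketch, assuming the two-row dataframe from the question; the patterns `\bto\b` and `\d+` are illustrative choices, not taken from the question's list:

```python
import pandas as pd

# Rebuild the example dataframe from the question.
df = pd.DataFrame({
    "Sentence": [
        "i wanna go to sleep. too late to take seroquel.",
        "Adam and Juliana are leaving me for 43 days take me with youuuu!",
    ],
    "bin_class": [1, 0],
})

# \bto\b matches "to" only as a whole word, so the "to" inside "too" is skipped.
word_counts = df["Sentence"].str.count(r"\bto\b")

# \d+ matches one or more digits; only the second sentence contains a number.
has_digits = df["Sentence"].str.contains(r"\d+", regex=True)

print(word_counts.tolist())  # [2, 0]
print(has_digits.tolist())   # [False, True]
```

Note that the whole-word count for the first sentence is 2 rather than 3, because "too" no longer matches.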

With this, you can loop through your regex patterns and add a new column for each one using either of the methods above.
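
A rough sketch of that loop, assuming the dataframe from the question. The contents of `regex` and the column names `match_0`, `match_1`, ... and `regex_match` are hypothetical, chosen only for illustration. Since the question asks for a single 0/1 column, the sketch also joins the patterns with `|` so that one str.contains call flags a sentence if any pattern matches:

```python
import pandas as pd

df = pd.DataFrame({
    "Sentence": [
        "i wanna go to sleep. too late to take seroquel.",
        "Adam and Juliana are leaving me for 43 days take me with youuuu!",
    ],
    "bin_class": [1, 0],
})

# Hypothetical pattern list; substitute your own.
regex = [r"\bsleep\b", r"seroquel", r"\d{3,}"]

# Option 1: one 0/1 column per pattern.
for n, pat in enumerate(regex):
    df[f"match_{n}"] = df["Sentence"].str.contains(pat, regex=True).astype(int)

# Option 2: a single 0/1 column that is 1 if any pattern matches.
# Each pattern is wrapped in a non-capturing group so the alternation
# does not change the meaning of the individual patterns.
combined = "|".join(f"(?:{pat})" for pat in regex)
df["regex_match"] = df["Sentence"].str.contains(combined, regex=True).astype(int)

print(df[["bin_class", "regex_match"]])
```

If the Sentence column can contain missing values, str.contains returns NaN for those rows; passing `na=False` treats them as non-matches before the conversion to int.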

See the documentation on this for more examples: http://pandas.pydata.org/pandas-docs/stable/text.html#testing-for-strings-that-match-or-contain-a-pattern

Pandas add column to df based on list of regex patterns
