Question

I have some strings I am trying to extract from an HTML file. There are many of these strings (long story) and they generally follow the pattern of "Xsomestuffwrittenhere!", with an "X" at the beginning and "!" at the end.

I have written some messy-as-hell code that extracts most of these from the HTML, but I'm having trouble dealing with cases where there is an "!" in the middle of the section I wish to extract, e.g.:

XWTF!ThatMakesNoSense!

I have been using .find() to get the indexes of the passages and slice them out of each line of the HTML, e.g.:

line[line.find("X"):line.find("!")+1]

Within the HTML file (for context: a Facebook messages transcript) everything is formatted all weird (screw you Zuckerberg), so an X...! can have any sort of text or whatever on either side. I point this out because I have had to add this to my code:

re.search(" ", line[line.find("X"):line.find("!")])

to make sure a later "!" doesn't mess with my indexing. e.g.:

Xsomething! This is a new sentence!

So, the problem I'm having is that I can't work out how to tell an "!" that appears in the middle of a section I want to extract apart from the "!" that ends it.

I guess the basic problem boils down to: how can I find the last instance of stringA before the first instance of stringB (stringB being, in this case, a blank space)?

I hope this is all making sense. And sorry for any hopelessness on my part. I haven't programmed in a year since I did this one Python module, and have come back mainly to do this for a project.

Answer 1

First thing: you shouldn't be parsing HTML via simple string processing; you should try using BeautifulSoup.

Regardless, try something like this:

import re
# "X", then one or more non-whitespace characters, then a closing "!"
matches = re.findall(r'X\S+!', my_input_string)
print(matches)
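
To see why this handles an "!" in the middle, here is a small sketch run against the two example lines from the question. \S+ is greedy, so within a run of non-whitespace characters the match stretches to the last "!" before the first space. The helper names extract_regex and extract_find are made up for illustration; extract_find does the same job with only str.find and str.rfind, which is the literal "last stringA before the first stringB" version. Both assume an extracted passage never contains a space, and both will pick up trailing text if something ending in "!" is glued straight onto the passage with no space in between.

```python
import re

def extract_regex(text):
    # "X", one or more non-whitespace characters, then "!".
    # The greedy \S+ backs off only as far as the last "!" in the run.
    return re.findall(r'X\S+!', text)

def extract_find(line):
    # Hypothetical helper: first passage in the line, using only find/rfind.
    start = line.find("X")
    if start == -1:
        return None
    space = line.find(" ", start)        # first space after the "X"
    if space == -1:
        space = len(line)                # no space: search to end of line
    end = line.rfind("!", start, space)  # last "!" before that space
    if end == -1:
        return None
    return line[start:end + 1]

print(extract_regex("XWTF!ThatMakesNoSense!"))
print(extract_regex("Xsomething! This is a new sentence!"))
print(extract_find("XWTF!ThatMakesNoSense! and more text"))
```

The regex version returns every passage in the string at once, while the find/rfind version only returns the first one per line, so the regex is usually the shorter route.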

Finding last instance of stringA before StringB in Python
