Saturday, May 9, 2015

Python regex split: how to put the end of the match back into the string?

I'd like to find a regex that can break up paragraphs (long strings, no newline characters to worry about) into sentences, using the simple rule that any of {., ?, !} followed by a whitespace character and then a capital letter marks the end of a sentence (I realize this is not a good rule for real life).

I've got something partly working, but it doesn't quite do the job:

import re

line = 'a b c FFF! D a b a a FFF. gegtat FFF. A'
matchObj = re.split(r'(.*?\sFFF[\.|\?|\!])\s[A-Z]', line)
print(matchObj)

prints

['', 'a b c FFF!', '', ' a b a a FFF. gegtat FFF.', '']

whereas I'd like to get:

['a b c FFF!', 'D a b a a FFF. gegtat FFF.']

So two questions.

  • Why are there empty members ('') in the results?

  • I understand why the D gets cut out of the split result: it's consumed by the first match. How can I structure my pattern differently so that the capital letter after the punctuation is put back and included with the next sentence? In this case, how can I get D to turn up in the second element of the split result?
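
On the first question, a minimal sketch that may help show where the empty members come from. It assumes the usual re.split behaviour: the result interleaves the text lying between matches with the text captured by each group. The names spans, groups, edges and gaps are introduced here purely for illustration.

```python
import re

line = 'a b c FFF! D a b a a FFF. gegtat FFF. A'
pattern = re.compile(r'(.*?\sFFF[\.|\?|\!])\s[A-Z]')

matches = list(pattern.finditer(line))
spans = [m.span() for m in matches]      # region consumed by each match
groups = [m.group(1) for m in matches]   # captured text that re.split inserts

# Text lying outside the matches: before the first, between them, after the last.
edges = [0] + [i for span in spans for i in span] + [len(line)]
gaps = [line[edges[i]:edges[i + 1]] for i in range(0, len(edges), 2)]

print(spans)   # [(0, 12), (12, 39)]
print(groups)  # ['a b c FFF!', ' a b a a FFF. gegtat FFF.']
print(gaps)    # ['', '', '']
```

The two matches sit back to back and together cover the whole string, so the text before, between and after them is empty each time. Those three empty strings are the '' members in the split result.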

I know I could accomplish this with some sort of for-loop just peeling off the first result, adding back the capital letter and then doing it all over again, but this seems not-so-Pythonic. If regex is not the way to go here, is there something that still avoids the for loop?
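
One possible loop-free approach, offered as a sketch rather than a definitive answer, is to split on the whitespace alone and use lookaround assertions for the punctuation and the capital letter. Lookarounds check their context without consuming it, so the capital letter stays in the string. The character class [.?!] and the variable name sentences are choices made for this example.

```python
import re

line = 'a b c FFF! D a b a a FFF. gegtat FFF. A'

# Split on one whitespace character that is preceded by ., ? or !
# and followed by a capital letter. Neither neighbour is consumed.
sentences = re.split(r'(?<=[.?!])\s(?=[A-Z])', line)
print(sentences)  # ['a b c FFF!', 'D a b a a FFF. gegtat FFF.', 'A']
```

Under this rule the trailing A starts a new sentence, so it shows up as a third element; it would need to be dropped separately if only the first two are wanted.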

Thanks for any suggestions.
