Try this pattern, @"(\S.+?[.!?])(?=\s+|$)":
string str = @"Hello world! How are you? I am fine. This is a difficult sentence because I use I.D.
Newlines should also be accepted. Numbers should not cause sentence breaks, like 1.23.";
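// Requires: using System; using System.Text.RegularExpressions;
// \S.+?[.!?] lazily matches from the first non-whitespace character up to a terminator;
// the lookahead (?=\s+|$) accepts that terminator only when whitespace or the end of input follows.
Regex rx = new Regex(@"(\S.+?[.!?])(?=\s+|$)");
foreach (Match match in rx.Matches(str)) {
    // match.Index holds the sentence's start offset in str, if you need it.
    Console.WriteLine(match.Value);
}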
Results:
Hello world!
How are you?
I am fine.
This is a difficult sentence because I use I.D.
Newlines should also be accepted.
Numbers should not cause sentence breaks, like 1.23.
For more complicated text, of course, you will need a real parser such as SharpNLP (C#) or NLTK (Python). This regex is just a quick-and-dirty approach.
Here is SharpNLP's own description and feature list:
SharpNLP is a collection of natural language processing tools written in C#. Currently it provides the following NLP tools:
- a sentence splitter
- a tokenizer
- a part-of-speech tagger
- a chunker (used to “find non-recursive syntactic annotations such as noun phrase chunks”)
- a parser
- a name finder
- a coreference tool
- an interface to the WordNet lexical database
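To illustrate where the quick-and-dirty regex stops being enough: an abbreviation followed by a space, such as "Dr. Smith", is split after "Dr." because the lookahead only checks for whitespace. The sketch below is one possible workaround, not something from SharpNLP or NLTK. It glues a fragment onto the next match when its last word is in a small abbreviation list. The `Abbreviations` set and the `Split` helper are hypothetical names introduced here for illustration, and the list is nowhere near complete.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class SentenceSplitDemo
{
    // Hypothetical, deliberately tiny abbreviation list (an assumption for this sketch).
    static readonly HashSet<string> Abbreviations =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "Dr.", "Mr.", "Mrs.", "e.g.", "i.e." };

    // Same pattern as above.
    static readonly Regex Rx = new Regex(@"(\S.+?[.!?])(?=\s+|$)");

    static List<string> Split(string text)
    {
        var sentences = new List<string>();
        string pending = "";   // fragments held back because they ended in an abbreviation

        foreach (Match m in Rx.Matches(text))
        {
            // Re-join with a single space; the original whitespace between fragments is not preserved.
            string piece = pending.Length == 0 ? m.Value : pending + " " + m.Value;

            // Last whitespace-free token of the fragment, e.g. "Dr." or "arrived."
            string lastWord = piece.Substring(piece.LastIndexOf(' ') + 1);

            if (Abbreviations.Contains(lastWord))
            {
                pending = piece;          // not a real sentence end, keep accumulating
            }
            else
            {
                sentences.Add(piece);
                pending = "";
            }
        }

        // Text that ends in an abbreviation is still emitted.
        if (pending.Length > 0) sentences.Add(pending);
        return sentences;
    }

    static void Main()
    {
        foreach (string s in Split("Dr. Smith arrived. He was late."))
            Console.WriteLine(s);
    }
}
```

With the plain regex, this input yields three matches ("Dr.", "Smith arrived.", "He was late."); with the merge step it prints "Dr. Smith arrived." and "He was late.". The list-based approach still misfires when a sentence genuinely ends in an abbreviation, which is the kind of case a trained sentence splitter is meant to handle.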