---
parent: Decision Records
nav_order: 18
---
# Use regular expressions to split multiple-sentence titles

## Context and Problem Statement

Some entry titles consist of multiple sentences, for example "Whose Music? A Sociology of Musical Language". To format such titles correctly in [sentence case](https://en.wiktionary.org/wiki/sentence_case) or [title case](https://en.wiktionary.org/wiki/title_case#English), the title must first be split into sentences, and each sentence must then be processed individually.

## Considered Options

* [Regular expression](https://docs.oracle.com/javase/tutorial/essential/regex/)
* [OpenNLP](https://opennlp.apache.org/)
* [ICU4J](https://web.archive.org/web/20210413013221/http://site.icu-project.org/home)

## Decision Outcome

Chosen option: "Regular expression", because we can use classes built into Java (`Pattern`, `Matcher`) instead of adding dependencies.

### Positive Consequences

* Fewer dependencies on third-party libraries
* Smaller project size (ICU4J is very large)
* No need for model data (OpenNLP is a machine-learning-based toolkit and needs a trained model to work properly)

### Negative Consequences

* Regular expressions cannot cover every case, so splitting may be inaccurate for some titles
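
As a minimal sketch of the chosen approach and of the limitation noted above, the following Java example splits a title with `java.util.regex.Pattern`. The class name `TitleSentenceSplitter` and the boundary pattern are assumptions made for illustration; this record does not specify the actual expression used.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not the project's actual implementation.
class TitleSentenceSplitter {

    // Assumed boundary rule: whitespace that follows '.', '?' or '!'
    // and precedes an uppercase letter.
    private static final Pattern SENTENCE_BOUNDARY =
            Pattern.compile("(?<=[.?!])\\s+(?=\\p{Lu})");

    // Splits the title at every assumed sentence boundary.
    static List<String> split(String title) {
        return Arrays.asList(SENTENCE_BOUNDARY.split(title));
    }

    public static void main(String[] args) {
        // Title taken from the problem statement.
        System.out.println(split("Whose Music? A Sociology of Musical Language"));
        // An abbreviation that this simple rule mistakes for a sentence end.
        System.out.println(split("Dr. Strangelove and the Bomb"));
    }
}
```

With this assumed pattern, the example title is split into "Whose Music?" and "A Sociology of Musical Language", while "Dr. Strangelove and the Bomb" is wrongly split after the abbreviation "Dr.", which is the kind of inaccuracy the negative consequence describes.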