| /* |
| Package uniseg implements Unicode Text Segmentation, Unicode Line Breaking, and |
| string width calculation for monospace fonts. Unicode Text Segmentation conforms |
| to Unicode Standard Annex #29 (https://unicode.org/reports/tr29/) and Unicode |
| Line Breaking conforms to Unicode Standard Annex #14 |
| (https://unicode.org/reports/tr14/). |
| |
| In short, using this package, you can split a string into grapheme clusters |
| (what people would usually refer to as a "character"), into words, and into |
| sentences. Or, in its simplest case, this package allows you to count the number |
| of characters in a string, especially when it contains complex characters such |
| as emojis, combining characters, or characters from Asian, Arabic, Hebrew, or |
| other languages. Additionally, you can use it to implement line breaking (or |
| "word wrapping"), that is, to determine where text can be broken over to the |
| next line when the width of the line is not big enough to fit the entire text. |
| Finally, you can use it to calculate the display width of a string for monospace |
| fonts. |
| |
| # Getting Started |
| |
| If you just want to count the number of characters in a string, you can use |
| [GraphemeClusterCount]. If you want to determine the display width of a string, |
| you can use [StringWidth]. If you want to iterate over a string, you can use |
| [Step], [StepString], or the [Graphemes] type (more convenient but less |
| performant). These provide all information at once: grapheme clusters, |
| word boundaries, sentence boundaries, line breaks, and monospace character |
| widths. The specialized functions [FirstGraphemeCluster], |
| [FirstGraphemeClusterInString], [FirstWord], [FirstWordInString], |
| [FirstSentence], and [FirstSentenceInString] can be used if only one type of |
| information is needed. |
| |
| # Grapheme Clusters |
| |
| Consider the rainbow flag emoji: 🏳️🌈. On most modern systems, it appears as one |
| character. But its string representation actually has 14 bytes, so counting |
| bytes (or using len("🏳️🌈")) will not work as expected. Counting runes won't, |
| either: The flag has 4 Unicode code points, thus 4 runes. The stdlib function |
| utf8.RuneCountInString("🏳️🌈") and len([]rune("🏳️🌈")) will both return 4. |
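
The byte and rune counts above can be checked with the standard library alone.
The following is a minimal sketch, not part of this package: rainbowFlag and
byteAndRuneCount are hypothetical names, and the emoji is written with explicit
escapes so that its four code points are visible.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// rainbowFlag spells out the emoji's four code points explicitly:
// U+1F3F3 (white flag), U+FE0F (Variation Selector-16),
// U+200D (zero width joiner), and U+1F308 (rainbow).
const rainbowFlag = "\U0001F3F3\uFE0F\u200D\U0001F308"

// byteAndRuneCount returns the UTF-8 byte length of s and the number of
// code points (runes) it contains. Neither value is a character count.
func byteAndRuneCount(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	nBytes, nRunes := byteAndRuneCount(rainbowFlag)
	fmt.Println(nBytes, nRunes) // prints "14 4"
}
```

Neither number matches the single character a reader sees, which is why a
grapheme cluster count is needed.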
| |
| The [GraphemeClusterCount] function will return 1 for the rainbow flag emoji. |
| The [Graphemes] type and a variety of functions in this package will allow you |
| to split strings into their grapheme clusters. |
| |
| # Word Boundaries |
| |
| Word boundaries are used in a number of different contexts. The most familiar |
| ones are selection (double-click mouse selection), cursor movement ("move to |
| next word" control-arrow keys), and the dialog option "Whole Word Search" for |
| search and replace. This package provides methods for determining word |
| boundaries. |
| |
| # Sentence Boundaries |
| |
| Sentence boundaries are often used for triple-click or some other method of |
| selecting or iterating through blocks of text that are larger than single words. |
| They are also used to determine whether words occur within the same sentence in |
| database queries. This package provides methods for determining sentence |
| boundaries. |
| |
| # Line Breaking |
| |
| Line breaking, also known as word wrapping, is the process of breaking a section |
| of text into lines such that it will fit in the available width of a page, |
| window or other display area. This package provides methods to determine the |
| positions in a string where a line must be broken, may be broken, or must not be |
| broken. |
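
As a rough illustration of the three outcomes named above (must break, may
break, must not break), the sketch below classifies the position after a single
rune with a deliberately naive rule: a break after a newline is mandatory, a
break after a space or tab is allowed, and anything else is not a break
opportunity. This is not the UAX #14 algorithm and not this package's
implementation. The names breakKind, noBreak, mayBreak, mustBreak, and
classifyBreak are hypothetical and exist only in this example.

```go
package main

import "fmt"

// breakKind is a hypothetical three-state result for a position in a string.
type breakKind int

const (
	noBreak   breakKind = iota // the line must not be broken here
	mayBreak                   // the line may be broken here
	mustBreak                  // the line must be broken here
)

// classifyBreak reports which kind of break applies directly after prev.
// Assumption: only newline, space, and tab are considered; real line
// breaking rules look at much more context than one rune.
func classifyBreak(prev rune) breakKind {
	switch prev {
	case '\n':
		return mustBreak
	case ' ', '\t':
		return mayBreak
	default:
		return noBreak
	}
}

func main() {
	// Print the classification for the position after each rune.
	for _, r := range "Hi you\n" {
		fmt.Printf("after %q: %d\n", r, classifyBreak(r))
	}
}
```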
| |
| # Monospace Width |
| |
| Monospace width, as referred to in this package, is the width of a string in a |
| monospace font. This is commonly used in terminal user interfaces or text |
| displays or editors that don't support proportional fonts. A width of 1 |
| corresponds to a single character cell. The C function [wcswidth()] and its |
| implementations in other programming languages are in widespread use for the |
| same purpose. However, there is no standard for the calculation of such widths, |
| and this package differs from wcswidth() in a number of ways, presumably to |
| generate more visually pleasing results. |
| |
| To start, we assume that every code point has a width of 1, with the following |
| exceptions: |
| |
| - Code points with grapheme cluster break properties Control, CR, LF, Extend, |
| and ZWJ have a width of 0. |
| - U+2E3A, Two-Em Dash, has a width of 3. |
| - U+2E3B, Three-Em Dash, has a width of 4. |
| - Characters with the East Asian Width properties "Fullwidth" (F) and "Wide" |
| (W) have a width of 2. (Properties "Ambiguous" (A) and "Neutral" (N) both |
| have a width of 1.) |
| - Code points with grapheme cluster break property Regional Indicator have a |
| width of 2. |
| - Code points with grapheme cluster break property Extended Pictographic have |
| a width of 2, unless their Emoji Presentation flag is "No", in which case |
| the width is 1. |
| |
| For Hangul grapheme clusters composed of conjoining Jamo and for Regional |
| Indicators (flags), all code points except the first one have a width of 0. For |
| grapheme clusters starting with an Extended Pictographic, any additional code |
| point will force a total width of 2, except if the Variation Selector-15 |
| (U+FE0E) is included, in which case the total width is always 1. Grapheme |
| clusters ending with Variation Selector-16 (U+FE0F) have a width of 2. |
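
To make the per-code-point rules in the list above more concrete, here is a
heavily simplified sketch built only on the standard library. It is not this
package's implementation. It approximates "Extend" with the Unicode category
Mn, approximates "Fullwidth" and "Wide" with the Han and Hiragana scripts only,
and ignores Regional Indicators, Extended Pictographic code points, and every
grapheme cluster adjustment described in the preceding paragraph. The names
sketchRuneWidth and sketchStringWidth are hypothetical.

```go
package main

import (
	"fmt"
	"unicode"
)

// sketchRuneWidth returns an approximate monospace width for one code point.
// Assumptions: category Mn stands in for the Extend property, and the Han
// and Hiragana scripts stand in for East Asian Width F and W.
func sketchRuneWidth(r rune) int {
	switch {
	case r == 0x2E3A: // Two-Em Dash
		return 3
	case r == 0x2E3B: // Three-Em Dash
		return 4
	case r == 0x200D, unicode.IsControl(r), unicode.Is(unicode.Mn, r):
		// ZWJ, control characters (including CR and LF), combining marks.
		return 0
	case unicode.Is(unicode.Han, r), unicode.Is(unicode.Hiragana, r):
		return 2
	default:
		return 1
	}
}

// sketchStringWidth sums the approximate widths of all code points in s.
func sketchStringWidth(s string) int {
	total := 0
	for _, r := range s {
		total += sketchRuneWidth(r)
	}
	return total
}

func main() {
	fmt.Println(sketchStringWidth("abc"))     // prints 3
	fmt.Println(sketchStringWidth("e\u0301")) // prints 1 ("e" plus combining acute)
	fmt.Println(sketchStringWidth("世界"))      // prints 4
	fmt.Println(sketchStringWidth("\u2E3A"))  // prints 3
}
```

Because the sketch simply sums code point widths, it will give wrong results
for emoji sequences and flags; use [StringWidth] for real input.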
| |
| Note that whether these widths appear correct depends on your application's |
| render engine, the extent to which it conforms to the Unicode Standard, and its |
| choice of font. |
| |
| [wcswidth()]: https://man7.org/linux/man-pages/man3/wcswidth.3.html |
| */ |
| package uniseg |