/*
Package uniseg implements Unicode Text Segmentation, Unicode Line Breaking, and
string width calculation for monospace fonts. Unicode Text Segmentation conforms
to Unicode Standard Annex #29 (https://unicode.org/reports/tr29/) and Unicode
Line Breaking conforms to Unicode Standard Annex #14
(https://unicode.org/reports/tr14/).

In short, using this package, you can split a string into grapheme clusters
(what people would usually refer to as a "character"), into words, and into
sentences. Or, in its simplest case, this package allows you to count the number
of characters in a string, especially when it contains complex characters such
as emojis, combining characters, or characters from Asian, Arabic, Hebrew, or
other languages. Additionally, you can use it to implement line breaking (or
"word wrapping"), that is, to determine where text can be broken over to the
next line when a line is too narrow to fit the entire text. Finally, you can
use it to calculate the display width of a string for monospace fonts.

# Getting Started

If you just want to count the number of characters in a string, you can use
[GraphemeClusterCount]. If you want to determine the display width of a string,
you can use [StringWidth]. If you want to iterate over a string, you can use
[Step], [StepString], or the [Graphemes] type (more convenient but less
performant). These provide all available information: grapheme clusters, word
boundaries, sentence boundaries, line breaks, and monospace character widths.
The specialized functions [FirstGraphemeCluster],
[FirstGraphemeClusterInString], [FirstWord], [FirstWordInString],
[FirstSentence], and [FirstSentenceInString] can be used if only one type of
information is needed.

# Grapheme Clusters

Consider the rainbow flag emoji: 🏳️‍🌈. On most modern systems, it appears as one
character. But its string representation actually has 14 bytes, so counting
bytes (or using len("🏳️‍🌈")) will not work as expected. Counting runes won't,
either: the flag has 4 Unicode code points, thus 4 runes. The standard library
function utf8.RuneCountInString("🏳️‍🌈") and the expression len([]rune("🏳️‍🌈"))
both return 4.

The [GraphemeClusterCount] function will return 1 for the rainbow flag emoji.
The [Graphemes] type and a variety of functions in this package will allow you
to split strings into their grapheme clusters.

# Word Boundaries

Word boundaries are used in a number of different contexts. The most familiar
ones are selection (double-click mouse selection), cursor movement ("move to
next word" control-arrow keys), and the dialog option "Whole Word Search" for
search and replace. This package provides methods for determining word
boundaries.

# Sentence Boundaries

Sentence boundaries are often used for triple-click or some other method of
selecting or iterating through blocks of text that are larger than single words.
They are also used to determine whether words occur within the same sentence in
database queries. This package provides methods for determining sentence
boundaries.

# Line Breaking

Line breaking, also known as word wrapping, is the process of breaking a section
of text into lines such that it will fit in the available width of a page,
window or other display area. This package provides methods to determine the
positions in a string where a line must be broken, may be broken, or must not be
broken.

# Monospace Width

Monospace width, as referred to in this package, is the width of a string in a
monospace font. This is commonly used in terminal user interfaces or text
displays or editors that don't support proportional fonts. A width of 1
corresponds to a single character cell. The C function [wcswidth()] and its
implementations in other programming languages are in widespread use for the
same purpose. However, there is no standard for the calculation of such widths,
and this package differs from wcswidth() in a number of ways, presumably to
generate more visually pleasing results.

To start, we assume that every code point has a width of 1, with the following
exceptions:

  - Code points with grapheme cluster break properties Control, CR, LF, Extend,
    and ZWJ have a width of 0.
  - U+2E3A, Two-Em Dash, has a width of 3.
  - U+2E3B, Three-Em Dash, has a width of 4.
  - Characters with the East Asian Width properties "Fullwidth" (F) and "Wide"
    (W) have a width of 2. (Properties "Ambiguous" (A) and "Neutral" (N) both
    have a width of 1.)
  - Code points with grapheme cluster break property Regional Indicator have a
    width of 2.
  - Code points with grapheme cluster break property Extended Pictographic have
    a width of 2, unless their Emoji Presentation flag is "No", in which case
    the width is 1.

For Hangul grapheme clusters composed of conjoining Jamo and for Regional
Indicators (flags), all code points except the first one have a width of 0. For
grapheme clusters starting with an Extended Pictographic, any additional code
point will force a total width of 2, except if the Variation Selector-15
(U+FE0E) is included, in which case the total width is always 1. Grapheme
clusters ending with Variation Selector-16 (U+FE0F) have a width of 2.

Note that whether these widths appear correct depends on your application's
rendering engine, the extent to which it conforms to the Unicode Standard, and
its choice of font.

[wcswidth()]: https://man7.org/linux/man-pages/man3/wcswidth.3.html
*/
package uniseg