Golang : Normalize Unicode strings for comparison
Here is a tutorial on how to normalize Unicode strings and compare them properly. I suggest reading https://blog.golang.org/normalization at least once before going further, to get a better grasp of the problem.
Normalization of Unicode strings plays an important role in ensuring that your program processes user input properly. It is also useful as a sanity check on incoming data, to make sure the underlying representations match before any further string manipulation. The same visible text can be encoded in more than one way, and comparing strings with different representations can give inaccurate results.
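To make "different representations" concrete, here is a small sketch that uses only the standard library to dump the UTF-8 bytes of two strings that render the same. The helper name hexBytes is made up for this illustration and is not part of any package.

```go
package main

import "fmt"

// hexBytes (hypothetical helper) returns the UTF-8 bytes of s
// as space-separated hexadecimal values.
func hexBytes(s string) string {
	return fmt.Sprintf("% x", s)
}

func main() {
	composed := "ElNi\u00f1o"    // ñ as the single code point U+00F1
	decomposed := "ElNin\u0303o" // n followed by combining tilde U+0303

	fmt.Println(hexBytes(composed))
	fmt.Println(hexBytes(decomposed))

	// A plain comparison looks at the bytes, so it reports false.
	fmt.Println(composed == decomposed)
}
```

The two byte sequences differ around the ñ (c3 b1 versus 6e cc 83), which is why a byte-wise comparison fails even though both strings look the same on screen.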
In this example, comparing the two non-normalized Unicode strings results in a mismatch. The cause is not simply the difference in length: the underlying representations differ. Passing both strings through a transform.Chain() transformer that decomposes them (NFD), removes the nonspacing marks and recomposes them (NFC) produces new strings with matching underlying representations. Note that this chain also strips the diacritic, so ñ becomes n.
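package main

import (
	"fmt"
	"strings"
	"unicode"

	"golang.org/x/text/transform"
	"golang.org/x/text/unicode/norm"
)

// isMn reports whether r is a nonspacing mark (Unicode category Mn),
// such as the combining tilde U+0303.
func isMn(r rune) bool {
	return unicode.Is(unicode.Mn, r)
}

func main() {
	str1 := "ElNi\u00f1o"   // precomposed ñ (U+00F1)
	str2 := "ElNin\u0303o" // n followed by combining tilde (U+0303)

	fmt.Printf("%s length is %d \n", str1, len(str1))
	fmt.Printf("%s length is %d \n", str2, len(str2))

	match := strings.EqualFold(str1, str2)
	fmt.Println(match)

	fmt.Println("Normalizing unicode strings....")

	// Decompose (NFD), drop nonspacing marks, then recompose (NFC).
	t := transform.Chain(norm.NFD, transform.RemoveFunc(isMn), norm.NFC)

	// Print the length of the normalized strings, not the originals.
	normStr1, _, _ := transform.String(t, str1)
	fmt.Printf("%s length is %d \n", normStr1, len(normStr1))

	normStr2, _, _ := transform.String(t, str2)
	fmt.Printf("%s length is %d \n", normStr2, len(normStr2))

	match2 := strings.EqualFold(normStr1, normStr2)
	fmt.Println(match2)
}
Output:
ElNiño length is 7
ElNiño length is 8
false
Normalizing unicode strings....
ElNino length is 6
ElNino length is 6
true
Happy coding!
NOTES: Why do the two original strings have different lengths? len() counts bytes, not characters. In str1 the ñ is the single code point \u00f1, which takes 2 bytes in UTF-8. In str2 it is a plain n (1 byte) followed by the combining tilde \u0303 (2 bytes). After the transform removes the marks, both strings are the plain ASCII text ElNino and are 6 bytes long. Also note that transform.RemoveFunc is marked as deprecated in newer releases of golang.org/x/text, which recommend runes.Remove instead.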
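The difference between byte length and character count can be checked with the standard library alone. The sketch below is an illustration, not part of the original program, and the helper name counts is made up for it. It reports both numbers for the two input strings.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts (hypothetical helper) returns the byte length of s
// and the number of Unicode code points it contains.
func counts(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	for _, s := range []string{"ElNi\u00f1o", "ElNin\u0303o"} {
		b, r := counts(s)
		// %+q prints the string with non-ASCII characters escaped,
		// so the two encodings are visible in the output.
		fmt.Printf("%+q: %d bytes, %d code points\n", s, b, r)
	}
}
```

The precomposed string has 7 bytes and 6 code points, while the decomposed one has 8 bytes and 7 code points, so neither count can be relied on to tell whether two strings show the same text.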
References:
https://socketloop.com/tutorials/golang-strings-comparison
https://blog.golang.org/normalization
http://stackoverflow.com/questions/26722450/remove-diacritics-using-go
See also : Golang : Strings comparison
By Adam Ng