Checking if a string is UTF-8 compatible for MySQL


We have an older MySQL DB that only supports the UTF-8 charset. Is there a way in Java to detect whether a given string will be UTF-8 compatible?


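    import java.nio.charset.StandardCharsets;

    public static boolean isUTF8MB4(String s) {
        for (int i = 0; i < s.length(); ) {
            // Step by code point: encoding half of a surrogate pair on its own yields '?' (1 byte).
            int chars = Character.charCount(s.codePointAt(i));
            int bytes = s.substring(i, i + chars).getBytes(StandardCharsets.UTF_8).length;
            if (bytes > 3) {
                return true; // needs 4 bytes in UTF-8
            }
            i += chars;
        }
        return false;
    }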

The above implementation seems best, but otherwise:

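    public static boolean isUTF8MB4(String s) {
        for (int i = 0; i < s.length(); ) {
            int codePoint = s.codePointAt(i);
            // charCount is 1 for BMP code points and 2 for supplementary ones,
            // and supplementary code points are the ones that take 4 bytes in UTF-8.
            int chars = Character.charCount(codePoint);
            if (chars > 1) {
                return true;
            }
            i += chars;
        }
        return false;
    }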

which might fail more often.


Every String is UTF-8 compatible. Just set encoding in the database and the MySQL driver correctly and you're set.

The only gotcha is that the length in bytes of the UTF-8 encoded string may be larger than what .length() says. <a href="https://stackoverflow.com/a/8512877/1648987" rel="nofollow">Here's a Java implementation of a function to measure how many bytes a string will take after encoding to UTF-8.</a>
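For a quick illustration of that gotcha, here is a minimal sketch that simply encodes the string and counts the bytes. The class and method names (Utf8Length, utf8Length) are made up for this example, and this is not the implementation from the linked answer; it allocates a temporary byte array, which is usually fine for short strings.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Length {
    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // e-acute (U+00E9) takes 2 bytes, the euro sign (U+20AC) takes 3.
        String s = "h\u00e9llo \u20ac";
        System.out.println(s.length());    // 7 chars
        System.out.println(utf8Length(s)); // 10 bytes
    }
}
```

If a column is sized in bytes rather than characters, comparing utf8Length(s) against the limit is probably the safer check.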

EDIT: As Saqib pointed out, older MySQL's utf8 charset doesn't actually cover all of UTF-8, only its BMP subset. You can check whether a string contains code points outside the BMP with string.length()==string.codePointCount(0,string.length()) ("true" means "all code points are in the BMP") and remove them with string.replaceAll("[^\u0000-\uffff]", "")
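Wrapped into helper methods, the check and the cleanup might look roughly like this. The class and method names (BmpFilter, isBmpOnly, stripNonBmp) are invented for the sketch. Note that an unpaired surrogate counts as a single code point, so isBmpOnly would not flag it.

```java
final class BmpFilter {
    // True when the string contains no surrogate pairs,
    // i.e. every code point lies in the BMP (U+0000..U+FFFF).
    static boolean isBmpOnly(String s) {
        return s.length() == s.codePointCount(0, s.length());
    }

    // Removes every code point above U+FFFF. Java regexes match by code point,
    // so a surrogate pair is removed as a whole.
    static String stripNonBmp(String s) {
        return s.replaceAll("[^\\u0000-\\uFFFF]", "");
    }
}
```

For example, BmpFilter.stripNonBmp("a\uD83D\uDE00b") should return "ab", since the escaped pair is a single supplementary code point (U+1F600).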


MySQL <a href="https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html" rel="nofollow">defines</a>:


The character set named utf8 uses a maximum of three bytes per character and contains only BMP characters.


Therefore this function should work:

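    private boolean isValidUTF8(final String string) {
        for (int i = 0; i < string.length(); i++) {
            final char c = string.charAt(i);
            // A single char is never above U+FFFF, so Character.isBmpCodePoint(c) is always true.
            // Characters outside the BMP are stored as surrogate pairs, so test for surrogates.
            if (Character.isSurrogate(c)) {
                return false;
            }
        }
        return true;
    }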
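A code-point-based variant of the same idea could look like the sketch below, assuming Java 8 or later for String.codePoints(). The names BmpCheck and fitsMysqlUtf8 are hypothetical. Iterating over code points rather than chars matters here, because a char on its own can never be above U+FFFF.

```java
final class BmpCheck {
    // True when every code point is in the BMP, which is what MySQL's
    // 3-byte utf8 charset can store according to the quoted definition.
    static boolean fitsMysqlUtf8(String s) {
        return s.codePoints().allMatch(Character::isBmpCodePoint);
    }
}
```

An unpaired surrogate is reported by codePoints() as its own value, which is below U+FFFF, so this sketch treats it as acceptable; stricter validation would need an extra check.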


