
Question:
We have an older MySQL DB that only supports the UTF-8 charset. Is there a way in Java to detect if a given string will be UTF-8 compatible?
Answer1:
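import java.nio.charset.StandardCharsets;

public static boolean isUTF8MB4(String s) {
    // Step by code point so that a surrogate pair is encoded as one character.
    for (int i = 0; i < s.length(); ) {
        int charCount = Character.charCount(s.codePointAt(i));
        // getBytes returns a byte[], so take its length.
        int bytes = s.substring(i, i + charCount).getBytes(StandardCharsets.UTF_8).length;
        if (bytes > 3) {
            return true;
        }
        i += charCount;
    }
    return false;
}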
The above implementation seems best because it measures the actual UTF-8 encoded length, but an alternative that avoids encoding is:
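public static boolean isUTF8MB4(String s) {
    for (int i = 0; i < s.length(); ) {
        int codePoint = s.codePointAt(i);
        // charCount is 1 for BMP code points and 2 for supplementary ones.
        int charCount = Character.charCount(codePoint);
        if (charCount > 1) {
            // Supplementary code points take 4 bytes in UTF-8.
            return true;
        }
        i += charCount;
    }
    return false;
}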
This relies on the fact that only code points outside the BMP, which Java stores as two chars, need four bytes in UTF-8.
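For comparison, the same check can probably be written more compactly with the String.codePoints() stream (Java 8+). This is only a minimal sketch: the class name Utf8mb4Check and the method name needsUtf8mb4 are made up for illustration, and it assumes that a string needs 4-byte UTF-8 exactly when it contains a supplementary code point.

```java
final class Utf8mb4Check {
    private Utf8mb4Check() {}

    /** Returns true if any code point lies outside the BMP (4 bytes in UTF-8). */
    static boolean needsUtf8mb4(String s) {
        // codePoints() yields whole code points, so surrogate pairs arrive combined.
        return s.codePoints().anyMatch(Character::isSupplementaryCodePoint);
    }
}
```

Calling Utf8mb4Check.needsUtf8mb4(text) before an insert would let you reject or clean the value up front. Note that an unpaired surrogate is not flagged by this check.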
Answer2:
Every String is UTF-8 compatible. Just set the encoding correctly in the database and in the MySQL driver and you're set.
The only gotcha is that the length in bytes of the UTF-8 encoded string may be larger than what .length() says. <a href="https://stackoverflow.com/a/8512877/1648987" rel="nofollow">Here's a Java implementation of a function to measure how many bytes a string will take after encoding to UTF-8.</a>
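To illustrate the difference between .length() and the encoded size, here is a minimal sketch that simply encodes the string and counts the bytes. It is not the linked implementation, just the simplest way to get the number, and it assumes the temporary byte array is acceptable. The names Utf8Length and utf8Length are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Length {
    private Utf8Length() {}

    /** Number of bytes the string occupies once encoded as UTF-8. */
    static int utf8Length(String s) {
        // Encoding allocates a temporary array; fine for occasional checks.
        return s.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

Comparing Utf8Length.utf8Length(value) with the column's byte limit is one way to use it. Unpaired surrogates are replaced by '?' during encoding, so they count as one byte here.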
EDIT: Since Saqib pointed out that older MySQL doesn't actually support full UTF-8, but only its BMP subset, you can check whether a string contains code points outside the BMP with string.length() == string.codePointCount(0, string.length()) ("true" means "all code points are in the BMP") and remove them with string.replaceAll("[^\u0000-\uffff]", "").
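Put together, the check and the cleanup might look like the sketch below. The class and method names (BmpOnly, isBmpOnly, stripNonBmp) are made up for illustration, and the regex escapes are written as \\u so that the regex engine, rather than the Java compiler, interprets them.

```java
final class BmpOnly {
    private BmpOnly() {}

    /** True when every code point fits in a single char, i.e. lies in the BMP. */
    static boolean isBmpOnly(String s) {
        // A supplementary code point uses two chars, so the counts differ.
        return s.length() == s.codePointCount(0, s.length());
    }

    /** Removes every code point above U+FFFF. */
    static String stripNonBmp(String s) {
        // Java regexes match by code point, so a surrogate pair is removed as a unit.
        return s.replaceAll("[^\\u0000-\\uFFFF]", "");
    }
}
```

A typical use would be to call BmpOnly.isBmpOnly(input) first and fall back to BmpOnly.stripNonBmp(input) only when it returns false. Whether silently dropping characters is acceptable depends on the application.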
MySQL <a href="https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html" rel="nofollow">defines</a>:
<blockquote>The character set named utf8 uses a maximum of three bytes per character and contains only BMP characters.
</blockquote>Therefore this function should work:
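private boolean isValidUTF8(final String string) {
    for (int i = 0; i < string.length(); i++) {
        final char c = string.charAt(i);
        // A single char always widens to a BMP code point, so isBmpCodePoint(c)
        // can never fail. Surrogates mark characters outside the BMP instead.
        if (Character.isSurrogate(c)) {
            return false;
        }
    }
    return true;
}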