'std::wstring_convert' to convert as much as possible (from a UTF8 file-read chunk)

Question:

I am reading text from a UTF-8 text file, and doing it in chunks to increase performance.

std::ifstream.read(myChunkBuff_str, myChunkBuff_str.length())

<a href="https://andschwa.com/post/efficient-chunked-file-reading-in-c/" rel="nofollow">Here is a more detailed example</a>

I am getting around 16 thousand characters with each chunk. My next step is to convert this std::string into something that lets me work on these "complex characters" individually, that is, to convert the std::string into a std::wstring.

I am using the following function for converting, <a href="https://stackoverflow.com/a/51356708/9007125" rel="nofollow">taken from here:</a>

    #include <string>
    #include <codecvt>
    #include <locale>

    std::string narrow(const std::wstring& wide_string)
    {
        std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> convert;
        return convert.to_bytes(wide_string);
    }

    std::wstring widen(const std::string& utf8_string)
    {
        std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> convert;
        return convert.from_bytes(utf8_string);
    }

However, at the end of the chunk one of the Russian characters might be cut off, and the conversion will fail with a std::range_error exception.

For example, in UTF-8 "привет" takes 12 bytes and "приве" takes 10 bytes, because each of these Cyrillic letters is encoded as 2 bytes. So, if my chunk were hypothetically 11 bytes long, the 'т' would be partially missing, and the conversion would throw an exception.

<strong>Question:</strong>

How can I detect such a partially loaded character ('т' in this case)? This would allow me to convert the chunk without it, and perhaps start the next chunk a bit earlier than planned, so that the problematic 'т' is included next time.

I don't want to wrap these functions in try/catch, as that might slow the program down. It also doesn't tell me how many bytes of the character were missing for the conversion to actually succeed.

I know about wstring_convert::converted(), but it's not really useful if my program crashes before I get to it.

Answer1:

You could do this using a couple of functions. UTF-8 lets you detect the first byte of a multibyte character and, from that first byte, the total size of the character in bytes.

So two functions:

    // returns zero if this is the first byte of a UTF-8 char
    // otherwise non-zero.
    static unsigned is_continuation(char c)
    {
        return (c & 0b10000000) && !(c & 0b01000000);
    }

    // if c is the *first* byte of a UTF-8 multibyte character, returns
    // the total number of bytes of the character.
    static unsigned size(const unsigned char c)
    {
        constexpr static const char u8char_size[] =
        {
              1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
            , 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
            , 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
            , 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
            , 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
            , 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
            , 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
            , 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
            , 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
            , 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
            , 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
            , 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
            , 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2
            , 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2
            , 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3
            , 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 0, 0
        };
        return u8char_size[(unsigned char)c];
    }

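You could track back from the end of your buffer until is_continuation(c) is <strong>false</strong>; that byte is the first byte of the last character. Then check whether size(c) for that byte reaches past the end of the buffer. If it does, the last character is incomplete and should be left for the next chunk.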
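The back-tracking step can be sketched roughly as follows. This is only an illustration under assumptions: complete_prefix_length, is_continuation_byte and sequence_length are hypothetical names that are not part of the answer's code, and the length is computed with bit tests instead of the lookup table so that the snippet stands alone. It assumes the chunk is otherwise well-formed UTF-8 and leaves anything invalid for the converter to report.

```cpp
#include <cstddef>
#include <string>

// True for bytes of the form 10xxxxxx (UTF-8 continuation bytes).
static bool is_continuation_byte(unsigned char c)
{
    return (c & 0xC0) == 0x80;
}

// Length announced by a lead byte (1 to 4), or 0 if `c` cannot start a character.
static std::size_t sequence_length(unsigned char c)
{
    if (c < 0x80) return 1;
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    if ((c & 0xF8) == 0xF0) return 4;
    return 0;
}

// Hypothetical helper: number of leading bytes of `chunk` that form whole
// UTF-8 characters. Only the last character is inspected.
std::size_t complete_prefix_length(const std::string& chunk)
{
    std::size_t i = chunk.size();
    std::size_t steps = 0;

    // Step back over at most three trailing continuation bytes.
    while (i > 0 && steps < 3 &&
           is_continuation_byte(static_cast<unsigned char>(chunk[i - 1]))) {
        --i;
        ++steps;
    }
    if (i == 0) return chunk.size();  // no lead byte found, nothing to trim

    const std::size_t lead = i - 1;   // index of the last lead byte
    const std::size_t need = sequence_length(static_cast<unsigned char>(chunk[lead]));
    const std::size_t have = chunk.size() - lead;

    // If the last character announces more bytes than are present, cut it off.
    return need > have ? lead : chunk.size();
}
```

With the 11-byte chunk from the question ("приве" plus the first byte of 'т'), this should return 10, so only the first 10 bytes go to the converter and the remaining byte is prepended to the next chunk.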

Disclaimer: the last time I looked, these functions were working, but I have not used them in a while.

<strong>Edit:</strong> to add.

If you feel like doing the whole thing manually, I may as well post the code to convert a UTF-8 multibyte character to a UTF-16 multibyte character or a UTF-32 char.

<strong>UTF-32</strong> is easy:

    // returns a UTF-32 char from a `UTF-8` multibyte
    // character pointed to by cp
    static char32_t char32(const char* cp)
    {
        auto sz = size(*cp); // function above

        if(sz == 1)
            return *cp;

        char32_t c32 = (0b01111111 >> sz) & (*cp);

        for(unsigned i = 1; i < sz; ++i)
            c32 = (c32 << 6) | (cp[i] & 0b0011'1111);

        return c32;
    }
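To show how a decoder like this could be applied to a whole chunk, here is a rough, self-contained sketch. decode_utf8 is a hypothetical name, not part of the answer's code. It repeats the same bit arithmetic inline, adds a bounds check, and assumes the input is otherwise well-formed UTF-8 (it does not reject overlong encodings or surrogate values).

```cpp
#include <cstddef>
#include <string>

// Decodes UTF-8 bytes into UTF-32 code points. Stops at the first byte that
// cannot start a character, or when the last character is cut off.
std::u32string decode_utf8(const std::string& bytes)
{
    std::u32string out;
    std::size_t i = 0;

    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;
        char32_t cp = 0;

        // The lead byte gives the length and the top bits of the code point.
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else break;  // stray continuation byte or invalid lead byte

        if (len > bytes.size() - i) break;  // character cut off at the end

        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i + k]) & 0x3F);

        out.push_back(cp);
        i += len;
    }
    return out;
}
```

For the two bytes 0xD1 0x82 ('т') this yields the single code point U+0442.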

<strong>UTF-16</strong> is a little more tricky:

    #include <array>

    // UTF-16 characters can be 1 or 2 code units wide...
    using char16_pair = std::array<char16_t, 2>;

    // outputs a UTF-16 char in cp16 from a `UTF-8` multibyte
    // character pointed to by cp
    //
    // returns the number of code units in this `UTF-16` character
    // (1 or 2).
    static unsigned char16(const char* cp, char16_pair& cp16)
    {
        char32_t c32 = char32(cp);

        // anything outside the surrogate range and below 0x10000
        // fits in a single code unit
        if(c32 < 0xD800 || (c32 > 0xDFFF && c32 < 0x10000))
        {
            cp16[0] = char16_t(c32);
            cp16[1] = 0;
            return 1;
        }

        // otherwise split the value into a high and a low surrogate
        c32 -= 0x010000;

        cp16[0] = ((0b1111'1111'1100'0000'0000 & c32) >> 10) + 0xD800;
        cp16[1] = ((0b0000'0000'0011'1111'1111 & c32) >> 00) + 0xDC00;

        return 2;
    }
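If the goal is to build up a whole UTF-16 string rather than one pair at a time, the same surrogate arithmetic can be wrapped so that it appends to a std::u16string. This is a sketch with a hypothetical name, append_utf16. Unlike char16 above, it takes an already decoded code point and refuses values that UTF-16 cannot encode.

```cpp
#include <string>

// Appends one Unicode code point to `out` as UTF-16.
// Returns the number of code units written (1 or 2), or 0 if `cp` is a
// surrogate value or lies above U+10FFFF and therefore cannot be encoded.
unsigned append_utf16(char32_t cp, std::u16string& out)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  // reserved for surrogates
    if (cp > 0x10FFFF) return 0;                 // outside the Unicode range

    if (cp < 0x10000) {
        // Basic Multilingual Plane: one code unit holds the value directly.
        out.push_back(static_cast<char16_t>(cp));
        return 1;
    }

    // Supplementary planes: split the 20-bit offset into two surrogates.
    cp -= 0x10000;
    out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));    // high surrogate
    out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));  // low surrogate
    return 2;
}
```

For example, U+0442 ('т') is appended as the single unit 0x0442, while U+1F600 becomes the pair 0xD83D 0xDE00.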
