
Script with utf-8 text runs differently from RStudio and command line in Windows

Question:

I'm working with files containing Hindi text and parsing them. I wrote my code in RStudio and it ran without many issues. Now I need to run the same script from the command line using R.exe/Rscript.exe, and it doesn't behave the same way. I ran this simple script from both RStudio and the terminal:

n_p<-'नाम'
Encoding(n_p)
gregexpr(n_p,c('adfdafc','नाम adsfdfa'))
sessionInfo()

Output in RStudio:

> n_p<-'नाम'
>
> Encoding(n_p)
[1] "UTF-8"
>
> gregexpr(n_p,c('adfdafc','नाम adsfdfa'))
[[1]]
[1] -1
attr(,"match.length")
[1] -1
[[2]]
[1] 1
attr(,"match.length")
[1] 3
> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7600)
Matrix products: default
locale:
[1] LC_COLLATE=English_India.1252  LC_CTYPE=English_India.1252
[3] LC_MONETARY=English_India.1252 LC_NUMERIC=C
[5] LC_TIME=English_India.1252
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
other attached packages:
[1] rJava_0.9-10
loaded via a namespace (and not attached):
[1] compiler_3.5.0 tools_3.5.0

Output with R.exe in cmd (used for debugging; Rscript.exe gives similar, if not identical, output):

> n_p<-'à☼"à☼_à☼r'
>
> Encoding(n_p)
[1] "latin1"
>
> gregexpr(n_p,c('adfdafc','à☼"à☼_à☼r adsfdfa'))
[[1]]
[1] -1
attr(,"match.length")
[1] -1
[[2]]
[1] 1
attr(,"match.length")
[1] 9
> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7600)
Matrix products: default
locale:
[1] LC_COLLATE=English_India.1252  LC_CTYPE=English_India.1252
[3] LC_MONETARY=English_India.1252 LC_NUMERIC=C
[5] LC_TIME=English_India.1252
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
loaded via a namespace (and not attached):
[1] compiler_3.5.0

I've tried changing locales, but Sys.setlocale refuses to work properly. In some cases gregexpr throws an error because it can't parse the non-ASCII text. And when it does run without errors, it doesn't match regular expressions properly. I can't provide a reproducible example at the moment, but I will try to later.

Help.

Answer1:

You need to ensure that R is running in a suitable locale:

When running Rterm, use Sys.getlocale() to find your current locale.

You can set your locale using:

Sys.setlocale(category = "LC_ALL", locale = "hi-IN") # Try "hi-IN.UTF-8" too...

You can find locale names at https://www.science.co.il/language/Locale-codes.php, on MSDN (https://docs.microsoft.com/en-us/cpp/c-runtime-library/language-strings), and at https://ss64.com/locale.html.
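Finding a name that R accepts usually takes some trial and error. Sys.setlocale() returns an empty string (with a warning) when a request cannot be honoured, so candidate names can be tried in turn. The following is only a sketch: the helper name first_working_locale is made up for illustration, and whether either candidate is accepted depends on your Windows and R versions.

```r
# Hypothetical helper: try each candidate locale name in order and return
# the first one that Sys.setlocale() accepts, or NA if none is accepted.
first_working_locale <- function(candidates, category = "LC_ALL") {
  for (loc in candidates) {
    # A failed request gives a warning and returns "".
    res <- suppressWarnings(Sys.setlocale(category, loc))
    if (!is.null(res) && nzchar(res)) {
      return(loc)
    }
  }
  NA_character_
}

# Candidates taken from the Sys.setlocale() call above.
chosen <- first_working_locale(c("hi-IN.UTF-8", "hi-IN"))
print(chosen)           # NA means neither name was accepted
print(Sys.getlocale())  # the locale actually in effect
```

If the result is NA, the session is still in its original locale, so nothing has been changed by the attempt.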

Once you have found a value that works, put the Sys.setlocale() command in your ~/.Rprofile so it is applied to every session.
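If no locale can be made to work, one option that may help is to keep the script itself pure ASCII by writing the Hindi pattern with \u escapes, which R marks as UTF-8 regardless of how the file is read. This is a sketch of an alternative under that assumption, not something tested on the asker's Windows 7 setup; the code points spell the same नाम used in the question.

```r
# Pattern from the question written with Unicode escapes
# (U+0928, U+093E, U+092E), so the source file contains only ASCII.
n_p <- "\u0928\u093e\u092e"

# Build the test vector from the same escaped string; paste() keeps the
# UTF-8 marking when any of its inputs is marked as UTF-8.
x <- c("adfdafc", paste(n_p, "adsfdfa"))

print(Encoding(n_p))                 # encoding mark of the pattern
m <- gregexpr(n_p, x)
print(m[[2]][1])                     # start position of the match in x[2]
print(attr(m[[2]], "match.length"))  # match length in characters, not bytes
```

The same idea applies to text read from files: reading with an explicit encoding (for example the encoding argument of readLines()) avoids relying on the session locale to guess it.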

References

  • https://cran.r-project.org/bin/windows/base/rw-FAQ.html
  • http://withr.me/configure-character-encoding-for-r-under-linux-and-windows/
