Data.table, logical comparison and encoding bugs/errors in non-English environment

Question:

data.table gives a warning on a merge even when the encodings of the join columns are known and not mixed. The only time a merge gives no warning is when the encoding is unknown on both columns. This doesn't seem right: a logical comparison (==) seems to behave differently and to ignore the encoding.

I have two questions. First, why does data.table behave this way when both encodings are known and identical? Judging by the warning text, I guess it's a bug (albeit a small one).

The failure of the last merge is perhaps the desired behavior, but shouldn't the logical comparison then fail as well? That brings me to the second question: what is the difference between a data.table join and a logical comparison, given that they give different results in my last merge?

Logical comparisons seem more robust in the face of encoding issues.

Code and reproducible output are below, followed by the sessionInfo() output.

    library("data.table")

    d.tst <- data.table(Nr = c("ÅÄÖ", "ÄÖR"))
    d.tst2 <- data.table(Nr2 = c("ÅÄÖ", "ÄÖR"), Dat = c(1, 2))

    Encoding(d.tst$Nr)
    # [1] "latin1" "latin1"
    Encoding(d.tst2$Nr2)
    # [1] "latin1" "latin1"

    d.tst[1]$Nr == d.tst2[1]$Nr2
    # [1] TRUE

    a <- merge(d.tst, d.tst2, all.x=TRUE, by.x = "Nr", by.y = "Nr2")

Warning message: In bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch, : A known encoding (latin1 or UTF-8) was detected in a join column. data.table compares the bytes currently, so doesn't support *mixed* encodings well; i.e., using both latin1 and UTF-8, or if any unknown encodings are non-ascii and some of those are marked known and others not. But if either latin1 or UTF-8 is used exclusively, and all unknown encodings are ascii, then the result should be ok. In future we will check for you and avoid this warning if everything is ok. The tricky part is doing this without impacting performance for ascii-only cases.

    d.tst$Nr <- iconv(d.tst$Nr, "LATIN1", "UTF-8")
    d.tst2$Nr2 <- iconv(d.tst2$Nr2, "LATIN1", "UTF-8")

    Encoding(d.tst$Nr)
    # [1] "UTF-8" "UTF-8"
    Encoding(d.tst2$Nr2)
    # [1] "UTF-8" "UTF-8"

    a <- merge(d.tst, d.tst2, all.x=TRUE, by.x = "Nr", by.y = "Nr2")

Warning message: (same bmerge warning as above)

    d.tst$Nr <- iconv(d.tst$Nr, "UTF-8", "cp1252")
    d.tst2$Nr2 <- iconv(d.tst2$Nr2, "UTF-8", "cp1252")

    Encoding(d.tst$Nr)
    # [1] "unknown" "unknown"
    Encoding(d.tst2$Nr2)
    # [1] "unknown" "unknown"

    a <- merge(d.tst, d.tst2, all.x=TRUE, by.x = "Nr", by.y = "Nr2")

    # Here we change the encoding on only one data.table
    d.tst$Nr <- iconv(d.tst$Nr, "cp1252", "UTF-8")

    # Check encoding
    Encoding(d.tst$Nr)
    # [1] "UTF-8" "UTF-8"
    Encoding(d.tst2$Nr2)
    # [1] "unknown" "unknown"

    # Logical comparison
    d.tst[1]$Nr == d.tst2[1]$Nr2
    # [1] TRUE

    # This merge fails completely, not just a warning, even if logic says they are the same
    a <- merge(d.tst, d.tst2, all.x=TRUE, by.x = "Nr", by.y = "Nr2")

Warning message: (same bmerge warning as above)

    sessionInfo()
    R version 3.3.1 (2016-06-21)
    Platform: x86_64-w64-mingw32/x64 (64-bit)
    Running under: Windows >= 8 x64 (build 9200)

    locale:
    [1] LC_COLLATE=Swedish_Sweden.1252  LC_CTYPE=Swedish_Sweden.1252  LC_MONETARY=Swedish_Sweden.1252  LC_NUMERIC=C
    [5] LC_TIME=Swedish_Sweden.1252

    attached base packages:
    [1] stats  graphics  grDevices  utils  datasets  methods  base

    other attached packages:
    [1] data.table_1.9.6  RODBC_1.3-13

    loaded via a namespace (and not attached):
    [1] magrittr_1.5  R6_2.1.2  assertthat_0.1  DBI_0.4-1  tools_3.3.1  tibble_1.1  Rcpp_0.12.5  chron_2.3-47

Answer1:

As of the new data.table version 1.9.8 this should be fixed.

For example:

    # This merge fails completely, not just a warning, even if logic says they are the same
    a <- merge(d.tst, d.tst2, all.x=TRUE, by.x = "Nr", by.y = "Nr2")

The above code failed for me (given my system settings) in 1.9.6. As of 1.9.8 it works as it should.
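For background on why == and the 1.9.6 join could disagree: the warning text itself says data.table compared the bytes at that time, whereas base R's == translates strings to a common encoding when their marked encodings differ. Below is a minimal base R sketch of that difference. The variable names are illustrative, and a latin1-marked string stands in for the native cp1252 string from the question so that the example does not depend on the locale.

```r
# The same text, "ÅÄÖ", stored in two different encodings.
x_utf8   <- "\u00c5\u00c4\u00d6"                           # \u escapes give a UTF-8 marked string
x_latin1 <- iconv(x_utf8, from = "UTF-8", to = "latin1")   # same text, marked latin1

Encoding(x_utf8)     # "UTF-8"
Encoding(x_latin1)   # "latin1"

# `==` compares the text: differently marked strings are translated first.
same_text <- x_utf8 == x_latin1                            # TRUE

# A byte-wise comparison sees two different byte sequences.
charToRaw(x_utf8)    # c3 85 c3 84 c3 96
charToRaw(x_latin1)  # c5 c4 d6
same_bytes <- identical(charToRaw(x_utf8), charToRaw(x_latin1))  # FALSE
```

So a join that matches on raw bytes can miss rows that == reports as equal, which is consistent with the behaviour described in the question.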

So this should be solved now.
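If upgrading is not an option, one possible workaround is to convert both join columns to a single encoding before merging, so that equal text is stored as equal bytes. The sketch below uses base R's enc2utf8() for this. The helper to_utf8_key() is a hypothetical name, not part of data.table, the latin1-marked column stands in for the cp1252 column from the question, and I have not verified this against 1.9.6 itself (there the merge may still emit the warning even when the match succeeds).

```r
library(data.table)

# Hypothetical helper (not part of data.table): convert a character
# vector to UTF-8 so that equal text is stored as equal bytes.
to_utf8_key <- function(x) enc2utf8(as.character(x))

utf8_text   <- c("\u00c5\u00c4\u00d6", "\u00c4\u00d6R")          # marked UTF-8
latin1_text <- iconv(utf8_text, from = "UTF-8", to = "latin1")   # marked latin1

d.tst  <- data.table(Nr = utf8_text)
d.tst2 <- data.table(Nr2 = latin1_text, Dat = c(1, 2))

# Normalise both join columns to UTF-8 before merging.
d.tst[, Nr := to_utf8_key(Nr)]
d.tst2[, Nr2 := to_utf8_key(Nr2)]

a <- merge(d.tst, d.tst2, all.x = TRUE, by.x = "Nr", by.y = "Nr2")
```

After the conversion both columns carry the same marked encoding, so every row of d.tst should find its match and Dat should contain no NA values.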
