R – Force character vector encoding from “unknown” to “UTF-8” in R

Tags: character-encoding, data.table, r, utf-8

I have a problem with inconsistent encoding of a character vector in R.

The text file from which I read the table is encoded in UTF-8, according to Notepad++ (I tried UTF-8 without BOM, too).

I want to read the table from this text file, convert it to a data.table, set a key, and make use of binary search. When I tried to do so, the following warning appeared:

Warning message:
In `[.data.table`(poli.dt, "żżonymi", mult = "first") :
A known encoding (latin1 or UTF-8) was detected in a join column. data.table compares the bytes currently, so doesn't support
mixed encodings well; i.e., using both latin1 and UTF-8, or if any unknown encodings are non-ascii and some of those are marked known and
others not. But if either latin1 or UTF-8 is used exclusively, and all
unknown encodings are ascii, then the result should be ok. In future
we will check for you and avoid this warning if everything is ok. The
tricky part is doing this without impacting performance for ascii-only
cases.

and binary search does not work.
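
For reference, here is a minimal sketch of the workflow that triggers the warning (the file name "poli.txt" and the column name word stand in for my actual data):

library(data.table)

poli.df <- read.table("poli.txt", header = TRUE, encoding = "UTF-8",
                      stringsAsFactors = FALSE)
poli.dt <- as.data.table(poli.df)    # convert to data.table
setkey(poli.dt, word)                # sort by the key column
poli.dt["żżonymi", mult = "first"]   # keyed lookup (binary search)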

I realised that my data.table key column contains a mix of "unknown" and "UTF-8" encoding marks:

> table(Encoding(poli.dt$word))
unknown   UTF-8 
2061312 2739122 

I tried to convert this column (before creating the data.table object) using:

  • Encoding(word) <- "UTF-8"
  • word <- enc2utf8(word)

but with no effect.

I also tried a few different ways of reading the file into R (setting all the relevant parameters, e.g. encoding = "UTF-8"; see the sketch after this list):

  • data.table::fread
  • utils::read.table
  • base::scan
  • colbycol::cbc.read.table

but with no effect.
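
Concretely, the calls looked similar to these (the file name is again a placeholder). Note that the encoding argument of read.table and scan only marks the strings as UTF-8; it does not re-encode them:

word <- read.table("poli.txt", encoding = "UTF-8",
                   stringsAsFactors = FALSE)$word
word <- scan("poli.txt", what = character(), sep = "\n",
             encoding = "UTF-8")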

==================================================

My R.version:

> R.version
           _                           
platform       x86_64-w64-mingw32          
arch           x86_64                      
os             mingw32                     
system         x86_64, mingw32             
status                                     
major          3                           
minor          0.3                         
year           2014                        
month          03                          
day            06                          
svn rev        65126                       
language       R                           
version.string R version 3.0.3 (2014-03-06)
nickname       Warm Puppy  

My session info:

> sessionInfo()
R version 3.0.3 (2014-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Polish_Poland.1250  LC_CTYPE=Polish_Poland.1250                LC_MONETARY=Polish_Poland.1250
[4] LC_NUMERIC=C                   LC_TIME=Polish_Poland.1250    

base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.9.2 colbycol_0.8     filehash_2.2-2   rJava_0.9-6     

loaded via a namespace (and not attached):
[1] plyr_1.8.1     Rcpp_0.11.1    reshape2_1.2.2 stringr_0.6.2  tools_3.0.3   

Best Answer

The Encoding function returns "unknown" if a character string has a "native encoding" mark (CP-1250 in your case) or if it is pure ASCII. To discriminate between these two cases, call:

library(stringi)
stri_enc_mark(poli.dt$word)

To check whether each string is a valid UTF-8 byte sequence, call:

all(stri_enc_isutf8(poli.dt$word))

If that is not the case, your file is definitely not in UTF-8.

I suspect that you haven't forced UTF-8 mode in the function that reads the data (try inspecting the contents of poli.dt$word to verify this). If my guess is correct, try:

read.csv2(file("filename", encoding="UTF-8"))

or

poli.dt$word <- stri_encode(poli.dt$word, "", "UTF-8") # re-mark encodings
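
Either way, you can verify the fix by re-checking the marks. After a successful conversion, every non-ASCII string should be marked UTF-8; pure-ASCII strings legitimately stay "unknown", which, as the warning above says, is fine:

table(Encoding(poli.dt$word))   # ideally only "unknown" (ASCII) and "UTF-8"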

If data.table still complains about the "mixed" encodings, you may want to transliterate the non-ASCII characters, e.g.:

stri_trans_general("Zażółć gęślą jaźń", "Latin-ASCII")
## [1] "Zazolc gesla jazn"
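
stri_trans_general is vectorized, so the whole key column can be transliterated in one call; keep in mind that stripping diacritics is lossy, so distinct words may collide:

poli.dt$word <- stri_trans_general(poli.dt$word, "Latin-ASCII")
setkey(poli.dt, word)   # re-key after modifying the key column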