Removing non-ASCII characters from data files

ascii, non-ascii-characters, r, unicode

I've got a bunch of csv files that I'm reading into R and including in a package/data folder in .rdata format. Unfortunately the non-ASCII characters in the data fail the check. The tools package has two functions to check for non-ASCII characters (showNonASCII and showNonASCIIfile) but I can't seem to locate one to remove/clean them.

Before I explore other UNIX tools, it would be great to do this all in R so I can maintain a complete workflow from raw data to final product. Are there any existing packages/functions to help me get rid of the non-ASCII characters?

Best Answer

These days, a slightly better approach is to use the stringi package, which provides a function for general Unicode conversion. Because it transliterates characters instead of dropping them, it preserves as much of the original text as possible:

x <- c("Ekstr\u00f8m", "J\u00f6reskog", "bi\u00dfchen Z\u00fcrcher")
x
#> [1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"

stringi::stri_trans_general(x, "latin-ascii")
#> [1] "Ekstrom"          "Joreskog"         "bisschen Zurcher"
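
If adding stringi as a dependency is not an option, base R's iconv() can do a cruder version of the same job. The following is a minimal sketch, assuming the data are UTF-8 encoded; the helper strip_non_ascii() and the data frame df are hypothetical names used only for illustration. Note that with sub = "" the offending characters are dropped, not transliterated, so "Ekstrøm" ends up as "Ekstrm".

```r
# Hypothetical helper: drop every non-ASCII character from a character vector.
# Assumes the input is (or can be converted to) UTF-8.
strip_non_ascii <- function(x) {
  iconv(enc2utf8(x), from = "UTF-8", to = "ASCII", sub = "")
}

# Hypothetical data frame standing in for one of the imported csv files.
df <- data.frame(
  name = c("Ekstr\u00f8m", "J\u00f6reskog", "bi\u00dfchen Z\u00fcrcher"),
  n = 1:3,
  stringsAsFactors = FALSE
)

# Clean only the character columns; other columns are left untouched.
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], strip_non_ascii)

# Prints nothing if the column no longer contains non-ASCII characters.
tools::showNonASCII(df$name)
```

Running the cleaning step just before save() keeps the whole pipeline, from raw csv to the .rdata file in the package, inside R.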

Related Solutions

C# – How do you strip non-ASCII characters from a string? (in C#)

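using System.Text.RegularExpressions;
string s = "søme string";
// Remove every run of characters outside the ASCII range U+0000 to U+007F.
s = Regex.Replace(s, @"[^\u0000-\u007F]+", string.Empty);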

Regex – (grep) Regex to match non-ASCII characters

This will match a single non-ASCII character:

[^\x00-\x7F]

This is a valid PCRE (Perl-Compatible Regular Expression).

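You can also use the POSIX-style bracket shorthands (the [:ascii:] class is an extension supported by PCRE, not one of the classes defined by POSIX itself):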

[[:ascii:]] - matches a single ASCII char
[^[:ascii:]] - matches a single non-ASCII char

[^[:print:]] will probably suffice for you.
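
Back in R, the same character class can be passed to grepl() and gsub(). This is a minimal sketch, assuming perl = TRUE so that the pattern is handled by PCRE; the variable names are illustrative only.

```r
x <- c("Ekstr\u00f8m", "J\u00f6reskog", "bi\u00dfchen Z\u00fcrcher", "plain ASCII")

# Flag elements that contain at least one non-ASCII character.
has_non_ascii <- grepl("[^[:ascii:]]", x, perl = TRUE)

# Remove every non-ASCII character (no transliteration is attempted).
stripped <- gsub("[^[:ascii:]]", "", x, perl = TRUE)
```

Unlike transliteration, this simply deletes the matched characters, so it suits cases where losing the accented letters is acceptable.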
