What is the easiest way to match non-ASCII characters in a regex? I would like to match all words individually in an input string, but the language may not be English, so I will need to match things like ü, ö, ß, and ñ. Also, this is in Javascript/jQuery, so any solution will need to apply to that.
Javascript – Regular expression to match non-ASCII characters
javascriptjqueryregex
Related Solutions
Native deep cloning
It's called "structured cloning", works experimentally in Node 11 and later, and hopefully will land in browsers. See this answer for more details.
Fast cloning with data loss - JSON.parse/stringify
If you do not use Date
s, functions, undefined
, Infinity
, RegExps, Maps, Sets, Blobs, FileLists, ImageDatas, sparse Arrays, Typed Arrays or other complex types within your object, a very simple one liner to deep clone an object is:
JSON.parse(JSON.stringify(object))
const a = {
string: 'string',
number: 123,
bool: false,
nul: null,
date: new Date(), // stringified
undef: undefined, // lost
inf: Infinity, // forced to 'null'
re: /.*/, // lost
}
console.log(a);
console.log(typeof a.date); // Date object
const clone = JSON.parse(JSON.stringify(a));
console.log(clone);
console.log(typeof clone.date); // result of .toISOString()
See Corban's answer for benchmarks.
Reliable cloning using a library
Since cloning objects is not trivial (complex types, circular references, function etc.), most major libraries provide function to clone objects. Don't reinvent the wheel - if you're already using a library, check if it has an object cloning function. For example,
- lodash -
cloneDeep
; can be imported separately via the lodash.clonedeep module and is probably your best choice if you're not already using a library that provides a deep cloning function - AngularJS -
angular.copy
- jQuery -
jQuery.extend(true, { }, oldObject)
;.clone()
only clones DOM elements - just library -
just-clone
; Part of a library of zero-dependency npm modules that do just do one thing. Guilt-free utilities for every occasion.
ES6 (shallow copy)
For completeness, note that ES6 offers two shallow copy mechanisms: Object.assign()
and the spread syntax.
which copies values of all enumerable own properties from one object to another. For example:
var A1 = {a: "2"};
var A2 = Object.assign({}, A1);
var A3 = {...A1}; // Spread Syntax
The fully RFC 822 compliant regex is inefficient and obscure because of its length. Fortunately, RFC 822 was superseded twice and the current specification for email addresses is RFC 5322. RFC 5322 leads to a regex that can be understood if studied for a few minutes and is efficient enough for actual use.
One RFC 5322 compliant regex can be found at the top of the page at http://emailregex.com/ but uses the IP address pattern that is floating around the internet with a bug that allows 00
for any of the unsigned byte decimal values in a dot-delimited address, which is illegal. The rest of it appears to be consistent with the RFC 5322 grammar and passes several tests using grep -Po
, including cases domain names, IP addresses, bad ones, and account names with and without quotes.
Correcting the 00
bug in the IP pattern, we obtain a working and fairly fast regex. (Scrape the rendered version, not the markdown, for actual code.)
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
or:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
Here is diagram of finite state machine for above regexp which is more clear than regexp itself
The more sophisticated patterns in Perl and PCRE (regex library used e.g. in PHP) can correctly parse RFC 5322 without a hitch. Python and C# can do that too, but they use a different syntax from those first two. However, if you are forced to use one of the many less powerful pattern-matching languages, then it’s best to use a real parser.
It's also important to understand that validating it per the RFC tells you absolutely nothing about whether that address actually exists at the supplied domain, or whether the person entering the address is its true owner. People sign others up to mailing lists this way all the time. Fixing that requires a fancier kind of validation that involves sending that address a message that includes a confirmation token meant to be entered on the same web page as was the address.
Confirmation tokens are the only way to know you got the address of the person entering it. This is why most mailing lists now use that mechanism to confirm sign-ups. After all, anybody can put down president@whitehouse.gov
, and that will even parse as legal, but it isn't likely to be the person at the other end.
For PHP, you should not use the pattern given in Validate an E-Mail Address with PHP, the Right Way from which I quote:
There is some danger that common usage and widespread sloppy coding will establish a de facto standard for e-mail addresses that is more restrictive than the recorded formal standard.
That is no better than all the other non-RFC patterns. It isn’t even smart enough to handle even RFC 822, let alone RFC 5322. This one, however, is.
If you want to get fancy and pedantic, implement a complete state engine. A regular expression can only act as a rudimentary filter. The problem with regular expressions is that telling someone that their perfectly valid e-mail address is invalid (a false positive) because your regular expression can't handle it is just rude and impolite from the user's perspective. A state engine for the purpose can both validate and even correct e-mail addresses that would otherwise be considered invalid as it disassembles the e-mail address according to each RFC. This allows for a potentially more pleasing experience, like
The specified e-mail address 'myemail@address,com' is invalid. Did you mean 'myemail@address.com'?
See also Validating Email Addresses, including the comments. Or Comparing E-mail Address Validating Regular Expressions.
Related Topic
- Regex – Regular expression to match a line that doesn’t contain a word
- Javascript – How to access the matched groups in a JavaScript regular expression
- Javascript – How to use a variable in a regular expression
- Javascript – Detecting an “invalid date” Date instance in JavaScript
- Javascript – Strip all non-numeric characters from string in JavaScript
- Regex – a non-capturing group in regular expressions
- Javascript – How to decide when to use Node.js
Best Answer
This should do it:
It matches any character which is not contained in the ASCII character set (0-127, i.e. 0x0 to 0x7F).
You can do the same thing with Unicode:
For unicode you can look at this 2 resources: