escape()
Don't use it!
escape()
is defined in section B.2.1.2 escape and the introduction text of Annex B says:
... All of the language features and behaviours specified in this annex have one or more undesirable characteristics and in the absence of legacy usage would be removed from this specification. ...
... Programmers should not use or assume the existence of these features and behaviours when writing new ECMAScript code....
Behaviour:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/escape
Special characters are encoded with the exception of: @*_+-./
The hexadecimal form for characters, whose code unit value is 0xFF or less, is a two-digit escape sequence: %xx
.
For characters with a greater code unit, the four-digit format %uxxxx
is used. This is not allowed within a query string (as defined in RFC3986):
query = *( pchar / "/" / "?" )
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded = "%" HEXDIG HEXDIG
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
A percent sign is only allowed if it is directly followed by two hexdigits, percent followed by u
is not allowed.
encodeURI()
Use encodeURI when you want a working URL. Make this call:
encodeURI("http://www.example.org/a file with spaces.html")
to get:
http://www.example.org/a%20file%20with%20spaces.html
Don't call encodeURIComponent since it would destroy the URL and return
http%3A%2F%2Fwww.example.org%2Fa%20file%20with%20spaces.html
Note that encodeURI, like encodeURIComponent, does not escape the ' character.
encodeURIComponent()
Use encodeURIComponent when you want to encode the value of a URL parameter.
var p1 = encodeURIComponent("http://example.org/?a=12&b=55")
Then you may create the URL you need:
var url = "http://example.net/?param1=" + p1 + "¶m2=99";
And you will get this complete URL:
http://example.net/?param1=http%3A%2F%2Fexample.org%2F%Ffa%3D12%26b%3D55¶m2=99
Note that encodeURIComponent does not escape the '
character. A common bug is to use it to create html attributes such as href='MyUrl'
, which could suffer an injection bug. If you are constructing html from strings, either use "
instead of '
for attribute quotes, or add an extra layer of encoding ('
can be encoded as %27).
For more information on this type of encoding you can check: http://en.wikipedia.org/wiki/Percent-encoding
Rather than mess with the encode and decode methods I find it easier to specify the encoding when opening the file. The io
module (added in Python 2.6) provides an io.open
function, which has an encoding parameter.
Use the open method from the io
module.
>>>import io
>>>f = io.open("test", mode="r", encoding="utf-8")
Then after calling f's read() function, an encoded Unicode object is returned.
>>>f.read()
u'Capit\xe1l\n\n'
Note that in Python 3, the io.open
function is an alias for the built-in open
function. The built-in open function only supports the encoding argument in Python 3, not Python 2.
Edit: Previously this answer recommended the codecs module. The codecs module can cause problems when mixing read()
and readline()
, so this answer now recommends the io module instead.
Use the open method from the codecs module.
>>>import codecs
>>>f = codecs.open("test", "r", "utf-8")
Then after calling f's read() function, an encoded Unicode object is returned.
>>>f.read()
u'Capit\xe1l\n\n'
If you know the encoding of a file, using the codecs package is going to be much less confusing.
See http://docs.python.org/library/codecs.html#codecs.open
Best Answer
It's a
’
(RIGHT SINGLE QUOTATION MARK
- U+2019) character which is being decoded as CP-1252 instead of UTF-8. If you check the encodings table, then you see that this character is in UTF-8 composed of bytes0xE2
,0x80
and0x99
. If you check the CP-1252 code page layout, then you'll see that each of those bytes stand for the individual charactersâ
,€
and™
.Use UTF-8 instead of CP-1252 to read, write, store, and display the characters.
This only instructs the client which encoding to use to interpret and display the characters. This doesn't instruct your own program which encoding to use to read, write, store, and display the characters in. The exact answer depends on the server side platform / database / programming language used. Do note that the one set in HTTP response header has precedence over the HTML meta tag. The HTML meta tag would only be used when the page is opened from local disk file system instead of from HTTP.
This only forces the client which encoding to use to interpret and display the characters. But the actual problem is that you're already sending
’
(encoded in UTF-8) to the client instead of’
. The client is correctly displaying’
using the UTF-8 encoding. If the client was misinstructed to use, for example ISO-8859-1, you would likely have seenââ¬â¢
instead.This is most likely where your problem lies. You need to verify with an independent database tool what the data looks like.
If the
’
character is there, then you aren't connecting to the database correctly. You need to tell the database connector to use UTF-8.If your database contains
’
, then it's your database that's messed up. Most probably the tables aren't configured to useUTF-8
. Instead, they use the database's default encoding, which varies depending on the configuration. If this is your issue, then usually just altering the table to use UTF-8 is sufficient. If your database doesn't support that, you'll need to recreate the tables. It is good practice to set the encoding of the table when you create it.You're most likely using SQL Server, but here is some MySQL code (copied from this article):
If your table is however already UTF-8, then you need to take a step back. Who or what put the data there. That's where the problem is. One example would be HTML form submitted values which are incorrectly encoded/decoded.
Here are some more links to learn more about the problem: