First and foremost, the "pass by value vs. pass by reference" distinction as defined in the CS theory is now obsolete because the technique originally defined as "pass by reference" has since fallen out of favor and is seldom used now.1
Newer languages2 tend to use a different (but similar) pair of techniques to achieve the same effects (see below) which is the primary source of confusion.
A secondary source of confusion is the fact that in "pass by reference", "reference" has a narrower meaning than the general term "reference" (because the phrase predates it).
Now, the authentic definition is:
When a parameter is passed by reference, the caller and the callee use the same variable for the parameter. If the callee modifies the parameter variable, the effect is visible to the caller's variable.
When a parameter is passed by value, the caller and callee have two independent variables with the same value. If the callee modifies the parameter variable, the effect is not visible to the caller.
Things to note in this definition are:
"Variable" here means the caller's (local or global) variable itself -- i.e. if I pass a local variable by reference and assign to it, I'll change the caller's variable itself, not e.g. whatever it is pointing to if it's a pointer.
- This is now considered bad practice (as an implicit dependency). As such, virtually all newer languages are exclusively, or almost exclusively pass-by-value. Pass-by-reference is now chiefly used in the form of "output/inout arguments" in languages where a function cannot return more than one value.
The meaning of "reference" in "pass by reference". The difference with the general "reference" term is that this "reference" is temporary and implicit. What the callee basically gets is a "variable" that is somehow "the same" as the original one. How specifically this effect is achieved is irrelevant (e.g. the language may also expose some implementation details -- addresses, pointers, dereferencing -- this is all irrelevant; if the net effect is this, it's pass-by-reference).
Now, in modern languages, variables tend to be of "reference types" (another concept invented later than "pass by reference" and inspired by it), i.e. the actual object data is stored separately somewhere (usually, on the heap), and only "references" to it are ever held in variables and passed as parameters.3
Passing such a reference falls under pass-by-value because a variable's value is technically the reference itself, not the referred object. However, the net effect on the program can be the same as either pass-by-value or pass-by-reference:
- If a reference is just taken from a caller's variable and passed as an argument, this has the same effect as pass-by-reference: if the referred object is mutated in the callee, the caller will see the change.
- However, if a variable holding this reference is reassigned, it will stop pointing to that object, so any further operations on this variable will instead affect whatever it is pointing to now.
- To have the same effect as pass-by-value, a copy of the object is made at some point. Options include:
- The caller can just make a private copy before the call and give the callee a reference to that instead.
- In some languages, some object types are "immutable": any operation on them that seems to alter the value actually creates a completely new object without affecting the original one. So, passing an object of such a type as an argument always has the effect of pass-by-value: a copy for the callee will be made automatically if and when it needs a change, and the caller's object will never be affected.
- In functional languages, all objects are immutable.
As you may see, this pair of techniques is almost the same as those in the definition, only with a level of indirection: just replace "variable" with "referenced object".
There's no agreed-upon name for them, which leads to contorted explanations like "call by value where the value is a reference". In 1975, Barbara Liskov suggested the term "call-by-object-sharing" (or sometimes just "call-by-sharing") though it never quite caught on. Moreover, neither of these phrases draws a parallel with the original pair. No wonder the old terms ended up being reused in the absence of anything better, leading to confusion.4
NOTE: For a long time, this answer used to say:
Say I want to share a web page with you. If I tell you the URL, I'm
passing by reference. You can use that URL to see the same web page I
can see. If that page is changed, we both see the changes. If you
delete the URL, all you're doing is destroying your reference to that
page - you're not deleting the actual page itself.
If I print out the page and give you the printout, I'm passing by
value. Your page is a disconnected copy of the original. You won't see
any subsequent changes, and any changes that you make (e.g. scribbling
on your printout) will not show up on the original page. If you
destroy the printout, you have actually destroyed your copy of the
object - but the original web page remains intact.
This is mostly correct except the narrower meaning of "reference" -- it being both temporary and implicit (it doesn't have to, but being explicit and/or persistent are additional features, not a part of the pass-by-reference semantic, as explained above). A closer analogy would be giving you a copy of a document vs inviting you to work on the original.
1Unless you are programming in Fortran or Visual Basic, it's not the default behavior, and in most languages in modern use, true call-by-reference is not even possible.
2A fair amount of older ones support it, too
3In several modern languages, all types are reference types. This approach was pioneered by the language CLU in 1975 and has since been adopted by many other languages, including Python and Ruby. And many more languages use a hybrid approach, where some types are "value types" and others are "reference types" -- among them are C#, Java, and JavaScript.
4There's nothing bad with recycling a fitting old term per se, but one has to somehow make it clear which meaning is used each time. Not doing that is exactly what keeps causing confusion.
string
? wstring
?
std::string
is a basic_string
templated on a char
, and std::wstring
on a wchar_t
.
char
vs. wchar_t
char
is supposed to hold a character, usually an 8-bit character.
wchar_t
is supposed to hold a wide character, and then, things get tricky:
On Linux, a wchar_t
is 4 bytes, while on Windows, it's 2 bytes.
What about Unicode, then?
The problem is that neither char
nor wchar_t
is directly tied to unicode.
On Linux?
Let's take a Linux OS: My Ubuntu system is already unicode aware. When I work with a char string, it is natively encoded in UTF-8 (i.e. Unicode string of chars). The following code:
#include <cstring>
#include <iostream>
int main()
{
const char text[] = "olé";
std::cout << "sizeof(char) : " << sizeof(char) << "\n";
std::cout << "text : " << text << "\n";
std::cout << "sizeof(text) : " << sizeof(text) << "\n";
std::cout << "strlen(text) : " << strlen(text) << "\n";
std::cout << "text(ordinals) :";
for(size_t i = 0, iMax = strlen(text); i < iMax; ++i)
{
unsigned char c = static_cast<unsigned_char>(text[i]);
std::cout << " " << static_cast<unsigned int>(c);
}
std::cout << "\n\n";
// - - -
const wchar_t wtext[] = L"olé" ;
std::cout << "sizeof(wchar_t) : " << sizeof(wchar_t) << "\n";
//std::cout << "wtext : " << wtext << "\n"; <- error
std::cout << "wtext : UNABLE TO CONVERT NATIVELY." << "\n";
std::wcout << L"wtext : " << wtext << "\n";
std::cout << "sizeof(wtext) : " << sizeof(wtext) << "\n";
std::cout << "wcslen(wtext) : " << wcslen(wtext) << "\n";
std::cout << "wtext(ordinals) :";
for(size_t i = 0, iMax = wcslen(wtext); i < iMax; ++i)
{
unsigned short wc = static_cast<unsigned short>(wtext[i]);
std::cout << " " << static_cast<unsigned int>(wc);
}
std::cout << "\n\n";
}
outputs the following text:
sizeof(char) : 1
text : olé
sizeof(text) : 5
strlen(text) : 4
text(ordinals) : 111 108 195 169
sizeof(wchar_t) : 4
wtext : UNABLE TO CONVERT NATIVELY.
wtext : ol�
sizeof(wtext) : 16
wcslen(wtext) : 3
wtext(ordinals) : 111 108 233
You'll see the "olé" text in char
is really constructed by four chars: 110, 108, 195 and 169 (not counting the trailing zero). (I'll let you study the wchar_t
code as an exercise)
So, when working with a char
on Linux, you should usually end up using Unicode without even knowing it. And as std::string
works with char
, so std::string
is already unicode-ready.
Note that std::string
, like the C string API, will consider the "olé" string to have 4 characters, not three. So you should be cautious when truncating/playing with unicode chars because some combination of chars is forbidden in UTF-8.
On Windows?
On Windows, this is a bit different. Win32 had to support a lot of application working with char
and on different charsets/codepages produced in all the world, before the advent of Unicode.
So their solution was an interesting one: If an application works with char
, then the char strings are encoded/printed/shown on GUI labels using the local charset/codepage on the machine, which could not be UTF-8 for a long time. For example, "olé" would be "olé" in a French-localized Windows, but would be something different on an cyrillic-localized Windows ("olй" if you use Windows-1251). Thus, "historical apps" will usually still work the same old way.
For Unicode based applications, Windows uses wchar_t
, which is 2-bytes wide, and is encoded in UTF-16, which is Unicode encoded on 2-bytes characters (or at the very least, UCS-2, which just lacks surrogate-pairs and thus characters outside the BMP (>= 64K)).
Applications using char
are said "multibyte" (because each glyph is composed of one or more char
s), while applications using wchar_t
are said "widechar" (because each glyph is composed of one or two wchar_t
. See MultiByteToWideChar and WideCharToMultiByte Win32 conversion API for more info.
Thus, if you work on Windows, you badly want to use wchar_t
(unless you use a framework hiding that, like GTK or QT...). The fact is that behind the scenes, Windows works with wchar_t
strings, so even historical applications will have their char
strings converted in wchar_t
when using API like SetWindowText()
(low level API function to set the label on a Win32 GUI).
Memory issues?
UTF-32 is 4 bytes per characters, so there is no much to add, if only that a UTF-8 text and UTF-16 text will always use less or the same amount of memory than an UTF-32 text (and usually less).
If there is a memory issue, then you should know than for most western languages, UTF-8 text will use less memory than the same UTF-16 one.
Still, for other languages (chinese, japanese, etc.), the memory used will be either the same, or slightly larger for UTF-8 than for UTF-16.
All in all, UTF-16 will mostly use 2 and occassionally 4 bytes per characters (unless you're dealing with some kind of esoteric language glyphs (Klingon? Elvish?), while UTF-8 will spend from 1 to 4 bytes.
See https://en.wikipedia.org/wiki/UTF-8#Compared_to_UTF-16 for more info.
Conclusion
When I should use std::wstring over std::string?
On Linux? Almost never (§).
On Windows? Almost always (§).
On cross-platform code? Depends on your toolkit...
(§) : unless you use a toolkit/framework saying otherwise
Can std::string
hold all the ASCII character set including special characters?
Notice: A std::string
is suitable for holding a 'binary' buffer, where a std::wstring
is not!
On Linux? Yes.
On Windows? Only special characters available for the current locale of the Windows user.
Edit (After a comment from Johann Gerell):
a std::string
will be enough to handle all char
-based strings (each char
being a number from 0 to 255). But:
- ASCII is supposed to go from 0 to 127. Higher
char
s are NOT ASCII.
- a
char
from 0 to 127 will be held correctly
- a
char
from 128 to 255 will have a signification depending on your encoding (unicode, non-unicode, etc.), but it will be able to hold all Unicode glyphs as long as they are encoded in UTF-8.
Is std::wstring
supported by almost all popular C++ compilers?
Mostly, with the exception of GCC based compilers that are ported to Windows.
It works on my g++ 4.3.2 (under Linux), and I used Unicode API on Win32 since Visual C++ 6.
What is exactly a wide character?
On C/C++, it's a character type written wchar_t
which is larger than the simple char
character type. It is supposed to be used to put inside characters whose indices (like Unicode glyphs) are larger than 255 (or 127, depending...).
Best Answer
I found myself disagreeing with the highest-voted answer, so I went looking for expert opinons and here they are. From http://channel9.msdn.com/Shows/Going+Deep/C-and-Beyond-2011-Scott-Andrei-and-Herb-Ask-Us-Anything
Herb Sutter: "when you pass shared_ptrs, copies are expensive"
Scott Meyers: "There's nothing special about shared_ptr when it comes to whether you pass it by value, or pass it by reference. Use exactly the same analysis you use for any other user defined type. People seem to have this perception that shared_ptr somehow solves all management problems, and that because it's small, it's necessarily inexpensive to pass by value. It has to be copied, and there is a cost associated with that... it's expensive to pass it by value, so if I can get away with it with proper semantics in my program, I'm gonna pass it by reference to const or reference instead"
Herb Sutter: "always pass them by reference to const, and very occasionally maybe because you know what you called might modify the thing you got a reference from, maybe then you might pass by value... if you copy them as parameters, oh my goodness you almost never need to bump that reference count because it's being held alive anyway, and you should be passing it by reference, so please do that"
Update: Herb has expanded on this here: http://herbsutter.com/2013/06/05/gotw-91-solution-smart-pointer-parameters/, although the moral of the story is that you shouldn't be passing shared_ptrs at all "unless you want to use or manipulate the smart pointer itself, such as to share or transfer ownership."