R – What exactly is copy-on-modify semantics in R, and where is the canonical source?

Tags: pass-by-reference, pass-by-value, r

Every once in a while I come across the notion that R has copy-on-modify semantics, for example in Hadley's devtools wiki.

Most R objects have copy-on-modify semantics, so modifying a function
argument does not change the original value

I can trace this term back to the R-Help mailing list. For example, Peter Dalgaard wrote in July 2003:

R is a functional language, with lazy evaluation and weak dynamic
typing (a variable can change type at will: a <- 1 ; a <- "a" is
allowed). Semantically, everything is copy-on-modify although some
optimization tricks are used in the implementation to avoid the worst
inefficiencies.

Similarly, Peter Dalgaard wrote in Jan 2004:

R has copy-on-modify semantics (in principle and sometimes in
practice) so once part of an object changes, you may have to look in
new places for anything that contained it, including possibly the
object itself.

Even further back, in Feb 2000 Ross Ihaka said:

We put quite a bit of work into making this happen. I would describe
the semantics as "copy on modify (if necessary)". Copying is done
only when objects are modified. The (if necessary) part means that if
we can prove that the modification cannot change any non-local
variables then we just go ahead and modify without copying.

It's not in the manual

No matter how hard I've searched, I can't find a reference to "copy-on-modify" in the R manuals, neither in the R Language Definition nor in R Internals.

Question

My question has two parts:

Where is this formally documented?
How does copy-on-modify work?

For example, is it proper to talk about "pass-by-reference", since a promise gets passed to the function?

Best Answer

Call-by-value

The R Language Definition says this (in section 4.3.3 Argument Evaluation)

The semantics of invoking a function in R argument are call-by-value. In general, supplied arguments behave as if they are local variables initialized with the value supplied and the name of the corresponding formal argument. Changing the value of a supplied argument within a function will not affect the value of the variable in the calling frame. [Emphasis added]

Whilst this does not describe the mechanism by which copy-on-modify works, it does mention that changing an object passed to a function doesn't affect the original in the calling frame.
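
As a minimal sketch of the behaviour the manual describes (the function name set_first and the variables v and w are made up for illustration), modifying an argument inside a function leaves the caller's variable alone:

```r
# Hypothetical helper: modifies its argument and returns the result.
set_first <- function(x) {
  x[1] <- 99   # changes the function's local x only
  x
}

v <- c(1, 2, 3)
w <- set_first(v)

v  # 1 2 3: the variable in the calling frame is untouched
w  # 99 2 3: the change is visible only through the return value
```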

Additional information, particularly on the copy-on-modify aspect, is given in the description of SEXPs in the R Internals manual, section 1.1.2 Rest of Header. Specifically it states [Emphasis added]

The named field is set and accessed by the SET_NAMED and NAMED macros, and take values 0, 1 and 2. R has a 'call by value' illusion, so an assignment like
b <- a
appears to make a copy of a and refer to it as b. However, if neither a nor b are subsequently altered there is no need to copy. What really happens is that a new symbol b is bound to the same value as a and the named field on the value object is set (in this case to 2). When an object is about to be altered, the named field is consulted. A value of 2 means that the object must be duplicated before being changed. (Note that this does not say that it is necessary to duplicate, only that it should be duplicated whether necessary or not.) A value of 0 means that it is known that no other SEXP shares data with this object, and so it may safely be altered. A value of 1 is used for situations like
dim(a) <- c(7, 2)
where in principle two copies of a exist for the duration of the computation as (in principle)
a <- `dim<-`(a, c(7, 2))
but for no longer, and so some primitive functions can be optimized to avoid a copy in this case.

Whilst this doesn't describe the situation whereby objects are passed to functions as arguments, we might deduce that the same process operates, especially given the information from the R Language definition quoted earlier.
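
One way to watch the delayed copy happen is tracemem(), which reports when a traced object is duplicated. The following is only a sketch: it assumes an R build with memory profiling enabled (hence the capabilities() check), and the exact point at which R duplicates can differ between R versions.

```r
a <- c(1, 2, 3)

# tracemem() raises an error if R was built without memory profiling,
# so check before using it.
can_trace <- unname(capabilities("profmem"))
if (can_trace) tracemem(a)

b <- a        # no copy yet: b is bound to the same vector as a
b[1] <- 99    # the shared vector is duplicated here; tracemem reports it

if (can_trace) untracemem(a)

a  # 1 2 3
b  # 99 2 3
```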

Promises in function evaluation

I don't think it is quite correct to say that a promise is passed to the function. The arguments are passed to the function and the actual expressions used are stored as promises (plus a pointer to the calling environment). Only when an argument gets evaluated is the expression stored in the promise retrieved and evaluated within the environment indicated by the pointer, a process known as forcing.

As such, I don't believe it is correct to talk about pass-by-reference in this regard. R has call-by-value semantics but tries to avoid copying unless a value passed to an argument is evaluated and modified.
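
The delayed evaluation of promises can be seen with a small sketch. The function names ignores_arg and uses_arg are hypothetical and exist only for this illustration:

```r
# Hypothetical functions used only for illustration.
ignores_arg <- function(x) "x was never touched"
uses_arg <- function(x) {
  message("inside the function, before x is used")
  x   # the promise is forced here
}

# The argument expression would raise an error if it were evaluated,
# but the promise is never forced, so the call succeeds.
r1 <- ignores_arg(stop("this is never evaluated"))

# The message from the argument expression appears after the message
# from the function body, because evaluation is delayed until x is used.
r2 <- uses_arg({ message("evaluating the argument"); 42 })
```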

The NAMED mechanism is an optimisation (as noted by @hadley in the comments) which allows R to track whether a copy needs to be made upon modification. There are some subtleties involved with exactly how the NAMED mechanism operates, as discussed by Peter Dalgaard (in the R Devel thread @mnel cites in their comment to the question).

Function that just reads / uses the parameter

pass by value:      0.12065005 seconds
pass by reference:  1.52171397 seconds

Function to write / change the parameter

pass by value:      1.52223396 seconds
pass by reference:  1.52388787 seconds

Conclusions

Passing the parameter by value was never slower in these timings, and it was far faster when the function only reads the parameter.
If the function changes the value of the variable passed, passing by reference and passing by value take practically the same time.

R – What are the differences between “=” and “<-” assignment operators in R?

The difference in assignment operators is clearer when you use them to set an argument value in a function call. For example:

median(x = 1:10)
x   
## Error: object 'x' not found

In this case, x is declared within the scope of the function, so it does not exist in the user workspace.

median(x <- 1:10)
x    
## [1]  1  2  3  4  5  6  7  8  9 10

In this case, x is declared in the user workspace, so you can use it after the function call has been completed.

There is a general preference among the R community for using <- for assignment (other than in function signatures) for compatibility with (very) old versions of S-Plus. Note that the spaces help to clarify situations like

x<-3
# Does this mean assignment?
x <- 3
# Or less than?
x < -3

Most R IDEs have keyboard shortcuts to make <- easier to type. Ctrl + = in Architect, Alt + - in RStudio (Option + - under macOS), Shift + - (underscore) in emacs+ESS.

If you prefer writing = to <- but want to use the more common assignment symbol for publicly released code (on CRAN, for example), then you can use one of the tidy_* functions in the formatR package to automatically replace = with <-.

library(formatR)
tidy_source(text = "x=1:5", arrow = TRUE)
## x <- 1:5

The answer to the question "Why does x <- y = 5 throw an error but not x <- y <- 5?" is "It's down to the magic contained in the parser". R's syntax contains many ambiguous cases that have to be resolved one way or another. The parser chooses to resolve the bits of the expression in different orders depending on whether = or <- was used.

To understand what is happening, you need to know that assignment silently returns the value that was assigned. You can see that more clearly by explicitly printing, for example print(x <- 2 + 3).
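
As a small sketch of that point, base R's withVisible() shows both the returned value and the fact that it is not printed automatically:

```r
res <- withVisible(x <- 2 + 3)
res$value    # 5
res$visible  # FALSE: the value is returned, just not auto-printed

# Because the value is returned, it can be used in a larger expression.
y <- (x <- 2 + 3) * 2
y  # 10
```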

Secondly, it's clearer if we use prefix notation for assignment. So

x <- 5
`<-`(x, 5)  #same thing

y = 5
`=`(y, 5)   #also the same thing

The parser interprets x <- y <- 5 as

`<-`(x, `<-`(y, 5))

We might expect that x <- y = 5 would then be

`<-`(x, `=`(y, 5))

but actually it gets interpreted as

`=`(`<-`(x, y), 5)

This is because = has lower precedence than <-, as shown on the ?Syntax help page.
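
The two parse trees can be inspected without evaluating anything. This is a sketch, with e1 and e2 as made-up names; the second expression is parsed from text because = inside quote() would be read as argument naming.

```r
# quote() captures the expression without evaluating it.
e1 <- quote(x <- y <- 5)
as.character(e1[[1]])  # "<-": the outermost call
deparse(e1[[3]])       # "y <- 5": nested as the value being assigned

# Parsing succeeds; it is only evaluation of this expression that fails.
e2 <- parse(text = "x <- y = 5")[[1]]
as.character(e2[[1]])  # "=": the outermost call
deparse(e2[[2]])       # "x <- y": the assignment target is itself a call
```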