You seem to have the general idea.
First, there really is no Physical Layer header. The Physical Layer simply takes what the Data-Link Layer gives it, encodes the Data-Link frame for the interface (encoding), and places it on the "wire" (signalling). When receiving data, it does the reverse.
When layer-3 sends a packet to layer-2, layer-2 needs to encapsulate that with a layer-2 header. Part of the layer-2 header may include a MAC address for layer-2 protocols that use MAC addresses (not all do, and of those that do, some use 48-bit MAC addresses, and some use 64-bit MAC addresses).
In order to put a destination MAC address in the layer-2 frame, the destination layer-3 address must be resolved to a layer-2 address. Layer-2 needs the destination MAC address to build the layer-2 frame that encapsulates the layer-3 packet. That is where ARP (Address Resolution Protocol) comes in.
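A minimal sketch of that outgoing step, assuming Ethernet II with 48-bit MAC addresses and IPv4. The ARP cache here is just a hypothetical Python dict, and the function names are made up for illustration. Note that for a destination on another network, the address resolved is the next-hop router's IP, not the final destination's.

```python
import ipaddress
import struct

ETHERTYPE_IPV4 = 0x0800  # tells the receiver the payload is an IPv4 packet

# Hypothetical ARP cache: IPv4 address -> 48-bit MAC address (6 bytes).
arp_cache: dict[ipaddress.IPv4Address, bytes] = {
    ipaddress.IPv4Address("192.0.2.1"): bytes.fromhex("001122334455"),
}

def build_ethernet_header(next_hop_ip: str, src_mac: bytes) -> bytes:
    """Resolve next_hop_ip via the ARP cache and build a 14-byte Ethernet II header."""
    ip = ipaddress.IPv4Address(next_hop_ip)
    dst_mac = arp_cache.get(ip)
    if dst_mac is None:
        # A real stack would broadcast an ARP request here and queue the packet
        # until the reply arrives; this sketch just gives up.
        raise LookupError(f"no ARP entry for {ip}; an ARP request is needed")
    # Destination MAC, source MAC, EtherType (network byte order, no padding).
    return struct.pack("!6s6sH", dst_mac, src_mac, ETHERTYPE_IPV4)

def encapsulate(ip_packet: bytes, next_hop_ip: str, src_mac: bytes) -> bytes:
    """Wrap a layer-3 packet in a layer-2 header to form the frame."""
    return build_ethernet_header(next_hop_ip, src_mac) + ip_packet
```

For example, encapsulate(packet, "192.0.2.1", bytes.fromhex("aabbccddeeff")) returns the packet prefixed with a header addressed to 00:11:22:33:44:55.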
ARP has to do with data leaving the host, not with data coming into the host, where the headers are stripped. Depending on the network stack implementation, as a layer-2 frame comes into a host, the sender's MAC address may be saved in the host's ARP cache as the frame header is stripped from the packet. Layer-2 inspects the layer-2 frame header to determine to which layer-3 protocol in the network stack the frame payload (the layer-3 packet) should be sent.
The EtherType field in the Ethernet frame header serves this purpose. Other layer-2 protocols have other ways of doing this. Remember, Ethernet may be the most widely used layer-2 protocol, but it is not the only one, and each has its own frame header. Layer-3 protocols have the same kind of field. For instance, IPv4 has the Protocol field and IPv6 has the Next Header field, which tell layer-3 to which layer-4 protocol the payload of the layer-3 packet should be sent.
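To make the receive-side demultiplexing concrete, here is a rough sketch that reads the EtherType and then the IPv4 Protocol or IPv6 Next Header field. It assumes untagged Ethernet II frames, ignores VLAN tags and IPv6 extension headers, and the demux function and lookup tables are illustrative only.

```python
import struct
from typing import Optional, Tuple

# EtherType -> layer-3 protocol the payload is handed to.
ETHERTYPES = {0x0800: "IPv4", 0x0806: "ARP", 0x86DD: "IPv6"}
# IPv4 Protocol / IPv6 Next Header values share one IANA registry.
IP_PROTOCOLS = {1: "ICMP", 6: "TCP", 17: "UDP", 58: "ICMPv6"}

def demux(frame: bytes) -> Tuple[str, Optional[str]]:
    """Return (layer-3 protocol, layer-4 protocol) for an Ethernet II frame."""
    _dst, _src, ethertype = struct.unpack("!6s6sH", frame[:14])
    packet = frame[14:]  # layer-2 header stripped: this is the layer-3 packet
    l3 = ETHERTYPES.get(ethertype, f"unknown(0x{ethertype:04x})")
    if l3 == "IPv4":
        proto = packet[9]  # Protocol field: byte 9 of the IPv4 header
    elif l3 == "IPv6":
        proto = packet[6]  # Next Header field: byte 6 of the IPv6 fixed header
    else:
        return l3, None    # e.g. ARP has no layer-4 payload
    return l3, IP_PROTOCOLS.get(proto, f"unknown({proto})")
```

Each layer only looks at its own header to pick the next handler, which is why the stack never needs to understand the whole frame at once.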
I'll start with the oft-repeated caveat (at least by me, anyway) that things don't fit neatly into the OSI model.
"Ping" is the name of an application that generates ICMP echo request packets and receives echo reply packets. ICMP doesn't neatly fit into the OSI or TCP/IP model, so you can call it layer 3 or layer 3.5, depending on your point of view.
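As a sketch of what ping actually puts inside the IP packet, the code below builds an ICMP echo request (type 8, code 0) with the RFC 1071 Internet checksum; an echo reply uses type 0. Sending it usually needs a raw socket and elevated privileges, so that part is deliberately left out, and the function names are hypothetical.

```python
import struct

def internet_checksum(data: bytes) -> int:
    """RFC 1071 one's-complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length data with a zero byte
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back into 16 bits
    return ~total & 0xFFFF

def icmp_echo_request(identifier: int, sequence: int, payload: bytes = b"") -> bytes:
    """Build an ICMP echo request: type, code, checksum, identifier, sequence, data."""
    # Compute the checksum with the checksum field set to zero...
    unchecked = struct.pack("!BBHHH", 8, 0, 0, identifier, sequence) + payload
    checksum = internet_checksum(unchecked)
    # ...then insert it into the header.
    return struct.pack("!BBHHH", 8, 0, checksum, identifier, sequence) + payload
```

A receiver validates the message by running the same checksum over the whole ICMP message, including the checksum field; a correct message sums to zero.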
In addition to the data, some metadata must be communicated between the application, transport, and internet layers.
Technically, how metadata is communicated between layers is an implementation detail. In practice, the application layer nearly always uses some variant of the Berkeley sockets API to talk to the transport layer.
For TCP clients, the destination IP and port are passed to the transport layer in the "connect" API call. For UDP clients, either "connect" can be used to create a pseudo-connection, or the destination IP and port can be specified per packet with the "sendto" API call.
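The two UDP client styles can be sketched with Python's socket module, which is a thin wrapper over the Berkeley sockets calls. A throwaway server on 127.0.0.1 is assumed only so the example runs on its own; the message contents are arbitrary.

```python
import socket

# A throwaway UDP server on loopback so the example is self-contained.
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))  # port 0: the OS picks a free port
server.settimeout(2.0)         # don't hang forever if a datagram goes missing
server_addr = server.getsockname()

# Style 1: no connect(); the destination IP and port go with every datagram.
c1 = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
c1.sendto(b"via sendto", server_addr)

# Style 2: connect() a UDP socket. Nothing is sent on the wire; the stack just
# records a default destination (and only accepts datagrams from that peer),
# so plain send() can be used afterwards.
c2 = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
c2.connect(server_addr)
c2.send(b"via connect")

received = {server.recvfrom(1024)[0] for _ in range(2)}

for s in (server, c1, c2):
    s.close()
```

A TCP client looks like style 2, except that connect() really does perform the handshake before returning.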
For TCP servers, the application can read the client's IP and port by calling getpeername after accepting a connection. UDP servers can read the IP and port of each packet by receiving packets with the recvfrom API call.
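Here is a small sketch of the server side, again using Python's socket module on loopback so it runs standalone. The TCP half reads the peer with getpeername on the accepted socket; the UDP half gets the sender's address from recvfrom and replies to it with sendto.

```python
import socket

# --- TCP: the peer address belongs to the connected (accepted) socket ---
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))      # port 0: let the OS choose a free port
listener.listen()
client = socket.create_connection(listener.getsockname())
client_addr = client.getsockname()   # what the server should see as its peer
conn, _ = listener.accept()          # accept() also returns the peer address
tcp_peer = conn.getpeername()        # (ip, port) of the client

# --- UDP: no connection, so each datagram arrives with its source address ---
udp_server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp_server.bind(("127.0.0.1", 0))
udp_client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp_client.bind(("127.0.0.1", 0))    # explicit bind so getsockname() is concrete
udp_client_addr = udp_client.getsockname()
for s in (udp_server, udp_client):
    s.settimeout(2.0)                # avoid hanging if something goes wrong

udp_client.sendto(b"hello", udp_server.getsockname())
data, udp_peer = udp_server.recvfrom(1024)     # udp_peer is the sender's (ip, port)
udp_server.sendto(b"reply:" + data, udp_peer)  # answer whoever recvfrom reported
reply = udp_client.recv(1024)

for s in (conn, client, listener, udp_server, udp_client):
    s.close()
```

Because udp_server is bound to one specific address, its reply can only come from that address; binding to 0.0.0.0 on a host with several addresses is where the source address of the reply becomes a question.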
Unfortunately, sendto and recvfrom have a design flaw: they pass only the remote address, not the local one, which can cause problems for servers on multihomed hosts. The server may send replies from the wrong IP address, causing them to be dropped by either the network or the client. There are newer APIs to deal with this, but the details vary between operating systems.
The transport layer in turn tells the internet layer the IP addresses for outgoing packets, and the internet layer tells the transport layer the IP addresses for incoming packets. Since both the transport and internet layers are typically part of the same TCP/IP stack, how this is done is an implementation detail inside the stack.
X-Forwarded-For is an HTTP header used by HTTP proxies. The proxy retrieves the client IP address using getpeername and then encodes it into an HTTP header to pass it on to the next server.
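A minimal sketch of that proxy step, assuming headers are held in a plain Python dict and following the common convention that each proxy appends the address it saw to a comma-separated list. The function name add_x_forwarded_for is hypothetical, and peer stands for whatever getpeername returned on the proxy's client-facing socket.

```python
def add_x_forwarded_for(headers: dict[str, str], peer: tuple[str, int]) -> dict[str, str]:
    """Return a copy of headers with the client IP appended to X-Forwarded-For.

    `peer` is the (ip, port) tuple from getpeername() on the client-side socket.
    """
    client_ip = peer[0]  # only the IP is forwarded, not the port
    out = dict(headers)
    # HTTP header names are case-insensitive, so look for any existing spelling.
    existing = next((k for k in out if k.lower() == "x-forwarded-for"), None)
    if existing is None:
        out["X-Forwarded-For"] = client_ip
    else:
        # An earlier proxy already added entries; append ours to the list.
        out[existing] = f"{out[existing]}, {client_ip}"
    return out
```

Since the client itself can send an X-Forwarded-For header, the next server can only trust the entries that were added by proxies it knows about.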