The short answer is that a TCP segment that is too large can be fragmented into several IP packets. The Maximum Segment Size (MSS) is the largest amount of TCP data that can fit into a single IP packet without fragmentation. The size of the IP packet is limited by the Maximum Transmission Unit (MTU), which depends on the underlying link (the data-link protocol and its medium).
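To put numbers on the MSS/MTU relationship, here is a minimal sketch. It assumes IPv4 and TCP headers of 20 bytes each (no options) and counts the MSS as TCP payload only; the function names are made up for illustration.

```python
IPV4_HEADER_MIN = 20  # bytes, IPv4 header without options (assumption)
TCP_HEADER_MIN = 20   # bytes, TCP header without options (assumption)


def mss_for_mtu(mtu: int,
                ip_header: int = IPV4_HEADER_MIN,
                tcp_header: int = TCP_HEADER_MIN) -> int:
    """Largest TCP payload that fits in one IP packet of size `mtu`."""
    return mtu - ip_header - tcp_header


def segments_needed(data_len: int, mss: int) -> int:
    """Number of segments needed to carry `data_len` bytes (ceiling division)."""
    return -(-data_len // mss)


mss = mss_for_mtu(1500)            # 1500 is the usual Ethernet MTU
print(mss)                         # 1460
print(segments_needed(4000, mss))  # 3
```

With header options in use, the headers grow and the usable payload per packet shrinks accordingly.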
You seem to have the general idea.
First, there really is no Physical Layer header. The Physical Layer simply takes what the Data-Link Layer gives it, encodes the Data-Link frame for the interface (encoding), and places that on the "wire" (signalling). When receiving data, it performs the opposite.
When layer-3 sends a packet to layer-2, layer-2 needs to encapsulate that with a layer-2 header. Part of the layer-2 header may include a MAC address for layer-2 protocols that use MAC addresses (not all do, and of those that do, some use 48-bit MAC addresses, and some use 64-bit MAC addresses).
In order to put a destination MAC address in the layer-2 frame, the destination layer-3 address must be resolved to a layer-2 address. Layer-2 needs the destination layer-2 MAC address in order to build the layer-2 frame to encapsulate the layer-3 packet. That is where ARP (Address Resolution Protocol) comes in.
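As a rough sketch of that resolution step, the toy cache below maps layer-3 addresses to MAC addresses and only "asks the network" on a miss. Nothing here sends real ARP traffic: `ArpCache`, `FAKE_LAN`, and the resolver callback are hypothetical stand-ins for what the network stack does.

```python
from typing import Callable, Dict, Optional


class ArpCache:
    """Toy ARP cache: maps layer-3 addresses to layer-2 (MAC) addresses."""

    def __init__(self, resolver: Callable[[str], Optional[str]]) -> None:
        self._table: Dict[str, str] = {}
        self._resolver = resolver  # stands in for an ARP request/reply
        self.requests_sent = 0

    def resolve(self, ip: str) -> Optional[str]:
        mac = self._table.get(ip)
        if mac is None:
            # Cache miss: "send" an ARP request and remember the reply.
            self.requests_sent += 1
            mac = self._resolver(ip)
            if mac is not None:
                self._table[ip] = mac
        return mac


# Hypothetical LAN: the hosts that would answer an ARP request.
FAKE_LAN = {"192.0.2.10": "02:00:00:00:00:0a"}

cache = ArpCache(FAKE_LAN.get)
print(cache.resolve("192.0.2.10"))  # resolved via the fake "request"
print(cache.resolve("192.0.2.10"))  # answered from the cache
print(cache.requests_sent)          # 1
```

Only once the MAC address is known can layer-2 build the frame header around the layer-3 packet.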
ARP has to do with data leaving the host, not coming into the host where the headers are stripped. Depending on the network stack implementation, as a layer-2 frame comes into a host, the MAC address may get saved in the host's ARP cache as the frame is stripped from the packet. Layer-2 will inspect the layer-2 frame to determine to which layer-3 protocol in the network stack the frame payload (layer-3 packet) should be sent.
The EtherType field is a field in ethernet frame headers. Other layer-2 protocols have other ways of doing this. Remember, ethernet may be the most used layer-2 protocol, but it is not the only layer-2 protocol, and each has its own frame header. Layer-3 protocols can have the same type of thing. For instance, IPv4 has the Protocol field, and IPv6 has the Next Header field, to tell layer-3 to which layer-4 protocol the payload of the layer-3 packet should be sent.
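To make the demultiplexing concrete, here is a small sketch that reads the EtherType from an Ethernet II header and the Protocol field from an IPv4 header. It assumes an untagged frame (no 802.1Q tag) and a hand-built sample frame; the function names and lookup tables are illustrative only and list just a few well-known values.

```python
import struct

# A few well-known values; real stacks know many more.
ETHERTYPES = {0x0800: "IPv4", 0x0806: "ARP", 0x86DD: "IPv6"}
IP_PROTOCOLS = {1: "ICMP", 6: "TCP", 17: "UDP"}


def parse_ethernet(frame: bytes):
    """Split an untagged Ethernet II frame into (dst, src, ethertype, payload)."""
    dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
    return dst, src, ethertype, frame[14:]


def ipv4_protocol(packet: bytes) -> int:
    """Return the Protocol field, which is byte 9 of the IPv4 header."""
    return packet[9]


# Hand-built sample: a 20-byte IPv4 header (no options, checksum left at 0)
# saying its payload is TCP, wrapped in an Ethernet II header.
ip_header = struct.pack("!BBHHHBBH4s4s",
                        0x45, 0, 20, 0, 0, 64, 6, 0,
                        bytes([192, 0, 2, 1]), bytes([192, 0, 2, 2]))
frame = (bytes.fromhex("ffffffffffff")    # destination MAC
         + bytes.fromhex("020000000001")  # source MAC
         + struct.pack("!H", 0x0800)      # EtherType: IPv4
         + ip_header)

_, _, ethertype, payload = parse_ethernet(frame)
print(ETHERTYPES.get(ethertype, "unknown"))                 # IPv4
print(IP_PROTOCOLS.get(ipv4_protocol(payload), "unknown"))  # TCP
```

Each layer only looks at its own header to decide where to hand the payload next.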
Best Answer
Remember that the OSI model is just a model, and nothing in the real world actually adheres to it.
I believe what this is trying to get across to you is that the application in one host is peering with the application in the other host. Also, the transport protocol in one host is peering with the transport protocol in the other host, the network protocol in one host is peering with the network protocol in the other host, and the data-link protocol in one host is peering with the data-link protocol in the other host*.
The data that one application sends to the other application ends up in the destination application unchanged. Yes, as the data moves down the network stack in the sending host, it gets headers from the various network layers attached to it, but as it travels up the network stack in the destination host, those headers are stripped off, leaving the original data from the source unchanged.
Each network layer in the source host adds a header for the corresponding network layer in the destination host, and the corresponding network layer in the destination host will strip off the header, leaving the PDU for the next layer unchanged from the source.
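The add-a-header, strip-a-header idea can be sketched with toy headers. The bracketed text tags below are placeholders, not real protocol headers, and `encapsulate`/`decapsulate` are made-up names.

```python
# Layers in the order the sending host applies them (top of the stack first).
LAYERS = ["transport", "network", "data-link"]


def encapsulate(data: bytes) -> bytes:
    """Sending host: each layer prepends its own (toy) header."""
    pdu = data
    for layer in LAYERS:
        pdu = f"[{layer}]".encode() + pdu
    return pdu


def decapsulate(frame: bytes) -> bytes:
    """Receiving host: each layer strips the header its peer added."""
    pdu = frame
    for layer in reversed(LAYERS):
        header = f"[{layer}]".encode()
        if not pdu.startswith(header):
            raise ValueError(f"missing {layer} header")
        pdu = pdu[len(header):]
    return pdu


frame = encapsulate(b"hello")
print(frame)               # b'[data-link][network][transport]hello'
print(decapsulate(frame))  # b'hello'
```

The application data comes out of the receiving stack exactly as it went into the sending stack.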
*This holds true for the data-link layer only if both hosts are on the same data-link LAN. If the network packet must cross to another LAN, each router in the path will strip off the data-link header, replacing it with its own data-link header for the next network through which it will forward the network packet.
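As a toy illustration of that footnote, the sketch below has a router discard the incoming data-link header and build a new one for the next network while carrying the same layer-3 packet. The `Frame` type, the `forward` function, and the MAC addresses are hypothetical, and the sketch ignores the changes a real IPv4 router makes inside the packet, such as decrementing the TTL and updating the header checksum.

```python
from typing import NamedTuple


class Frame(NamedTuple):
    src_mac: str
    dst_mac: str
    packet: bytes  # the layer-3 packet carried as the frame payload


def forward(frame: Frame, out_mac: str, next_hop_mac: str) -> Frame:
    """Strip the old data-link header and wrap the packet in a new one."""
    return Frame(src_mac=out_mac, dst_mac=next_hop_mac, packet=frame.packet)


incoming = Frame("02:00:00:00:00:01", "02:00:00:00:00:fe", b"layer-3 packet")
outgoing = forward(incoming, "02:00:00:00:01:fe", "02:00:00:00:01:02")

print(outgoing.packet == incoming.packet)    # True: packet carried across
print(outgoing.dst_mac == incoming.dst_mac)  # False: new data-link header
```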