To make a VM mobile, you want to be able to move its physical location without changing its apparent network location.
What that means is that we want to be able to put it on the same virtual Ethernet network regardless of which host machine it is sitting on. As long as a system is generating at least some broadcast traffic, the Ethernet switches will quickly figure out that it has moved and update their forwarding tables.
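The way switches "figure out" that a VM has moved can be illustrated with a toy model of MAC learning. This is only a sketch: the `LearningSwitch` class, the port numbers and the MAC addresses are all made up for illustration, and a real switch also ages entries out and handles VLANs.

```python
BROADCAST = "ff:ff:ff:ff:ff:ff"


class LearningSwitch:
    """Toy model of Ethernet MAC learning; all names here are hypothetical."""

    def __init__(self, ports):
        self.ports = list(ports)
        self.table = {}  # MAC address -> port where it was last seen

    def receive(self, src_mac, dst_mac, in_port):
        """Process one frame and return the list of ports it is sent out of."""
        # Learn (or re-learn) the sender's location from the source address.
        self.table[src_mac] = in_port
        out_port = self.table.get(dst_mac)
        if dst_mac == BROADCAST or out_port is None:
            # Broadcast or unknown destination: flood everywhere else.
            return [p for p in self.ports if p != in_port]
        # Known destination: forward on one port, or drop if it is local.
        return [] if out_port == in_port else [out_port]


switch = LearningSwitch(ports=[1, 2, 3])
vm = "52:54:00:00:00:01"    # hypothetical VM MAC address
peer = "52:54:00:00:00:02"  # hypothetical peer MAC address

switch.receive(vm, BROADCAST, in_port=1)      # VM first seen behind port 1
before = switch.receive(peer, vm, in_port=2)  # peer -> VM is sent to port 1
switch.receive(vm, BROADCAST, in_port=3)      # VM migrates, then broadcasts
after = switch.receive(peer, vm, in_port=2)   # peer -> VM is now sent to port 3
print(before, after)
```

A single broadcast from the new location is enough to overwrite the stale table entry, which is why a freshly migrated VM usually announces itself (for example with a gratuitous ARP).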
In a very small setup we might just put all our VMs on one flat Ethernet network. In such a setup we can migrate the VMs trivially. The downside is that there is no isolation: every VM can talk directly to every other VM, broadcast traffic flows to all VMs, and so on.
A step up from that is VLANs: we can split our Ethernet network into a number of virtual Ethernet networks. If we can establish a link from any host box into any VLAN, then again we can migrate our VM seamlessly. That works OK at moderate scales.
Unfortunately, at large scales VLANs start to break down as a solution for decoupling physical and logical topologies. There are fewer than 4096 usable VLAN tags (the VLAN ID is 12 bits, and values 0 and 4095 are reserved, leaving 4094), and Ethernet's spanning-tree structure makes it difficult to build reliable high-bandwidth networks. It is also difficult to serve a default gateway IP from multiple locations, so traffic may travel considerable distances in the network before reaching the default gateway (and quite possibly be sent back the way it came).
Which is where VXLAN comes in. VXLAN lets you build virtual Ethernet overlay networks on top of an IP underlay network. It can be used on its own in a "learning" mode, using IP multicast on the underlay network to carry broadcast, unknown-unicast and multicast (BUM) traffic for the overlay network. Alternatively, it can be used in conjunction with MP-BGP, with the VXLAN endpoints advertising MAC addresses and IP addresses for the VXLANs to each other over BGP and simulating a virtual default gateway at each endpoint. Other than needing to support slightly larger frames than normal (sometimes known as "baby jumbos"), the underlay network is just a regular IP network.
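To put a number on "slightly larger frames", here is a rough sketch of the encapsulation arithmetic. It assumes the usual header sizes (14-byte inner Ethernet header, 8-byte VXLAN header, 8-byte UDP header, 20-byte outer IPv4 or 40-byte outer IPv6 header with no options); the function name and parameters are mine, not from any particular tool.

```python
INNER_ETHERNET = 14  # inner MAC header without an 802.1Q tag
DOT1Q_TAG = 4        # optional inner VLAN tag
VXLAN_HEADER = 8
UDP_HEADER = 8
OUTER_IPV4 = 20      # assumes no IP options
OUTER_IPV6 = 40      # assumes no extension headers


def underlay_ip_mtu(overlay_ip_mtu, outer_ipv6=False, inner_vlan_tag=False):
    """Smallest underlay IP MTU that carries a full-size overlay packet."""
    inner_frame = overlay_ip_mtu + INNER_ETHERNET
    if inner_vlan_tag:
        inner_frame += DOT1Q_TAG
    outer_ip = OUTER_IPV6 if outer_ipv6 else OUTER_IPV4
    return inner_frame + VXLAN_HEADER + UDP_HEADER + outer_ip


print(underlay_ip_mtu(1500))                   # IPv4 underlay
print(underlay_ip_mtu(1500, outer_ipv6=True))  # IPv6 underlay
```

Under these assumptions, a standard 1500-byte overlay MTU needs an underlay IP MTU of 1550 bytes over IPv4 and 1570 bytes over IPv6, which is why the underlay links are typically configured a little above 1500.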
Furthermore, VXLAN is designed to allow scaling of the underlay network using techniques such as link aggregation and equal-cost multipath. To communicate flow information from the overlay network to the underlay network, the UDP source port of the outer packet is based on a hash of the headers of the inner packet, so underlay devices that hash on the outer headers still spread different inner flows across different links.
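The source-port trick can be sketched as follows. This is an illustration only: real VTEPs use their own (often hardware) hash and choice of fields, whereas this uses CRC32 over a string of inner header fields purely for demonstration. The function name and the example addresses are hypothetical; the 49152-65535 range is the dynamic port range that RFC 7348 recommends for the source port.

```python
import zlib

PORT_MIN, PORT_MAX = 49152, 65535  # dynamic/private port range


def vxlan_source_port(src_mac, dst_mac, src_ip, dst_ip, proto, sport, dport):
    """Derive the outer UDP source port from inner-flow header fields.

    CRC32 is a stand-in hash chosen for this sketch, not what any
    particular implementation uses.
    """
    key = "|".join(
        str(field)
        for field in (src_mac, dst_mac, src_ip, dst_ip, proto, sport, dport)
    ).encode()
    return PORT_MIN + zlib.crc32(key) % (PORT_MAX - PORT_MIN + 1)


# Hypothetical inner flow: one TCP connection between two VMs.
flow = ("52:54:00:00:00:01", "52:54:00:00:00:02",
        "10.0.0.1", "10.0.0.2", "tcp", 40000, 443)
port = vxlan_source_port(*flow)
print(port)
```

Every packet of the same inner flow gets the same outer source port, so the flow stays on one path and is not reordered, while different inner flows between the same pair of endpoints usually get different ports and can be spread over different underlay links.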
VXLAN also allows over 16 million network IDs (the VNI field is 24 bits), which should be more than enough even for very large datacenters.
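For the curious, the 16 million figure falls straight out of the 8-byte VXLAN header layout in RFC 7348: one flags byte (0x08 means "VNI present"), 24 reserved bits, the 24-bit VNI, and 8 more reserved bits. The sketch below packs and unpacks that header; the function names are my own.

```python
import struct

VNI_BITS = 24
VXLAN_FLAG_VNI_VALID = 0x08  # the "I" flag


def pack_vxlan_header(vni):
    """Build the 8-byte VXLAN header for a given network ID."""
    if not 0 <= vni < 2 ** VNI_BITS:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags byte followed by 24 reserved bits.
    # Word 2: 24-bit VNI followed by 8 reserved bits.
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def unpack_vni(header):
    """Extract the VNI from the first 8 bytes of a VXLAN header."""
    _flags_word, vni_word = struct.unpack("!II", header[:8])
    return vni_word >> 8


header = pack_vxlan_header(5000)
print(header.hex())   # 0800000000138800
print(2 ** VNI_BITS)  # 16777216 possible network IDs
```

Compare that with the 12-bit VLAN ID in an 802.1Q tag, which gives only 4096 values.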
All I can do is explain how a pretty successful large network does it.
Each of the hundreds to thousands of end-sites on an MPLS VPN is in the same private BGP AS, so site-to-site traffic is switched directly by the carrier MPLS cloud. The data centers each have their own private BGP ASes. So, the WAN is a mixture of iBGP and eBGP. Each end-site and data center runs its own separate IGP, injecting the default and specific routes from the MPLS cloud(s). Although the standard defines only one IGP, each site's IGP is independent of all the other sites'.
Some end-sites have one WAN circuit, and some sites have two WAN circuits. Of the sites with two WAN circuits, some have both circuits on one carrier (required to terminate at separate carrier POPs), and some have one circuit on each of two different carriers. Obviously, the data centers have large-pipe connections to all the carriers, but the end-site circuits are right-sized for the traffic to/from the particular site.
Each end-site gets a default route to the MPLS cloud, and a few specific prefixes from the data centers.
This was arrived at after many years of various arrangements. Using an IGP across the WAN for hundreds to thousands of sites just proved too problematic (actually slowing IGP convergence to a crawl), and forcing traffic to a central site, even when the traffic was site-to-site, added too much latency.
Best Answer
Yes, from the packet switching point-of-view, VXLAN is just a matter of sticking some encapsulation on top of an L2 frame: something that other protocols do as well.
The real difference it makes is at the control and management layer.
VXLAN evolved as a Data Center technology, so the ability to span a WAN is just an additional advantage, not the thing that drives the technology.
Consider a cloud service provider with a data center that can contain thousands upon thousands of virtual machines. These VMs can belong to different enterprises (the cloud provider's customers), all doing different things: running e-commerce applications, online shopping, ML/AI applications (like suggesting what to buy for your wife for her birthday :-), managing calendars and meetings, and so on.
In an environment like this, the 802.1Q VLAN limit of 4096 is laughably inadequate. The data center admins need a way to segment their network in more flexible and fine-grained ways.
Also, unlike, say, an enterprise's network wiring, which follows a hierarchical model (access -> distribution -> core), the devices in the data center need to be wired up in a more-or-less flat manner.
So basically imagine a huge flat LAN with a very large number of hosts.
Next, you also want to provide redundancy: protection against failure of individual switches and individual links. Spanning tree is of course a non-starter here, because it blocks redundant links and we want every link spewing data close to its max capacity. Hence the IP-based fabric, and the good things that IP comes with (like routing protocols and equal-cost multipath support).
Next, when you get a new customer for your data center, you want to be able to deploy their VMs ASAP (in hours if not minutes), which means you want to add a new switch to the fabric without disturbing the existing switches. So, in a fabric that contains 77 switches, when you add the 78th, you most certainly do not want to spend time provisioning 77 L2TPv3 tunnels :-)
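To see why hand-provisioned point-to-point tunnels do not scale, here is the back-of-the-envelope arithmetic as a small sketch. It assumes a full mesh with one tunnel per pair of switches; the function names are made up for illustration.

```python
def full_mesh_tunnels(switches):
    """Number of point-to-point tunnels in a full mesh of `switches` nodes."""
    return switches * (switches - 1) // 2


def tunnels_added_by_next_switch(existing_switches):
    """New tunnels needed when one more switch joins a full mesh."""
    return full_mesh_tunnels(existing_switches + 1) - full_mesh_tunnels(existing_switches)


print(full_mesh_tunnels(77))             # 2926 tunnels already in place
print(tunnels_added_by_next_switch(77))  # 77 more for the 78th switch
print(full_mesh_tunnels(78))             # 3003 in total afterwards
```

Each of those 77 new tunnels also has to be configured at both ends, so every existing switch is touched when one switch is added.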
Hence the first line from Wikipedia's VXLAN page: "Virtual Extensible LAN (VXLAN) is a network virtualization technology that attempts to address the scalability problems associated with large cloud computing deployments."