Is your storage not set to allow image files? Go to Datacenter -> Storage tab, select your storage and edit. Under "Content", make sure that "Images" is selected.
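If you'd rather do it from the shell, `pvesm set` should work too. A sketch, assuming your storage is named "local" (adjust the name and content list to your setup):

```
# Enable disk images (plus ISO and template content) on storage "local".
# The storage name and the content list are assumptions -- match your setup.
pvesm set local --content images,iso,vztmpl
```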
I just checked mine and am seeing the same thing as the OP. I would scp the file up and then manually edit the VM's .conf file.
Local storage on Proxmox is in /var/lib/vz. There should be an "images" subdirectory containing a directory for each VM (named by its VM number). You can scp the files directly there.
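Something like this, for example (VMID 100, the file name, and the hostname are placeholders):

```
# Copy the disk image into local storage for VM 100.
# "100", "mydisk.raw", and "pve-host" are placeholders.
scp mydisk.raw root@pve-host:/var/lib/vz/images/100/
```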
For adding the existing file to the VM, I've had good luck editing the VM's .conf file directly. Look in /etc/pve/qemu-server/ for a file with the VM number followed by .conf.
It's a good idea to create a second test VM so you can refer to its .conf file and make sure you get the syntax right.
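For reference, on directory-backed storage a disk line in the .conf looks roughly like the one below; the bus/slot, VMID, file name, and size are placeholders, so copy the exact syntax from your test VM:

```
# Excerpt from /etc/pve/qemu-server/100.conf -- attach the copied image.
# "scsi0", the VMID, the file name, and the size are placeholders.
scsi0: local:100/mydisk.raw,size=32G
```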
-- adding text from a comment below
I think you can scp the file up. You probably want to look at /var/lib/vz/images/{VMID}/ for the destination. Then maybe look at /etc/pve/qemu-server/{VMID}.conf and add a line for the storage.
I'm almost certain your problem is not caused by a single factor but by a combination of factors. What those factors are is not certain, but most likely one is the network interface or its driver, and another is on the switch itself. Hence it is quite likely the problem can only be reproduced with this particular brand of switch combined with this particular brand of network interface.
You suspect the trigger for the problem is something happening on one individual server, which then has a kernel panic whose effects somehow manage to propagate across the switch. This sounds plausible, but I'd say it is about as likely that the trigger is somewhere else.
It could be that something is happening on the switch or the network interface which simultaneously causes the kernel panic and the link issues on the switch. In other words, even if the server had not suffered a kernel panic, the trigger may very well have brought down connectivity on the switch anyway.
One has to ask what could possibly happen on an individual server that could have this effect on the other servers. It shouldn't be possible, so the explanation has to involve a flaw somewhere in the system.
If it were just the link between the crashed server and the switch that went down or became unstable, that should have no effect on the link state of the other servers. If it does, that counts as a flaw in the switch. And traffic-wise, the other servers should see slightly less traffic once the crashed server lost connectivity, which cannot explain the problem they are seeing.
This leads me to believe a design flaw on the switch is likely.
However, a link problem is not the first explanation one would look for when trying to explain how an issue on one server could cause problems for other servers on the same switch. A broadcast storm would be a more obvious explanation. But could there be a connection between a server having a kernel panic and a broadcast storm?
Multicast packets and packets destined for unknown MAC addresses are treated more or less the same as broadcasts, so a storm of such packets would count as well. Could the panicked server be trying to send a crash dump across the network to a MAC address not recognized by the switch?
If that's the trigger, then something is also going wrong on the other servers, because a packet storm should not cause this kind of error on the network interface. "Reset adapter unexpectedly" does not sound like a packet storm (which should just cause a drop in performance, but no errors as such), and it does not sound like a link problem (which should have resulted in messages about links going down, not the error you are seeing).
So it is likely there is some flaw in the network interface hardware or driver, which is triggered by the switch.
A few suggestions that can give additional clues:
- Can you hook up some other equipment to the switch and look at what traffic appears when the problem shows up? (I predict it either goes quiet or you see a flood.) See the capture sketch after this list.
- Would it be possible to replace the network interface on one of the servers with a different brand using a different driver, to see whether the behavior changes?
- Is it possible to replace one of the switches with a different brand? I expect replacing the switch will stop the problem from affecting multiple servers. What's more interesting is whether it also stops the kernel panics from happening.
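For the first suggestion, a minimal capture sketch; "eth0" is an assumption, use whatever interface you patch into the switch:

```
# Watch for a broadcast/multicast flood (or total silence) on the wire.
# "eth0" is a placeholder for the monitoring interface.
tcpdump -i eth0 -n -e 'ether broadcast or ether multicast'
```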
Best Answer
First, back up what you have; ideally all nodes and all VM/CT data.
Then recover the cluster to a stable state. Determine the cause of your last data loss. Crash it and make sure it comes back. fsck the file systems to be sure the data survives.
Now you can rebuild. The .raw should contain a filesystem, which you can indeed use again. At the very least you can mount it as-is and try to recover data.
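A minimal sketch for inspecting the image, assuming the .raw holds a filesystem directly with no partition table (if it is partitioned, set up a loop device with `losetup --partscan` first):

```
# Mount the raw image read-only via a loop device and poke around.
# The image path and mount point are placeholders.
mkdir -p /mnt/recovery
mount -o loop,ro /path/to/vm-disk.raw /mnt/recovery
ls /mnt/recovery
```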
Regarding the .conf files: those live on Proxmox's replicated, database-backed file system. See Proxmox Cluster file system (pmxcfs) for more about pmxcfs. In particular, you might be able to stop pve-cluster on a node with the same name, replace config.db, and reboot.
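A rough outline of that procedure, assuming you have a known-good copy of config.db saved somewhere (the backup path below is a placeholder, and this is an untested sketch, so keep copies of everything):

```
# Stop the cluster filesystem service, swap in the saved database, reboot.
# /root/config.db.backup is a placeholder for your saved copy.
systemctl stop pve-cluster
cp /var/lib/pve-cluster/config.db /root/config.db.bad    # keep the broken one
cp /root/config.db.backup /var/lib/pve-cluster/config.db
reboot
```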
There is not a lot to the .conf, though; it contains the options from when you set up the container. Rebuilding the config is always an option. Then stop the CT and replace the .raw with what you have. Note that the IP or MAC on the network interface may differ if you didn't recover the previous addresses.
The web UI doesn't seem to allow editing the disk entry, though. If you need to change the path or size, use a text editor on the .conf, as in the sketch below.
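A rough sketch of that swap, assuming directory-backed storage and CT 100 (the VMID, file names, and size are all placeholders):

```
# Stop the container, then drop the recovered image in place.
# "100" and the file names are placeholders.
pct stop 100
cp /root/recovered.raw /var/lib/vz/images/100/vm-100-disk-0.raw

# In /etc/pve/lxc/100.conf the root disk line looks roughly like:
#   rootfs: local:100/vm-100-disk-0.raw,size=8G
# Edit the path or size there with a text editor if they changed.
```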