Finished the week with an interesting support call. To make a long story short, a customer ended up with a non-responsive VM. We tried to open the console on the VM but got more or less the following error (not the exact path, but you get the idea):
Error connecting: Error connecting to /vmfs/volumes/47a23275-63d1cb52-6968-0019b9e5c637/vCenter/vCenter.vmx because the VMX is not started.
Other VMs opened perfectly fine on this same ESX server; just this one VM was hosed. Couldn’t ping its IP either. Tried restarting mgmt-vmware from the service console, and that removed the VM name from the ESX server’s inventory the next time we logged in. There was just some weird placeholder VM instead, which I ended up removing from inventory. Next I tried to re-add the vCenter VM to inventory by browsing to the datastore. No luck; the process hung. So I restarted mgmt-vmware again, and this time decided to look at esxtop to see if this VM was still running or something. And it was... or at least something was running with its name. So now I set out to restart it with vmware-cmd.
Ran vmware-cmd from the service console and vCenter did not show up as a running VM. Weird! It’s in esxtop but not in vmware-cmd -l. So now I needed to find the process for this hung, possessed VM and kill it. So I tried the following:
ps -ef | grep vCenter, which I found on Google here: http://www.esxguide.com/esx/content/view/11/14/. Cool, but this kept returning an ever-increasing, changing PID. Not cool.
Then I discovered this gem from the VMware Communities. First line of attack: a new ps argument from page 8 of the PDF.
ps axu | grep vCenter -> Still a fail. PID kept incrementing every time I ran the command. Is something relaunching, or what? Must find root process but how?
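One possible explanation for the ever-changing PID (an assumption on my part; nothing above confirms it) is that grep was matching its own entry in the process list. Every run of grep is a brand-new process whose command line contains the search term, so it shows up with a fresh PID each time. A minimal sketch of a filter that drops that entry follows; `filter_procs` is a hypothetical helper name, not something from the forum thread.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post or thread).
# Reads `ps` output on stdin and prints the lines that mention the given
# name, skipping any line that belongs to grep itself, since grep gets a
# new PID on every run.
filter_procs() {
    name="$1"
    grep -- "$name" | grep -v -- "grep"
}

# Intended usage on the service console:
#   ps -ef | filter_procs vCenter
```

Note that this also hides any real process with "grep" in its command line, which is acceptable for a quick check but worth keeping in mind.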
Next up, /proc-FU. On page 9, michaelstan of the Communities suggests the following, which I followed verbatim, looking for my VM named “vCenter”:
(at the cmd prompt enter) cat /proc/vmware/vm/*/names
This lists the running VM’s on the host server you are logged on to.
vmid=1069 pid=-1 cfgFile="/vmfs/volumes/45…/server1/server1.vmx" uuid="50…" displayName="server1"
vmid=1107 pid=-1 cfgFile="/vmfs/volumes/45…/server2/server2.vmx" uuid="50…" displayName="server2"
vmid=1149 pid=-1 cfgFile="/vmfs/volumes/45…/server3/server3.vmx" uuid="50…" displayName="server3"
vmid=1156 pid=-1 cfgFile="/vmfs/volumes/45…/server4/server4.vmx" uuid="50…" displayName="server4"
vmid=1170 pid=-1 cfgFile="/vmfs/volumes/45…/server5/server5.vmx" uuid="50…" displayName="server6"
vmid=1178 pid=-1 cfgFile="/vmfs/volumes/45…/server6/server6.vmx" uuid="50…" displayName="server6"
vmid=1188 pid=-1 cfgFile="/vmfs/volumes/45…/server7/server7.vmx" uuid="50…" displayName="server7"
vmid=1198 pid=-1 cfgFile="/vmfs/volumes/45…/server8/server8.vmx" uuid="50…" displayName="server8"
[-If you are running ESX 2.5 then you can kill the vmx PID-]
If you are running ESX 3.0.x then you find group ID that controls the PID of the VM.
(at the cmd prompt enter) less -S /proc/vmware/vm/1149/cpu/status
vcpu vm type name uptime status costatus usedsec syssec wait waitsec idlesec (more…)
1149 1149 V vmm0:server3 350042.494 WAIT STOP 15968.954 518.916 COW 325800.734 322397.266 (more…)
Scroll right with the right arrow key to locate the “group” pid. In this case the group pid was 1148 (not shown in this example)
Now with the group PID you can kill the VM safely without corrupting the VM as posted earlier.
(at the cmd prompt enter) /usr/lib/vmware/bin/vmkload_app -k 9 1148
Warning: Apr 20 16:22:22.710: Sending signal '9' to world 1148.
THIS MEANS SUCCESS… if you receive another line then the process might not have been successful.
Hope this helps!
Michael Stan
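The first step of the quoted procedure, picking the right vmid out of the names listing, can be sketched as a small filter. This assumes the real file uses plain ASCII double quotes around the values (the typographic quotes in a blog rendering are most likely the blog's doing) and that each VM sits on one line starting with `vmid=`. `vmid_for_name` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper (not from the original thread).
# Reads lines in the format printed by `cat /proc/vmware/vm/*/names` on
# stdin and prints the vmid of every line whose displayName equals "$1".
# Assumes ASCII double quotes around the values.
vmid_for_name() {
    awk -v want="displayName=\"$1\"" '
        # index() does a literal substring match; the closing quote in
        # "want" keeps "server1" from matching "server10".
        index($0, want) && $1 ~ /^vmid=[0-9]+$/ {
            sub(/^vmid=/, "", $1)
            print $1
        }'
}

# Intended usage on the service console:
#   cat /proc/vmware/vm/*/names | vmid_for_name vCenter
```

If two VMs share a display name, both vmids are printed, so check the cfgFile path before acting on the result.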
In short, I did the following (the commands from the quoted post above):
- Ran “cat /proc/vmware/vm/*/names” to find the VMID of the VM in question.
- Ran “less -S /proc/vmware/vm/1149/cpu/status”, where 1149 was the VMID found in step 1, and then hit the right arrow until I found the “group” PID.
- Ran “/usr/lib/vmware/bin/vmkload_app -k 9 1148”, where 1148 was the group PID found in step 2.
- Received the following “success” message: “Warning: Apr 20 16:22:22.710: Sending signal '9' to world 1148.” Then ran esxtop to verify the VM was no longer running, which it wasn’t.
- Re-added the VM to inventory and all is well.
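Scrolling sideways in less to find the “group” column is easy to get wrong, so here is a rough sketch that pulls the value out by column name instead. It assumes the status file is a header line followed by one row per vCPU, with whitespace-separated columns that line up one-to-one with the header (true for the columns shown in the quoted example, but I have not verified it for the ones that were cut off). `group_from_status` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper (not from the original thread).
# Reads a header line plus data rows on stdin, finds the column whose
# header is exactly "group", and prints that field from each data row.
# Assumes whitespace-separated columns with no embedded spaces.
group_from_status() {
    awk '
        NR == 1 {
            for (i = 1; i <= NF; i++)
                if ($i == "group") col = i
            next
        }
        col { print $col }'
}

# Intended usage on the service console (1149 is the vmid from step 1):
#   group_from_status < /proc/vmware/vm/1149/cpu/status
```

A VM with several vCPUs would print one line per vCPU; if the values differ, or nothing is printed at all, fall back to reading the file by hand with less -S before killing anything.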
Haven’t had a hung VM since the ESX 2.5 days, so it was a fun little challenge to finish out my Friday afternoon. But thought I would quickly share for the benefit of all.
Ping me on Twitter if you have questions. vSeanClark
UPDATE: Jason Boche suggested I could have arrived at the PID with ps -auxwww | grep VM-Name. Well, that would have been quite a bit simpler, but it wouldn’t have given me an opportunity to say /proc-FU again.