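OpenStack: Error When Resuming an Instance
Problem
Today I tried to resume an instance that had been suspended several days earlier, and it failed with the following error: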
Traceback (most recent call last):
File "/usr/lib/python2.6/site-packages/nova-2012.2.5-py2.6.egg/nova/openstack/common/rpc/amqp.py", line 276, in _process_data
rval = self.proxy.dispatch(ctxt, version, method, **args)
File "/usr/lib/python2.6/site-packages/nova-2012.2.5-py2.6.egg/nova/openstack/common/rpc/dispatcher.py", line 145, in dispatch
return getattr(proxyobj, method)(ctxt, **kwargs)
File "/usr/lib/python2.6/site-packages/nova-2012.2.5-py2.6.egg/nova/exception.py", line 117, in wrapped
temp_level, payload)
File "/usr/lib64/python2.6/contextlib.py", line 23, in __exit__
self.gen.next()
File "/usr/lib/python2.6/site-packages/nova-2012.2.5-py2.6.egg/nova/exception.py", line 92, in wrapped
return f(*args, **kw)
File "/usr/lib/python2.6/site-packages/nova-2012.2.5-py2.6.egg/nova/compute/manager.py", line 176, in decorated_function
pass
File "/usr/lib64/python2.6/contextlib.py", line 23, in __exit__
self.gen.next()
File "/usr/lib/python2.6/site-packages/nova-2012.2.5-py2.6.egg/nova/compute/manager.py", line 162, in decorated_function
return function(self, context, *args, **kwargs)
File "/usr/lib/python2.6/site-packages/nova-2012.2.5-py2.6.egg/nova/compute/manager.py", line 197, in decorated_function
kwargs['instance']['uuid'], e, sys.exc_info())
File "/usr/lib64/python2.6/contextlib.py", line 23, in __exit__
self.gen.next()
File "/usr/lib/python2.6/site-packages/nova-2012.2.5-py2.6.egg/nova/compute/manager.py", line 191, in decorated_function
return function(self, context, *args, **kwargs)
File "/usr/lib/python2.6/site-packages/nova-2012.2.5-py2.6.egg/nova/compute/manager.py", line 1895, in resume_instance
self.driver.resume(instance)
File "/usr/lib/python2.6/site-packages/nova-2012.2.5-py2.6.egg/nova/exception.py", line 117, in wrapped
temp_level, payload)
File "/usr/lib64/python2.6/contextlib.py", line 23, in __exit__
self.gen.next()
File "/usr/lib/python2.6/site-packages/nova-2012.2.5-py2.6.egg/nova/exception.py", line 92, in wrapped
return f(*args, **kw)
File "/usr/lib/python2.6/site-packages/nova-2012.2.5-py2.6.egg/nova/virt/libvirt/driver.py", line 1014, in resume
self._create_domain(domain=dom)
File "/usr/lib/python2.6/site-packages/nova-2012.2.5-py2.6.egg/nova/virt/libvirt/driver.py", line 1921, in _create_domain
domain.createWithFlags(launch_flags)
File "/usr/lib/python2.6/site-packages/eventlet/tpool.py", line 187, in doit
result = proxy_call(self._autowrap, f, *args, **kwargs)
File "/usr/lib/python2.6/site-packages/eventlet/tpool.py", line 147, in proxy_call
rv = execute(f,*args,**kwargs)
File "/usr/lib/python2.6/site-packages/eventlet/tpool.py", line 76, in tworker
rv = meth(*args,**kwargs)
File "/usr/lib64/python2.6/site-packages/libvirt.py", line 650, in createWithFlags
if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
libvirtError: Unable to read from monitor: Connection reset by peer
/var/log/libvirt/libvirtd.log contains the following error:
2013-06-19 02:22:11.826+0000: 6270: error : qemuMonitorIORead:484 : Unable to read from monitor: Connection reset by peer
/var/log/libvirt/qemu/instance-00000082.log contains the following errors:
2013-06-19 02:20:56.120+0000: starting up
LC_ALL=C PATH=/sbin:/usr/sbin:/bin:/usr/bin QEMU_AUDIO_DRV=none /usr/libexec/qemu-kvm -S -M rhel6.3.0 -enable-kvm -m 4048 -smp 2,sockets=2,cores=1,threads=1 -name instance-00000082 -uuid 5152b583-801c-43a8-b1e9-77d3c9e29400 -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/instance-00000082.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/var/lib/nova/instances/instance-00000082/disk,if=none,id=drive-virtio-disk0,format=qcow2,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive file=/dev/disk/by-path/ip-10.61.2.14:3260-iscsi-iqn.2010-10.org.openstack:volume-14f91aaf-261d-4f0f-b5d0-ddb95d9aa206-lun-1,if=none,id=drive-virtio-disk3,format=raw,serial=14f91aaf-261d-4f0f-b5d0-ddb95d9aa206,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0xb,drive=drive-virtio-disk3,id=virtio-disk3 -drive file=/dev/disk/by-path/ip-10.61.2.15:3260-iscsi-iqn.2010-10.org.openstack:volume-e3427bc5-8f64-41ea-9b2d-96bf8b8a9abd-lun-1,if=none,id=drive-virtio-disk5,format=raw,serial=e3427bc5-8f64-41ea-9b2d-96bf8b8a9abd,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x9,drive=drive-virtio-disk5,id=virtio-disk5 -drive file=/dev/disk/by-path/ip-10.61.2.14:3260-iscsi-iqn.2010-10.org.openstack:volume-695fcb01-4ce1-4a62-a939-4e498a2dd06e-lun-1,if=none,id=drive-virtio-disk6,format=raw,serial=695fcb01-4ce1-4a62-a939-4e498a2dd06e,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk6,id=virtio-disk6 -netdev tap,fd=35,id=hostnet0 -device rtl8139,netdev=hostnet0,id=net0,mac=fa:16:3e:5b:21:05,bus=pci.0,addr=0x3 -chardev file,id=charserial0,path=/var/lib/nova/instances/instance-00000082/console.log -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1 -device 
usb-tablet,id=input0 -vnc 0.0.0.0:7 -k en-us -vga cirrus -incoming fd:28 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5
char device redirected to /dev/pts/13
Unknown savevm section or instance '0000:00:06.0/virtio-blk' 0
load of migration failed
2013-06-19 02:22:11.868+0000: shutting down
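One detail in this log may be worth noting: the saved state refers to a virtio-blk device at PCI address 0000:00:06.0, while none of the addr= values on the command line above is 0x6. One possible reading, which I have not verified, is that the device layout changed between the suspend and the resume (for example, a volume was detached). The following Python snippet is a minimal sketch of that check. The function name diagnose_restore_failure and the returned keys are made up for illustration, and the patterns are taken only from the log excerpt above, so other QEMU versions may word these messages differently.

```python
import re

# Patterns copied from the log excerpt above (assumption: other QEMU
# versions may phrase these messages differently).
UNKNOWN_SECTION = re.compile(r"Unknown savevm section or instance '([^']+)' (\d+)")
PCI_ID = re.compile(r"^[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:([0-9a-fA-F]{2})\.[0-7]/")
ADDR_ARG = re.compile(r"addr=0x([0-9a-fA-F]+)")


def diagnose_restore_failure(log_text):
    """Summarize a failed state restore found in a QEMU domain log.

    Hypothetical helper: it looks for the two error lines seen above and
    checks whether each unknown section's PCI slot appears among the
    addr=0x.. arguments found in the same text.
    """
    unknown = UNKNOWN_SECTION.findall(log_text)  # [(section_id, instance), ...]
    configured_slots = set(int(a, 16) for a in ADDR_ARG.findall(log_text))

    missing = []
    for section_id, _instance in unknown:
        match = PCI_ID.match(section_id)
        if match and int(match.group(1), 16) not in configured_slots:
            missing.append(section_id)

    return {
        "load_failed": "load of migration failed" in log_text,
        "unknown_sections": [section_id for section_id, _ in unknown],
        "sections_without_matching_slot": missing,
    }
```

Feeding it the contents of /var/log/libvirt/qemu/instance-00000082.log would be the intended use. If the log holds several start-ups, the text should be cut down to the last one first, since this sketch does not separate them.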
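The Suspend and Resume Process
I had suspended and resumed instances before without any problem. To check, I created a new instance, suspended it, and resumed it once the suspend had completed; that worked. Could the failure be caused by the instance having stayed suspended for too long? From rough observation, Suspend does the following:
1) Saves the instance's memory state as "<domain name>.save" in the /var/lib/libvirt/qemu/save directory. Note that the file is owned by root, as shown below: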
[root@stack6 ~]# ll /var/lib/libvirt/qemu/save/
total 5422824
-rw-------. 1 root root 4012820373 Jun  5 16:44 instance-00000082.save
-rw-------. 1 root root 1540140463 Jun  5 16:58 instance-000000e2.save
2) Deletes the instance's "<domain name>.monitor" file from /var/lib/libvirt/qemu.
3) Changes the owner of the console.log and disk files in the instance directory to root:root, as shown below:
[root@stack6 ~]# ll /var/lib/nova/instances/instance-00000082/
total 5377104
-rw-rw----. 1 root root 0 Jun 19 12:43 console.log
-rw-r--r--. 1 root root 5506269184 Jun  5 16:44 disk
-rw-r--r--. 1 nova nova 1421 Nov  1  2012 libvirt.xml
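The three observations above can be checked in one pass. The following Python snippet is a minimal sketch: the helper names describe_file and suspend_artifacts are hypothetical, the default paths are simply the ones seen on this host, and other deployments may place these files elsewhere.

```python
import os
import pwd


def describe_file(path):
    """Return (exists, owner_name, size_bytes); owner and size are None if missing."""
    try:
        st = os.stat(path)
    except OSError:
        return (False, None, None)
    try:
        owner = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        owner = str(st.st_uid)  # uid with no passwd entry
    return (True, owner, st.st_size)


def suspend_artifacts(domain,
                      save_dir="/var/lib/libvirt/qemu/save",
                      qemu_dir="/var/lib/libvirt/qemu",
                      instances_dir="/var/lib/nova/instances"):
    """Report the files that Suspend was observed to create, delete, or chown.

    Hypothetical helper; the default directories are assumptions based on
    the listings above.
    """
    paths = {
        "save_image": os.path.join(save_dir, domain + ".save"),
        "monitor_socket": os.path.join(qemu_dir, domain + ".monitor"),
        "console_log": os.path.join(instances_dir, domain, "console.log"),
        "disk": os.path.join(instances_dir, domain, "disk"),
    }
    return dict((name, describe_file(path)) for name, path in paths.items())
```

For a suspended instance one would expect, based on the observations above, save_image to exist and be owned by root, monitor_socket to be missing, and console_log and disk to be owned by root. Reading the save directory generally requires root.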
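Suspend calls managedSave. A more precise and detailed picture of the process could probably be obtained by tracing virsh managedsave with a tool such as strace, as shown below. Resume is the reverse of suspend, but it calls createWithFlags, and I could not find a corresponding virsh command.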
[root@stack5 ~]# strace -o managedsave.log virsh managedsave instance-00000193
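Solution
After a long search on Google I found no really good fix. The only workable approach I found is to run virsh managedsave-remove dom on the physical node hosting the instance, which deletes the saved memory image, and then restart the instance. The memory state from before the suspend is lost and the instance boots from scratch, as shown below: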
[root@stack6 ~]# virsh managedsave-remove instance-00000082
Removed managedsave image for domain instance-00000082
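The same workaround can be scripted through the libvirt Python bindings. The following is a minimal sketch, not a tested procedure for this deployment: the function name discard_saved_state is made up, and dom is assumed to behave like a libvirt-python virDomain object (hasManagedSaveImage, managedSaveRemove and create are methods of that class). As far as I know, starting a domain that still has a managed-save image makes libvirt try to restore from it, which is why the image is removed first.

```python
def discard_saved_state(dom, boot=False):
    """Remove a domain's managed-save image, if any; optionally cold-boot it.

    `dom` is assumed to be a libvirt-python virDomain (or an object with the
    same methods). Returns True if an image was removed. The memory state
    from before the suspend is lost.
    """
    removed = False
    if dom.hasManagedSaveImage(0):
        dom.managedSaveRemove(0)
        removed = True
    if boot:
        # Assumption: booting directly through libvirt bypasses Nova, so
        # Nova's record of the instance state may need to be reconciled.
        dom.create()
    return removed


# Usage on the compute node (requires the libvirt module and root):
#   import libvirt
#   conn = libvirt.open("qemu:///system")
#   dom = conn.lookupByName("instance-00000082")
#   discard_saved_state(dom)
```

Leaving boot at False and restarting the instance through OpenStack afterwards stays closest to the steps described above.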
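Addendum: after deleting the memory image with the method above and restarting the instance, further suspend and resume operations work normally. Quite magical!
North China University of Technology | Cloud Computing Research Center | Jiang Yong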