From Robert.Grosschopff at suse.com Tue Aug 7 08:53:15 2018
From: Robert.Grosschopff at suse.com (Robert Grosschopff)
Date: Tue, 7 Aug 2018 14:53:15 +0000
Subject: [Deepsea-users] timeout disengage.safety
Message-ID:

Hi *,

I'd like to purge an existing cluster, using

'salt-run disengage.safety; salt-run state.orch ceph.purge'

Unfortunately, it takes salt-run almost 30 seconds to come back. By the time the purge runs, safety is already engaged again. Where can I increase the time so I can get rid of the cluster without reinstalling it from scratch? Where can I see why it takes so long?

On some systems with deepsea version 0.8.4+git.0.a58d1c5d4 I do not get any "dmidecode not found" messages; on others (0.8.2+git.0.6b39c2648) I keep getting that (useless) error message. Can I just download a current deepsea version from git and do a "make install", or will it screw up the system?

Thanks
Robert

From Robert.Grosschopff at suse.com Tue Aug 7 15:01:55 2018
From: Robert.Grosschopff at suse.com (Robert Grosschopff)
Date: Tue, 7 Aug 2018 21:01:55 +0000
Subject: [Deepsea-users] timeout disengage.safety
In-Reply-To: <6162379.QuZyDaMG8i@fury.home>
References: <6162379.QuZyDaMG8i@fury.home>
Message-ID: <5FF07872-6CBF-4664-9568-49CA1CB3C620@suse.com>

Hi Eric,

thanks a lot. Will run the individual steps tomorrow.

My feeling is that the individual stages take too long to finish. If the timeout is one minute (I thought it was 300 seconds) then that explains it. Running

'date; salt-run disengage.safety; date; salt-run disengage.check; date; salt-run disengage.check'

showed me that the last check returns false. No wonder the purge does not go through.

test.ping works fine. Regular pings as well. A salt-run net.ping tries to ping the cluster network IPs from the public IP, which it can't reach, and thus takes quite long to come back. I suppose this is how it should be. Excluding the cluster IPs makes it fast and the rtt is somewhere around 0.9 ms.

Not sure where to look in order to determine why disengage.safety takes about 30-40 seconds, causing ceph.purge to run into the timeout.

Thanks
Robert

-----Original Message-----
From: on behalf of Eric Jackson
Reply-To: Discussions about the DeepSea management framework for Ceph
Date: Tuesday, 7. August 2018 at 21:05
To: "deepsea-users at lists.suse.com"
Subject: Re: [Deepsea-users] timeout disengage.safety

Hi Robert,

The timeout is one minute. Are all the minions responsive?

As far as purging without the check, you can run the three steps in /srv/salt/ceph/purge/default.sls directly.

salt 'admin*' state.apply ceph.reset
salt -I cluster:ceph state.apply ceph.rescind.storage.terminate
salt -I cluster:ceph state.apply ceph.rescind

Or copy the default.sls to another name such as mypurge.sls, remove the check and then run it

salt-run state.orch ceph.purge.mypurge

Eric
> _______________________________________________
> Deepsea-users mailing list
> Deepsea-users at lists.suse.com
> http://lists.suse.com/mailman/listinfo/deepsea-users

From alanj at supermicro.com Tue Aug 7 15:40:05 2018
From: alanj at supermicro.com (Alan Johnson)
Date: Tue, 7 Aug 2018 21:40:05 +0000
Subject: [Deepsea-users] timeout disengage.safety
In-Reply-To: <5FF07872-6CBF-4664-9568-49CA1CB3C620@suse.com>
References: <6162379.QuZyDaMG8i@fury.home> <5FF07872-6CBF-4664-9568-49CA1CB3C620@suse.com>
Message-ID: <70f76625abf94b95b82d95b323d86325@EX2013-MBX3.supermicro.com>

I see the same thing so what I did was to run the disengage safety in one window and then recall the command prior to the minute elapsing until the cluster started to purge, but I agree one minute is too short - could it be made configurable such as passing an argument?

Thx
Alan

From jschmid at suse.de Wed Aug 8 02:20:25 2018
From: jschmid at suse.de (Joshua Schmid)
Date: Wed, 8 Aug 2018 10:20:25 +0200
Subject: [Deepsea-users] timeout disengage.safety
In-Reply-To: <70f76625abf94b95b82d95b323d86325@EX2013-MBX3.supermicro.com>
References: <6162379.QuZyDaMG8i@fury.home> <5FF07872-6CBF-4664-9568-49CA1CB3C620@suse.com> <70f76625abf94b95b82d95b323d86325@EX2013-MBX3.supermicro.com>
Message-ID: <20180808082025.mskargnlhhuoi3mt@f154.suse.de>

Alan Johnson wrote on Tue, 07. Aug 21:40:
> I see the same thing so what I did was to run the disengage safety in one window and then recall the command prior to the minute elapsing until the cluster started to purge, but I agree one minute is too short - could it be made configurable such as passing an argument?
>

We have a reference implementation for exactly this.

https://github.com/SUSE/DeepSea/pull/963

We can get it in if we find out that this is really needed.

Are you saying that when you run `salt-run disengage.safety; salt-run state.orch ceph.purge` the purge process is not going through?

> Thx
>
> Alan

--
Joshua Schmid
Software Engineer
SUSE Enterprise Storage

From Robert.Grosschopff at suse.com Wed Aug 8 06:03:24 2018
From: Robert.Grosschopff at suse.com (Robert Grosschopff)
Date: Wed, 8 Aug 2018 12:03:24 +0000
Subject: [Deepsea-users] timeout disengage.safety
In-Reply-To: <20180808082025.mskargnlhhuoi3mt@f154.suse.de>
References: <6162379.QuZyDaMG8i@fury.home> <5FF07872-6CBF-4664-9568-49CA1CB3C620@suse.com> <70f76625abf94b95b82d95b323d86325@EX2013-MBX3.supermicro.com> <20180808082025.mskargnlhhuoi3mt@f154.suse.de>
Message-ID: <13D0FCC6-AABC-4638-84D7-A9A29188B78D@suse.com>

Hi *,

problem is solved now. It was a network issue.

public network  : 172.16.2.0/24
cluster network : 172.16.1.0/24

The admin node had a second IP, 192.168.168.220, attached to its interface. This caused deepsea to consider that IP as well, causing long delays whenever a deepsea command was executed.

'salt test.ping' used only the addresses defined by the public network, whereas 'salt-run net.ping' tried to ping from every available IP that is part of the admin node. I suspect that the other deepsea commands do not limit themselves to IPs that are part of the public network only.

After removing that left-over IP all commands were _much_ faster.

Thanks
Robert
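
The cause Robert reports in the last message is an address on the admin node that belongs to neither the public nor the cluster network. A minimal sketch of that check, using only Python's standard ipaddress module, could look like the following. The two networks and 192.168.168.220 are taken from the thread; the two 172.16.x.10 host addresses and the function name stray_addresses are hypothetical, and nothing here reflects how DeepSea itself selects addresses.

```python
import ipaddress


def stray_addresses(addresses, networks):
    """Return the addresses that fall outside every declared network."""
    nets = [ipaddress.ip_network(n) for n in networks]
    return [
        addr
        for addr in addresses
        if not any(ipaddress.ip_address(addr) in net for net in nets)
    ]


# Networks as reported in the thread: public first, then cluster.
NETWORKS = ["172.16.2.0/24", "172.16.1.0/24"]

# 192.168.168.220 is the left-over IP from the thread. The two
# 172.16.x.10 entries are hypothetical placeholders for the admin
# node's regular addresses, which the thread does not state.
ADMIN_ADDRESSES = ["172.16.2.10", "172.16.1.10", "192.168.168.220"]

if __name__ == "__main__":
    for addr in stray_addresses(ADMIN_ADDRESSES, NETWORKS):
        print(f"{addr} is outside the public and cluster networks")
```

Fed with the addresses actually configured on a node, any address it reports is a candidate for the kind of delay described above and worth checking before adjusting the disengage.safety timeout.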