From: "David Minor" <david-m@XXXXXXXXXXXX>
To: mpifrm-mpi-comments@XXXXXXXXXXXXXXXXXXXXXXX
Date: Mon, 3 Dec 2007 10:46:57 +0200
Subject: ambiguity with MPI_Cancel when used with MPI_THREAD_MULTIPLE

Hello MPI'ers,

The specification of MPI_Cancel does not say whether cancelling an
already completed request is an error. Some MPI implementations treat it
as an error and some do not. The problem arises when trying to cancel
multiple receive requests. Any loop that goes over every request and
calls cancel can run into the situation where a request is completed
(from another thread) before cancel is called on it. If cancelling a
completed request is treated as an error, this results in an abort under
the default error handler. If another error handler is substituted, it
should be called and an error returned from MPI_Cancel (if indeed
calling cancel on a completed request is an error). The C++ binding for
Cancel() returns void, so there is no return code to check, which is
also a problem. Any attempt to test for completion before calling cancel
can also fail, because the completion can happen between the call that
tests for completion and the actual cancel operation.

I recommend adding an MPI_CancelAll() function that would take an array
of requests and not consider it an error to cancel a completed request.
I've included below postings from the MPICH forum relating to this, as
well as some example code where this problem can occur. The sample code
will cause an abort under MPICH2 but not under Intel MPI, for example.

Regards,
David Minor
Orbotech

<<

Ah. That is an interesting case, but as you note, it violates the
standard. Since the MPI 2.1 process is getting started, it might be best
to raise the issue there; we can try prototyping solutions (such as an
MPIX_WaitallWithCancel) in MPICH2.

Bill

On Mar 13, 2007, at 2:15 AM, David Minor wrote:

> Hi Bill,
> The situation is this. A process issues a set of Irecv commands and
> then saves the requests. It starts a thread that does a WaitAll on
> those requests. Now how can it cancel the transaction before the
> WaitAll has completed? If it goes through the list of requests and
> cancels each one, it's in danger of cancelling an already completed
> request. If it tests each one first, the request could complete
> between the Test() and the Cancel(). The user cannot manage a mutex
> over this because he has no access to the underlying mutex that allows
> messages to complete (mutexes aren't composable!). It seems to me
> there is a problem here in the standard. What is really needed is a
> CancelAll() command which would mutex the completions.
> Barring that, I'm not sure what a possible solution is. I admit my
> solution violates the standard because it allows for Cancel() on a
> completed request, but it also allows my application to work, which is
> necessary. :-) I'm preparing a comprehensive test of all these
> problems between WaitAll, Test and Cancel that I'll post as soon as
> it's done.
> Regards,
> David
>
> -----Original Message-----
> From: William Gropp [mailto:gropp@XXXXXXXXXXXX
> Sent: Thursday, February 22, 2007 10:58 PM
> To: David Minor
> Cc: mpich2-maint@XXXXXXXXXXX
> Subject: Re: [MPICH2 Req #3217] [MPICH] Ooops... forgot to include
> cancel.c in previous post...
>
>>
>> The current version of MPICH2 has a race condition. If you try to
>> cancel a set of outstanding receive requests, it's possible that one
>> of them will complete in the middle of the cancelling. Cancelling a
>> completed request results in an abort-level failure. Checking for
>> completion before cancelling doesn't help, because the request could
>> have completed between the time you checked and the time you cancel.
>> It seems the standard didn't really think about this problem,
>> otherwise it would have added a cancelAll operation that would work
>> on a set of requests and be able to do the cancellation inside an
>> internal mutex. I've done a patch on cancel.c that corrects this by
>> not generating an error on cancelling an already completed request.
>> Does the standard allow this (in letter if not in spirit)? Enclosed
>> is my fix. Search for "dminor" in the file to see the patch. Let's
>> re-open the discussion and fix this problem in the next release.
>>
>

Attachment: CancelBug.cpp

#include <mpi.h>
#include <iostream>
#include <pthread.h>

using namespace std;

#define gComm MPI::COMM_WORLD

struct ThreadInfo {
    MPI::Request *req;
    int          num_requests;
};

volatile bool gSomeArrived = false;


extern "C" void* ListenerFunction(void* data)
{
    ThreadInfo* info = (ThreadInfo*)data;
    printf("Waiting for %d requests to complete.\n", info->num_requests);
    //wait for one to complete
    MPI::Status status;
    int completed = MPI::Request::Waitany(info->num_requests, info->req, status);

    //re-copy requests to be contiguous
    MPI::Request new_requests[info->num_requests - 1];
    int index = 0;
    for (int i=0; i < info->num_requests; i++)
    {
        if (i == completed)
            continue;
        new_requests[index++] = info->req[i];
    }

    printf("request %d completed.\n", completed);
    //now we wait for the rest while we're cancelling

    MPI::Request::Waitall(info->num_requests - 1, new_requests);
    printf("finished waiting for %d requests\n", info->num_requests);
    return 0;
}


int main(int argc, char* argv[])
{
    MPI::Init_thread(MPI_THREAD_MULTIPLE);

    //initialize threads
    ::pthread_attr_t        thread_attr;
    ::sched_param           scheduler_param;
    ::pthread_attr_init(&thread_attr);
    ::pthread_attr_getschedparam(&thread_attr, &scheduler_param);
    ::pthread_attr_setscope(&thread_attr, PTHREAD_SCOPE_SYSTEM);
    int OldType;
    ::pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS,&OldType);

    unsigned int sleep_time;

    int num_requests;
    if (argc > 1) {
        num_requests = atoi(argv[1]);
    }
    else {
        num_requests = 10;
    }
    if (argc > 2) {
        sleep_time = atoi(argv[2]);
    }
    else {
        sleep_time = 100;
    }

    int rank = gComm.Get_rank();
    int ttl_requests = num_requests * (gComm.Get_size() - 1);
    int answers[ttl_requests];

    //printf("ttl_requests=%d num_requests=%d\n", ttl_requests, num_requests);

    if (rank == 0) {
        //on master we start a bunch of listening requests
        MPI::Request   requests[ttl_requests];


        pthread_t   listener_id;
        ThreadInfo  listener_info;
        listener_info.req = requests;
        listener_info.num_requests = ttl_requests;
        //create requests
        for (int i=0; i < ttl_requests; i++)
        {
            answers[i] = 0; //clear
            requests[i] = gComm.Irecv(&answers[i], 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG);
        }
        //start listening for reply
        ::pthread_create(&listener_id, &thread_attr, ListenerFunction, &listener_info);

        //set handler just for this call so it won't abort if
        //a request has completed
        gComm.Set_errhandler(MPI::ERRORS_RETURN);
        for (int i = 0; i < ttl_requests; i++)
        {
            requests[i].Cancel();
        }
        //return to desired behavior
        gComm.Set_errhandler(MPI::ERRORS_ARE_FATAL);

        ::pthread_join(listener_id, NULL);
        printf("request complete\n");

    }
    else {
        //slaves just send number of requested messages to master
        int offset = num_requests * (gComm.Get_rank() - 1);

        int answers[num_requests];
        MPI::Request requests[num_requests];

        //gComm.Barrier();

        for (int i=0; i < num_requests; i++)
        {
            answers[i] = i + offset;
            requests[i] = gComm.Isend(&answers[i], 1, MPI_INT, 0, 0);
        }
        printf("rank %d waiting on %d sends to complete.\n", rank, num_requests);
        MPI::Request::Waitall(num_requests, requests);
        printf("sent answers %d through %d.\n", offset, offset+num_requests);
    }

    gComm.Barrier();


    MPI::Finalize();
   // cout << "finalized MPI" << endl;

    return 0;
}
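
The check-then-cancel race discussed in this thread can be modelled
without MPI. What follows is only a rough sketch under stated
assumptions: MockRequest, strict_cancel, tolerant_cancel and cancel_all
are hypothetical names invented for illustration, not MPI or MPICH2
calls, and the per-request mutex merely stands in for whatever internal
lock a real implementation uses when it completes a message. It shows
why a user-level test followed by a strict cancel can still fail, and
how a cancel that makes its decision under the same lock as completion
might avoid that.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a receive request; NOT an MPI type.
enum class State { Pending, Completed, Cancelled };

class MockRequest {
public:
    // Models the library completing the request when a message arrives.
    void complete() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (state_ == State::Pending) state_ = State::Completed;
    }

    // Models a user-level completion test. The answer may be stale as
    // soon as the lock is released.
    bool is_complete() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return state_ == State::Completed;
    }

    // Models an implementation that treats cancelling a completed
    // request as an error: returns false in that case.
    bool strict_cancel() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (state_ == State::Completed) return false;
        state_ = State::Cancelled;
        return true;
    }

    // Assumed alternative: the check and the cancel happen under the
    // same lock that complete() takes, and a completed request is
    // silently left alone. Returns true only if it actually cancelled.
    bool tolerant_cancel() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (state_ != State::Pending) return false;
        state_ = State::Cancelled;
        return true;
    }

    State state() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return state_;
    }

private:
    mutable std::mutex mutex_;
    State state_ = State::Pending;
};

// Hypothetical CancelAll-style helper: cancels every request that is
// still pending, skips completed ones, and returns how many it cancelled.
std::size_t cancel_all(std::vector<MockRequest>& requests) {
    std::size_t cancelled = 0;
    for (MockRequest& r : requests) {
        if (r.tolerant_cancel()) ++cancelled;
    }
    assert(cancelled <= requests.size());
    return cancelled;
}
```

In this model the user-level sequence "is_complete(), then
strict_cancel()" is two separate critical sections, so a complete()
from another thread can land between them; tolerant_cancel() is a single
critical section, so it cannot be interleaved that way. Note that
cancel_all here is atomic per request, not over the whole array, which
is enough for each cancel-versus-complete decision to be consistent.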