Message 313797 - Python tracker

Author	terry.reedy
Recipients	cheryl.sabella, ezio.melotti, serhiy.storchaka, steve, terry.reedy, vstinner
Date	2018-03-14.01:08:52
Message-id	<1520989738.02.0.467229070634.issue32987@psf.upfronthosting.co.za>

Content

I think the issues are slightly different.  #12486 is about the awkwardness of the API.  This is about a false error after jumping through the hoops, which I think Steve B did correctly.

Following the link, the Other_ID_Continue chars are

00B7          ; Other_ID_Continue # Po       MIDDLE DOT
0387          ; Other_ID_Continue # Po       GREEK ANO TELEIA
1369..1371    ; Other_ID_Continue # No   [9] ETHIOPIC DIGIT ONE..ETHIOPIC DIGIT NINE
19DA          ; Other_ID_Continue # No       NEW TAI LUE THAM DIGIT ONE

# Total code points: 12

The 2 Po chars fail; the No chars work.  After looking at the tokenize module, I believe the problem is that the regular expression for Name is r'\w+', and the Po chars are not treated as \w word characters.

>>> import re
>>> r = re.compile(r'\w+', re.U)
>>> re.match(r, 'ab\u0387cd')
<re.Match object; span=(0, 2), match='ab'>
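
To make the Po/No split easier to reproduce, here is a minimal sketch that runs the same \w check over all 12 Other_ID_Continue code points listed above. The names OTHER_ID_CONTINUE, WORD and describe are made up for this illustration and are not part of tokenize or re.

```python
import re
import unicodedata

# The 12 Other_ID_Continue code points quoted in this message.
OTHER_ID_CONTINUE = [0x00B7, 0x0387, *range(0x1369, 0x1372), 0x19DA]

# Single-character version of the pattern that tokenize uses for Name.
WORD = re.compile(r'\w')


def describe(cp):
    # Return (general category, whether \w matches) for one code point.
    ch = chr(cp)
    return unicodedata.category(ch), WORD.fullmatch(ch) is not None


# Code points that are allowed to continue an identifier but that \w rejects.
not_word = [cp for cp in OTHER_ID_CONTINUE if not describe(cp)[1]]

for cp in OTHER_ID_CONTINUE:
    category, is_word = describe(cp)
    print(f'U+{cp:04X} {category} matched_by_w={is_word}')
```

Only the two Po entries should end up in not_word, which is consistent with the failures reported here.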

I don't know if the bug is too narrow a definition of \w in the re module ("most characters that can be part of a word in any language, as well as numbers and the underscore") or of Name in the tokenize module.

Before patching anything, I would like to know if the 2 Po Other_ID_Continue chars are the only 2 not matched by \w.  Unless someone has done so already, at least a sample of chars from each category included in the definition of 'identifier' should be tested.
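
One way to run that test exhaustively rather than by sample is sketched below. It assumes that str.isidentifier() is an acceptable stand-in for the language definition of an identifier, and it only checks the continuation position by prefixing each character with 'a'. The function name identifier_chars_missed_by_word is hypothetical.

```python
import re
import sys
import unicodedata
from collections import Counter

WORD = re.compile(r'\w')


def identifier_chars_missed_by_word():
    # Collect code points that may continue an identifier
    # (per str.isidentifier) but that \w does not match.
    missed = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if ('a' + ch).isidentifier() and WORD.fullmatch(ch) is None:
            missed.append(cp)
    return missed


missed = identifier_chars_missed_by_word()

# Summarize by Unicode general category to see which classes are affected.
by_category = Counter(unicodedata.category(chr(cp)) for cp in missed)
print(len(missed), dict(by_category))
```

The counts depend on the Unicode database bundled with the running interpreter, so they may differ between Python versions.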
