Python Forum
python re.finditer returns a null string when expecting a None result
#1
Hi

I'm currently writing a Python script that creates HTML pages from entries extracted from a Greek-to-French dictionary. The entries contain abbreviated references to authors and their works, which I want to replace with their full meaning.

I'm using re.finditer to find the Greek examples and the references I'm looking for. It returns an empty string where I expect None.

Here is an example of what is going on.

#!/usr/bin/env python3.8
# -*- coding: utf-8 -*-
#  

import re

class globl () :
	def __init__(gbl):
		gbl.filtreCitation = filtrecitation()
		gbl.corps = ""
		gbl.auteur = None
		gbl.refauteur = None
		gbl.refouvr = None

def filtrecitation () :
#	filtre=r"((?P<citation>([Ͱ-Ͽἀ-῾]+([',;:]? ?)?)+)|(?P<refauteur>([A-ZÀÁÉÈÆŒ]+[.]( [A-ZÀÁÉÈÆŒ]{2,}[.])?))(\s+?P<refouvr>([A-ZÀÁÉÆÈŒ][a-zàáéèæœ]*[.]){1,2}(\s+[A-ZÀÁÉÆÈŒ][a-zàáéèæœ]+[.])?)?)"
	filtre=r"""((?P<citation>(
								\s*[Ͱ-Ͽἀ-῾]+			# test for a word made of Greek letters
								([',;:]?)?		# optionally ended by one of these characters and perhaps by spaces
								)+					# repeated to form
								)					# a sentence
								|					# or a reference; every reference is followed by <.>
								(?P<refauteur>		# in one or more parts
								([A-Z][.]\s+[A-ZÀÁÉÈÆŒ]{2,}[.]) 	# a two-part reference: one letter, then at least 2 letters
								|					# or a one-part reference
								([A-ZÀÁÉÈÆŒ]+[.])?
								)\s*				# the author reference may be absent
													# followed by the reference to a work
								(?P<refouvr>(
								[A-Z][.]\s*[A-Z][.]|[A-Z][a-zàáéèæœ]+[.])	# a capital followed by a capital or by several lowercase letters
								)?)
								"""
	return re.compile(filtre,re.X)	

def balisePara (ligne):
	print(ligne)
	for m in re.finditer(gbl.filtreCitation,ligne) :
		print(m,m.groupdict())
	return()


def yml_vocab(leTexte):

	for ligne in leTexte.split('\n') :
		paraHtml=balisePara(ligne)		
	return()


grec="""τελέω-ῶ (Impf. ἐτέλουν, f. τελέσω, att. τελῶ ; ao. ἐτέλεσα, pf. τετέλεκα ; pass. f. τελεσθήσομαι, ao. ἐτελέσθην, pf. τετέλεσμαι) ;
f. PLAT. Crat. 384 a, Rsp. 378 b, etc. ; XÉN. DÉM. etc.
"""
	
def principale() :
	global gbl
	gbl=globl()
	yml_vocab(grec)
	return
	
	
if __name__ == '__main__':

	principale()
	
Here are the first few lines of the output of this example script:

Output:
arbiel@arbiel-NJ5x-NJ7xLU:~$ python '/home/arbiel/Bureau/Grec/étude_grec/ap_test_filtrecitation.py'
τελέω-ῶ (Impf. ἐτέλουν, f. τελέσω, att. τελῶ ; ao. ἐτέλεσα, pf. τετέλεκα ; pass. f. τελεσθήσομαι, ao. ἐτελέσθην, pf. τετέλεσμαι) ;
<re.Match object; span=(0, 5), match='τελέω'> {'citation': 'τελέω', 'refauteur': None, 'refouvr': None}
<re.Match object; span=(5, 5), match=''> {'citation': None, 'refauteur': '', 'refouvr': None}
<re.Match object; span=(6, 7), match='ῶ'> {'citation': 'ῶ', 'refauteur': None, 'refouvr': None}
<re.Match object; span=(7, 8), match=' '> {'citation': None, 'refauteur': '', 'refouvr': None}
<re.Match object; span=(8, 8), match=''> {'citation': None, 'refauteur': '', 'refouvr': None}
I do not understand.

Arbiel
Gribouillis write Jan-07-2026, 07:36 PM:
fixed BBcode
using Ubuntu 22.04.5 LTS, Python 3.10.12
having substituted «https://www.lilo.org/fr/» to google, «https://protonmail.com/» to any other unsafe mail service and bépo to azerty (french keyboard layouts)
Reply
#2
Hi?

Independently of the current implementation, which citations, authors and references should the program detect in the text
"""τελέω-ῶ (Impf. ἐτέλουν, f. τελέσω, att. τελῶ ; ao. ἐτέλεσα, pf. τετέλεκα ; pass. f. τελεσθήσομαι, ao. ἐτελέσθην, pf. τετέλεσμαι) ;
f. PLAT. Crat. 384 a, Rsp. 378 b, etc. ; XÉN. DÉM. etc.
"""
?

Also what don't you understand exactly in the program's current output?
« We can solve any problem by introducing an extra level of indirection »
Reply
#3
Thank you for the Information
Reply
#4
The text contains standard classical Greek lexicon references. The program should detect authors such as Plato (PLAT.), Xenophon (XÉN.), and Demosthenes (DÉM.), along with works like Plato's Cratylus (Crat.) and Republic (Rsp.). It should also recognize citation locations such as 384 a and 378 b, which are Stephanus references. Grammatical abbreviations (e.g., Impf., ao., pf., pass., etc.) are not citations and should be ignored.
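As a rough illustration of the Stephanus locations mentioned above, here is a minimal sketch. It assumes a location is written as a page number, an optional space, and a section letter from a to e; the name STEPHANUS is only illustrative and does not come from the original script.

```python
import re

# Assumed shape of a Stephanus location: page number, optional space,
# then a section letter a-e (e.g. "384 a" or "378b").
STEPHANUS = re.compile(r"\b(?P<page>\d+)\s?(?P<section>[a-e])\b")

line = "f. PLAT. Crat. 384 a, Rsp. 378 b, etc. ; XÉN. DÉM. etc."

# Collect (page, section) pairs in order of appearance.
locations = [(m["page"], m["section"]) for m in STEPHANUS.finditer(line)]
print(locations)
```

Real dictionary entries may use other location formats (line numbers, book and chapter), which this sketch does not try to cover.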
Reply
#5
Hi Gribouillis

I beg your pardon, I was a little short of time when I posted my message.

As stellacaroline5 points out, the script is intended to recognise references to authors, such as PLAT. for Platon, XÉN. for Xénophon and DÉM. for Démosthène. It must also recognise Greek words. This is supposed to be done by the regular expression:

    filtre=r"""((?P<citation>(
                                \s*[Ͱ-Ͽἀ-῾]+            # test for a word made of Greek letters
                                ([',;:]?)?      # optionally ended by one of these characters and perhaps by spaces
                                )+                  # repeated to form
                                )                   # a sentence
                                |                   # or a reference; every reference is followed by <.>
                                (?P<refauteur>        # in one or more parts
                                ([A-Z][.]\s+[A-ZÀÁÉÈÆŒ]{2,}[.])     # a two-part reference: one letter, then at least 2 letters
                                |                   # or a one-part reference
                                ([A-ZÀÁÉÈÆŒ]+[.])?
                                )\s*                # the author reference may be absent
                                                    # followed by the reference to a work
                                (?P<refouvr>(
                                [A-Z][.]\s*[A-Z][.]|[A-Z][a-zàáéèæœ]+[.])   # a capital followed by a capital or by several lowercase letters
                                )?)
                                """
This filter is applied to text containing a succession of Greek words, matched by "?P<citation>(\s*[Ͱ-Ͽἀ-῾]+([',;:]?)?)+", and abbreviations of Greek authors, matched by " ?P<refauteur>([A-Z][.]\s+[A-ZÀÁÉÈÆŒ]{2,}[.])|([A-ZÀÁÉÈÆŒ]+[.])? )\s*", with a possible reference to a work of that author. An author abbreviation comes either in two parts (a single Latin capital letter, a dot, then at least two Latin capital letters and a dot) or in one part (one or more Latin capital letters followed by a dot).

The result of applying the filter is a dictionary whose keys are "citation", "refauteur" and "refouvr". When the analysed text does not match a part of the filter, I expect None for that key, such as
Output:
<re.Match object; span=(6, 7), match='ῶ'> {'citation': 'ῶ', 'refauteur': None, 'refouvr': None}
So I don't understand why, when the analysed text is either a "-" or a space, I get outputs such as:
Output:
<re.Match object; span=(5, 5), match=''> {'citation': None, 'refauteur': '', 'refouvr': None}
<re.Match object; span=(7, 8), match=' '> {'citation': None, 'refauteur': '', 'refouvr': None}
<re.Match object; span=(8, 8), match=''> {'citation': None, 'refauteur': '', 'refouvr': None}
where refauteur is an empty string and not None, as citation and refouvr are.

I don't see what is wrong in my regular expression.

However, I'm going to add "-" to the set of allowed Greek characters. In "τελέω-ῶ", the "-" indicates that the actual verb "τελῶ" results from the contraction of έω into ῶ. The issue remains for single spaces, though.
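For the "-" part, one possible sketch is to put the hyphen last in the character class, where it is read as a literal character rather than as a range operator. The name GREEK_WORD is an assumption made for this example only.

```python
import re

# The trailing "-" is a literal hyphen because it is the last character
# of the class; the other two "-" signs still define ranges.
GREEK_WORD = re.compile(r"[Ͱ-Ͽἀ-῾-]+")

words = GREEK_WORD.findall("τελέω-ῶ (Impf. ἐτέλουν, f. τελέσω)")
print(words)
```

Note that this class would also accept a stand-alone "-" as a word, so a stricter pattern may be needed if hyphens appear elsewhere in the entries.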

Arbiel
Reply
#6
(Jan-08-2026, 03:56 PM)arbiel Wrote: I don't see what is wrong in my regular expression.
To summarize the problem as I understand it, let's call S a Greek sentence, A an author and R a book reference. Your regular expression means S|A?R?, that is to say a Greek sentence, or an optional author followed by an optional reference. The problem is that the group A?R? can match the empty string at many positions. I suggest replacing it with a pattern such as S|AR?|R, which eliminates all empty matches. The only drawback is that you would have two 'refouvr' groups instead of one.
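To make the difference between '' and None concrete, here is a minimal sketch with toy patterns (they are illustrative assumptions, not the original filter). A group that is itself optional, like refouvr, is skipped and reports None. A group whose content is optional, like refauteur, always takes part in the match and reports an empty string.

```python
import re

text = "-"

# "?" outside the group: the group may be skipped entirely -> None.
outside = re.match(r"(?P<ref>[A-Z]+[.])?", text)

# "?" inside the group: the group always participates, possibly empty -> ''.
inside = re.match(r"(?P<ref>(?:[A-Z]+[.])?)", text)

print(outside.group("ref"))       # None
print(repr(inside.group("ref")))  # ''

# A pattern that can match nothing produces an empty match at every position.
spans = [m.span() for m in re.finditer(r"(?:[A-Z]+[.])?", "a-b")]
print(spans)                      # [(0, 0), (1, 1), (2, 2), (3, 3)]
```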

Here is a modified implementation
#!/usr/bin/env python3.8
# -*- coding: utf-8 -*-
#

import re

class globl () :
    def __init__(gbl):
        gbl.filtreCitation = filtrecitation()
        gbl.corps = ""
        gbl.auteur = None
        gbl.refauteur = None
        gbl.refouvr = None

def filtrecitation () :

    sentence = r"(?:(?:\s*[Ͱ-Ͽἀ-῾-]+[',;:]?)+)"
    author = r"(?:(?:[A-Z][.]\s+[A-ZÀÁÉÈÆŒ]{2,}[.])|(?:[A-ZÀÁÉÈÆŒ]+[.])\s*)"
    refouvr = r"(?:[A-Z][.]\s*[A-Z][.]|[A-Z][a-zàáéèæœ]+[.])"

    filtre = f"(?P<citation>{sentence})|(?P<refauteur>{author})(?P<refouvr1>{refouvr})?|(?P<refouvr2>{refouvr})"
    return re.compile(filtre,re.X)

def balisePara (ligne):
    print(ligne)
    for m in re.finditer(gbl.filtreCitation,ligne) :
        print(m,m.groupdict())
    return()


def yml_vocab(leTexte):

    for ligne in leTexte.split('\n') :
        paraHtml=balisePara(ligne)
    return()


grec="""τελέω-ῶ (Impf. ἐτέλουν, f. τελέσω, att. τελῶ ; ao. ἐτέλεσα, pf. τετέλεκα ; pass. f. τελεσθήσομαι, ao. ἐτελέσθην, pf. τετέλεσμαι) ;
f. PLAT. Crat. 384 a, Rsp. 378 b, etc. ; XÉN. DÉM. etc.
"""

def principale() :
    global gbl
    gbl=globl()
    yml_vocab(grec)
    return


if __name__ == '__main__':

    principale()
And the result
Output:
λ python paillasse/pf/arbiel.py
τελέω-ῶ (Impf. ἐτέλουν, f. τελέσω, att. τελῶ ; ao. ἐτέλεσα, pf. τετέλεκα ; pass. f. τελεσθήσομαι, ao. ἐτελέσθην, pf. τετέλεσμαι) ;
<re.Match object; span=(0, 7), match='τελέω-ῶ'> {'citation': 'τελέω-ῶ', 'refauteur': None, 'refouvr1': None, 'refouvr2': None}
<re.Match object; span=(9, 14), match='Impf.'> {'citation': None, 'refauteur': None, 'refouvr1': None, 'refouvr2': 'Impf.'}
<re.Match object; span=(14, 23), match=' ἐτέλουν,'> {'citation': ' ἐτέλουν,', 'refauteur': None, 'refouvr1': None, 'refouvr2': None}
<re.Match object; span=(26, 34), match=' τελέσω,'> {'citation': ' τελέσω,', 'refauteur': None, 'refouvr1': None, 'refouvr2': None}
<re.Match object; span=(39, 44), match=' τελῶ'> {'citation': ' τελῶ', 'refauteur': None, 'refouvr1': None, 'refouvr2': None}
<re.Match object; span=(50, 59), match=' ἐτέλεσα,'> {'citation': ' ἐτέλεσα,', 'refauteur': None, 'refouvr1': None, 'refouvr2': None}
<re.Match object; span=(63, 72), match=' τετέλεκα'> {'citation': ' τετέλεκα', 'refauteur': None, 'refouvr1': None, 'refouvr2': None}
<re.Match object; span=(83, 97), match=' τελεσθήσομαι,'> {'citation': ' τελεσθήσομαι,', 'refauteur': None, 'refouvr1': None, 'refouvr2': None}
<re.Match object; span=(101, 112), match=' ἐτελέσθην,'> {'citation': ' ἐτελέσθην,', 'refauteur': None, 'refouvr1': None, 'refouvr2': None}
<re.Match object; span=(116, 127), match=' τετέλεσμαι'> {'citation': ' τετέλεσμαι', 'refauteur': None, 'refouvr1': None, 'refouvr2': None}
f. PLAT. Crat. 384 a, Rsp. 378 b, etc. ; XÉN. DÉM. etc.
<re.Match object; span=(3, 14), match='PLAT. Crat.'> {'citation': None, 'refauteur': 'PLAT. ', 'refouvr1': 'Crat.', 'refouvr2': None}
<re.Match object; span=(22, 26), match='Rsp.'> {'citation': None, 'refauteur': None, 'refouvr1': None, 'refouvr2': 'Rsp.'}
<re.Match object; span=(41, 46), match='XÉN. '> {'citation': None, 'refauteur': 'XÉN. ', 'refouvr1': None, 'refouvr2': None}
<re.Match object; span=(46, 51), match='DÉM. '> {'citation': None, 'refauteur': 'DÉM. ', 'refouvr1': None, 'refouvr2': None}
Reply
#7
Great.

However, I will replace S|AR?|R with S|A|R, as I don't mind separating the A from the R. This will spare me from having both refouvr1 and refouvr2.

I will also add F (S|A|R|F), for French sentences, to avoid getting one match per single Latin character.
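A possible shape for the S|A|R|F filter is sketched below. All four sub-patterns are assumptions adapted from the patterns in this thread, and the French pattern F in particular is hypothetical; none of this is a tested grammar for the dictionary. Each alternative needs at least one character, so no empty match can occur, and m.lastgroup tells which alternative matched.

```python
import re

# Sub-patterns adapted from the thread; F is a hypothetical French-word pattern.
GREEK = r"[Ͱ-Ͽἀ-῾-]+"
S = rf"{GREEK}(?:[',;:]?\s+{GREEK})*"                     # Greek sentence
A = r"[A-Z][.]\s+[A-ZÀÁÉÈÆŒ]{2,}[.]|[A-ZÀÁÉÈÆŒ]+[.]"     # author
R = r"[A-Z][.]\s*[A-Z][.]|[A-Z][a-zàáéèæœ]+[.]"           # work
F = r"[a-zàâçéèêëîïôùûæœ]+[.]?"                           # French word

TOKEN = re.compile(rf"(?P<S>{S})|(?P<A>{A})|(?P<R>{R})|(?P<F>{F})")

def tokenize(line):
    """Return (kind, text) pairs, where kind is 'S', 'A', 'R' or 'F'."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(line)]

line = "f. PLAT. Crat. 384 a, Rsp. 378 b, etc. ; XÉN. DÉM. etc."
tokens = tokenize(line)

# Keep only author and work references.
refs = [(kind, text) for kind, text in tokens if kind in ("A", "R")]
print(refs)
```

Grammatical abbreviations that start with a capital, such as Impf., would still be reported as R by this sketch, so a list of known abbreviations is probably needed on top of it.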

I did not know the "f" prefix (f-strings); I understand that it replaces the variables enclosed in curly braces with their values.
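One detail worth checking when building regular expressions with f-strings: a quantifier such as {2,} written directly inside the f-string must have its braces doubled, whereas braces inside a substituted value are copied as they are. A minimal sketch, with variable names chosen only for this example:

```python
import re

capital = r"[A-ZÀÁÉÈÆŒ]"

# Literal braces written in the f-string itself must be doubled:
# {{2,}} becomes {2,} in the resulting pattern.
author = rf"{capital}{{2,}}[.]"
print(author)   # [A-ZÀÁÉÈÆŒ]{2,}[.]

# Braces that come from a substituted value need no doubling.
quantified = r"[A-Z]{2,}"
pattern = rf"(?P<refauteur>{quantified}[.])"
print(pattern)  # (?P<refauteur>[A-Z]{2,}[.])

print(bool(re.fullmatch(author, "PLAT.")))  # True
```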

Thanks a lot Gribouillis

Have a nice day.

Arbiel
Reply
#8
If you use the third-party regex module, it supports Unicode properties such as \p{Greek}, which makes this easier.
Then you can tokenize more cleanly:
import regex as re

TOK = re.compile(r"""
(?P<GREEK>\p{Greek}+)
|(?P<AUTHOR>(?:[A-Z]\.\s+[A-ZÀÁÉÈÆŒ]{2,}\.|[A-ZÀÁÉÈÆŒ]+\.) )
|(?P<WORK>(?:[A-Z]\.\s*[A-Z]\.|[A-Z][a-zàáéèæœ]+\.) )
|(?P<SPACE>\s+)
|(?P<OTHER>.)
""", re.X)

author_map = {"PLAT.": "Platon", "XÉN.": "Xénophon"}  # etc.

text = """τελέω-ῶ (Impf. ἐτέλουν, f. τελέσω, att. τελῶ ; ao. ἐτέλεσα, pf. τετέλεκα ; pass. f. τελεσθήσομαι, ao. ἐτελέσθην, pf. τετέλεσμαι) ;
f. PLAT. Crat. 384 a, Rsp. 378 b, etc. ; XÉN. DÉM. etc.
"""

out = []
for m in TOK.finditer(text):
    kind = m.lastgroup
    val = m.group()
    if kind == "AUTHOR":
        out.append(author_map.get(val.strip(), val))
    else:
        out.append(val)
result = "".join(out)
print(result)
Output:
τελέω-ῶ (Impf. ἐτέλουν, f. τελέσω, att. τελῶ ; ao. ἐτέλεσα, pf. τετέλεκα ; pass. f. τελεσθήσομαι, ao. ἐτελέσθην, pf. τετέλεσμαι) ;
f. Platon Crat. 384 a, Rsp. 378 b, etc. ; Xénophon DÉM. etc.
Reply
#9
You don't need a regex to replace the author abbreviations with their full names:

text = """τελέω-ῶ (Impf. ἐτέλουν, f. τελέσω, att. τελῶ ; ao. ἐτέλεσα, pf. τετέλεκα; pass. f. τελεσθήσομαι, ao. ἐτελέσθην, pf. τετέλεσμαι) ;
f. PLAT. Crat. 384 a, Rsp. 378 b, etc.; XÉN.  Vat.  385 c, Rsq.  379 d, etc.; DÉM.  Bat.  386 d, Rsq.  380 e; 
"""
text_list = text.split()
author_map = {"PLAT.": "Platon", "XÉN.": "Xénophon", "DÉM.": "Démosthene"}
for i in range(len(text_list)):    
    for key in author_map.keys():
        if  key == text_list[i]:
            text_list[i] = author_map[key]

res = ' '.join(text_list)
Gives res:

'τελέω-ῶ (Impf. ἐτέλουν, f. τελέσω, att. τελῶ ; ao. ἐτέλεσα, pf. τετέλεκα; pass. f. τελεσθήσομαι, ao. ἐτελέσθην, pf. τετέλεσμαι) ; f. Platon Crat. 384 a, Rsp. 378 b, etc.; Xénophon Vat. 385 c, Rsq. 379 d, etc.; Démosthene Bat. 386 d, Rsq. 380 e;'
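The same lookup can probably be written more compactly with dict.get, which returns the token unchanged when it is not a known abbreviation. This is only a sketch of the same idea on a shortened sample line; like the loop above, it relies on split(), so line breaks and repeated spaces in the original text are collapsed into single spaces.

```python
author_map = {"PLAT.": "Platon", "XÉN.": "Xénophon", "DÉM.": "Démosthene"}
line = "f. PLAT. Crat. 384 a, Rsp. 378 b, etc.; XÉN. DÉM. etc."

# dict.get(token, token) yields the full name for a known abbreviation
# and the token itself otherwise.
res = " ".join(author_map.get(token, token) for token in line.split())
print(res)
```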
If I knew exactly what a citation should look like, we could find that too! How do we know where a citation starts and ends?

How to identify a citation?
Reply

