python re.finditer returns a null string when expecting a None result

arbiel · (This post was last modified: Jan-07-2026, 07:36 PM by Gribouillis.)

Hi

I'm currently writing a Python script to create HTML pages from entries extracted from a Greek-to-French dictionary. Those entries contain abbreviated references to authors and their works, which I want to replace with their full forms.

I'm using re.finditer to find the Greek examples and the references I'm looking for. It returns an empty string where I'm expecting None.

Here is an example of what is going on.

#!/usr/bin/env python3.8
# -*- coding: utf-8 -*-
#  

import sys
import os
from PyQt5 import QtWidgets
from PyQt5.QtWidgets import QApplication
import re

class globl () :
	def __init__(gbl):
		gbl.filtreCitation = filtrecitation()
		gbl.corps = ""
		gbl.auteur = None
		gbl.refauteur = None
		gbl.refouvr = None

def filtrecitation () :
#	filtre=r"((?P<citation>([Ͱ-Ͽἀ-῾]+([',;:]? ?)?)+)|(?P<refauteur>([A-ZÀÁÉÈÆŒ]+[.]( [A-ZÀÁÉÈÆŒ]{2,}[.])?))(\s+?P<refouvr>([A-ZÀÁÉÆÈŒ][a-zàáéèæœ]*[.]){1,2}(\s+[A-ZÀÁÉÆÈŒ][a-zàáéèæœ]+[.])?)?)"
	filtre=r"""((?P<citation>(
								\s*[Ͱ-Ͽἀ-῾]+			# match a word made of Greek letters
								([',;:]?)?		# optionally ending with one of these characters and possibly spaces
								)+					# repeated to form
								)					# a sentence
								|					# or a reference; every reference is followed by <.>
								(?P<refauteur>		# in one or more parts
								([A-Z][.]\s+[A-ZÀÁÉÈÆŒ]{2,}[.]) 	# a two-part reference: one letter, then at least 2 letters
								|					# or a one-part reference
								([A-ZÀÁÉÈÆŒ]+[.])?
								)\s*				# the author reference may be absent
													# followed by the reference to a work
								(?P<refouvr>(
								[A-Z][.]\s*[A-Z][.]|[A-Z][a-zàáéèæœ]+[.])	# a capital followed by a capital or by several lowercase letters
								)?)
								"""
	return re.compile(filtre,re.X)	

def balisePara (ligne):
	print(ligne)
	for m in re.finditer(gbl.filtreCitation,ligne) :
		print(m,m.groupdict())
	return()


def yml_vocab(leTexte):

	for ligne in leTexte.split('\n') :
		paraHtml=balisePara(ligne)		
	return()


grec="""τελέω-ῶ (Impf. ἐτέλουν, f. τελέσω, att. τελῶ ; ao. ἐτέλεσα, pf. τετέλεκα ; pass. f. τελεσθήσομαι, ao. ἐτελέσθην, pf. τετέλεσμαι) ;
f. PLAT. Crat. 384 a, Rsp. 378 b, etc. ; XÉN. DÉM. etc.
"""
	
def principale() :
	global gbl
	gbl=globl()
	yml_vocab(grec)
	return
	
	
if __name__ == '__main__':

	principale()

Here are the first few lines of this example script's output:

Output:
arbiel@arbiel-NJ5x-NJ7xLU:~$ python '/home/arbiel/Bureau/Grec/étude_grec/ap_test_filtrecitation.py'
τελέω-ῶ (Impf. ἐτέλουν, f. τελέσω, att. τελῶ ; ao. ἐτέλεσα, pf. τετέλεκα ; pass. f. τελεσθήσομαι, ao. ἐτελέσθην, pf. τετέλεσμαι) ;
<re.Match object; span=(0, 5), match='τελέω'> {'citation': 'τελέω', 'refauteur': None, 'refouvr': None}
<re.Match object; span=(5, 5), match=''> {'citation': None, 'refauteur': '', 'refouvr': None}
<re.Match object; span=(6, 7), match='ῶ'> {'citation': 'ῶ', 'refauteur': None, 'refouvr': None}
<re.Match object; span=(7, 8), match=' '> {'citation': None, 'refauteur': '', 'refouvr': None}
<re.Match object; span=(8, 8), match=''> {'citation': None, 'refauteur': '', 'refouvr': None}

I do not understand.

Arbiel

Gribouillis wrote Jan-07-2026, 07:36 PM:
fixed BBcode

**Gribouillis** · (This post was last modified: Jan-07-2026, 07:56 PM by Gribouillis.)

Hi?

Independently from the current implementation, what citations, authors and references should the program detect in the text

"""τελέω-ῶ (Impf. ἐτέλουν, f. τελέσω, att. τελῶ ; ao. ἐτέλεσα, pf. τετέλεκα ; pass. f. τελεσθήσομαι, ao. ἐτελέσθην, pf. τετέλεσμαι) ;
f. PLAT. Crat. 384 a, Rsp. 378 b, etc. ; XÉN. DÉM. etc.
"""

?

Also what don't you understand exactly in the program's current output?

stellacaroline5 · Jan-08-2026, 10:48 AM

Thank you for the Information

stellacaroline5 · Jan-08-2026, 10:51 AM

The text contains standard classical Greek lexicon references. The program should detect authors such as Plato (PLAT.), Xenophon (XÉN.), and Demosthenes (DÉM.), along with works such as Plato's Cratylus (Crat.) and Republic (Rsp.). It should also recognize citation locations such as 384 a and 378 b, which are Stephanus references. Grammatical abbreviations (e.g., Impf., ao., pf., pass., etc.) are not citations and should be ignored.

arbiel · Jan-08-2026, 03:56 PM

Hi Gribouillis

I beg your pardon, I was a little short of time when I posted my message.

As stellacaroline5 points out, the script is intended to recognise references to authors, such as PLAT. for Platon, XÉN. for Xénophon and DÉM. for Démosthène. It must also recognise Greek words. This is supposed to be done by the regular expression:

    filtre=r"""((?P<citation>(
                                \s*[Ͱ-Ͽἀ-῾]+            # match a word made of Greek letters
                                ([',;:]?)?      # optionally ending with one of these characters and possibly spaces
                                )+                  # repeated to form
                                )                   # a sentence
                                |                   # or a reference; every reference is followed by <.>
                                (?P<refauteur>        # in one or more parts
                                ([A-Z][.]\s+[A-ZÀÁÉÈÆŒ]{2,}[.])     # a two-part reference: one letter, then at least 2 letters
                                |                   # or a one-part reference
                                ([A-ZÀÁÉÈÆŒ]+[.])?
                                )\s*                # the author reference may be absent
                                                    # followed by the reference to a work
                                (?P<refouvr>(
                                [A-Z][.]\s*[A-Z][.]|[A-Z][a-zàáéèæœ]+[.])   # a capital followed by a capital or by several lowercase letters
                                )?)
                                """

The filter is applied to text containing a succession of Greek words, "?P<citation>(\s*[Ͱ-Ͽἀ-῾]+([',;:]?)?)+", and abbreviations of Greek authors, either in two parts (a single Latin capital letter, a dot, then two or more Latin capital letters and a dot) or in one part (several Latin capital letters followed by a dot), " ?P<refauteur>([A-Z][.]\s+[A-ZÀÁÉÈÆŒ]{2,}[.])|([A-ZÀÁÉÈÆŒ]+[.])? )\s*", possibly followed by a reference to a work by that author.

The result of applying the filter is a dictionary whose keys are "citation", "refauteur" and "refouvr". When the analysed text does not match a part of the filter, I expect None for that key, as in

Output:
<re.Match object; span=(6, 7), match='ῶ'> {'citation': 'ῶ', 'refauteur': None, 'refouvr': None}

So I don't understand why, when the analysed text is either a "-" or a space, I get outputs such as:

Output:
<re.Match object; span=(5, 5), match=''> {'citation': None, 'refauteur': '', 'refouvr': None}
<re.Match object; span=(7, 8), match=' '> {'citation': None, 'refauteur': '', 'refouvr': None}
<re.Match object; span=(8, 8), match=''> {'citation': None, 'refauteur': '', 'refouvr': None}

where refauteur is an empty string rather than None, unlike citation and refouvr.

I don't see what is wrong with my regular expression.

However, I am going to add "-" to the allowed Greek characters. In "τελέω-ῶ", the "-" indicates that the actual verb "τελῶ" results from the contraction of έω into ῶ. But the issue remains for single spaces.

Arbiel

**Gribouillis** · (This post was last modified: Jan-08-2026, 06:16 PM by Gribouillis.)

(Jan-08-2026, 03:56 PM)arbiel Wrote: I don't see what is wrong in my regular expression.

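To summarize the problem as I understand it, let's call S a Greek sentence, A an author and R a book reference. Your regular expression means S|A?R?, that is to say, a Greek sentence, or an optional author followed by an optional reference. The problem is that A?R? can match the empty string, so it produces many empty matches. In those matches the refauteur group still takes part, with zero characters, which is why it is reported as '' rather than None; refouvr is optional as a whole group, so it is skipped and stays None. I suggest replacing A?R? with a pattern such as S|AR?|R, which eliminates all empty matches. The only drawback is that you would have two 'refouvr' groups instead of one.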
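The difference between the two shapes can be seen on a very small case. The following is only a sketch with simplified stand-ins for S, A and R (they are assumptions for illustration, not the full patterns from the thread); the Greek ranges are written as escapes and the sample text is NFC-normalized so that accented letters are single code points.

```python
import re
import unicodedata

# Simplified stand-ins (assumptions for illustration):
# S = a run of Greek letters, A = capitals + dot, R = capital + lowercase + dot.
S = r"[\u0370-\u03FF\u1F00-\u1FFE]+"
A = r"[A-Z]+[.]"
R = r"[A-Z][a-z]+[.]"

# Shape of the original filter, S|A?R?: the '?' sits inside the author group,
# so the group can take part in a match while consuming nothing.
optional = re.compile(rf"(?P<citation>{S})|(?P<refauteur>(?:{A})?)\s*(?P<refouvr>{R})?")

# Suggested shape, S|AR?|R: every alternative must consume at least one character.
required = re.compile(rf"(?P<citation>{S})|(?P<refauteur>{A})\s*(?P<refouvr1>{R})?|(?P<refouvr2>{R})")

text = unicodedata.normalize("NFC", "τελῶ ; PLAT. Crat.")

# Zero-length matches produced by the first shape.
empty_matches = [m for m in optional.finditer(text) if m.group() == ""]
first_empty = empty_matches[0].groupdict()
print(first_empty)  # {'citation': None, 'refauteur': '', 'refouvr': None}

# The second shape only reports real tokens.
required_matches = [m.group() for m in required.finditer(text)]
print(required_matches)
```

A group that was entered but matched zero characters is reported as '', while a group that was never entered (an untaken alternative, or a group skipped by an outer `?`) is reported as None.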

Here is a modified implementation

#!/usr/bin/env python3.8
# -*- coding: utf-8 -*-
#

import sys
import os
from PyQt5 import QtWidgets
from PyQt5.QtWidgets import QApplication
import re

class globl () :
    def __init__(gbl):
        gbl.filtreCitation = filtrecitation()
        gbl.corps = ""
        gbl.auteur = None
        gbl.refauteur = None
        gbl.refouvr = None

def filtrecitation () :

    sentence = r"(?:(?:\s*[Ͱ-Ͽἀ-῾-]+[',;:]?)+)"
    author = r"(?:(?:[A-Z][.]\s+[A-ZÀÁÉÈÆŒ]{2,}[.])|(?:[A-ZÀÁÉÈÆŒ]+[.])\s*)"
    refouvr = r"(?:[A-Z][.]\s*[A-Z][.]|[A-Z][a-zàáéèæœ]+[.])"

    filtre = f"(?P<citation>{sentence})|(?P<refauteur>{author})(?P<refouvr1>{refouvr})?|(?P<refouvr2>{refouvr})"
    return re.compile(filtre,re.X)

def balisePara (ligne):
    print(ligne)
    for m in re.finditer(gbl.filtreCitation,ligne) :
        print(m,m.groupdict())
    return()


def yml_vocab(leTexte):

    for ligne in leTexte.split('\n') :
        paraHtml=balisePara(ligne)
    return()


grec="""τελέω-ῶ (Impf. ἐτέλουν, f. τελέσω, att. τελῶ ; ao. ἐτέλεσα, pf. τετέλεκα ; pass. f. τελεσθήσομαι, ao. ἐτελέσθην, pf. τετέλεσμαι) ;
f. PLAT. Crat. 384 a, Rsp. 378 b, etc. ; XÉN. DÉM. etc.
"""

def principale() :
    global gbl
    gbl=globl()
    yml_vocab(grec)
    return


if __name__ == '__main__':

    principale()

And the result

Output:
λ python paillasse/pf/arbiel.py
τελέω-ῶ (Impf. ἐτέλουν, f. τελέσω, att. τελῶ ; ao. ἐτέλεσα, pf. τετέλεκα ; pass. f. τελεσθήσομαι, ao. ἐτελέσθην, pf. τετέλεσμαι) ;
<re.Match object; span=(0, 7), match='τελέω-ῶ'> {'citation': 'τελέω-ῶ', 'refauteur': None, 'refouvr1': None, 'refouvr2': None}
<re.Match object; span=(9, 14), match='Impf.'> {'citation': None, 'refauteur': None, 'refouvr1': None, 'refouvr2': 'Impf.'}
<re.Match object; span=(14, 23), match=' ἐτέλουν,'> {'citation': ' ἐτέλουν,', 'refauteur': None, 'refouvr1': None, 'refouvr2': None}
<re.Match object; span=(26, 34), match=' τελέσω,'> {'citation': ' τελέσω,', 'refauteur': None, 'refouvr1': None, 'refouvr2': None}
<re.Match object; span=(39, 44), match=' τελῶ'> {'citation': ' τελῶ', 'refauteur': None, 'refouvr1': None, 'refouvr2': None}
<re.Match object; span=(50, 59), match=' ἐτέλεσα,'> {'citation': ' ἐτέλεσα,', 'refauteur': None, 'refouvr1': None, 'refouvr2': None}
<re.Match object; span=(63, 72), match=' τετέλεκα'> {'citation': ' τετέλεκα', 'refauteur': None, 'refouvr1': None, 'refouvr2': None}
<re.Match object; span=(83, 97), match=' τελεσθήσομαι,'> {'citation': ' τελεσθήσομαι,', 'refauteur': None, 'refouvr1': None, 'refouvr2': None}
<re.Match object; span=(101, 112), match=' ἐτελέσθην,'> {'citation': ' ἐτελέσθην,', 'refauteur': None, 'refouvr1': None, 'refouvr2': None}
<re.Match object; span=(116, 127), match=' τετέλεσμαι'> {'citation': ' τετέλεσμαι', 'refauteur': None, 'refouvr1': None, 'refouvr2': None}
f. PLAT. Crat. 384 a, Rsp. 378 b, etc. ; XÉN. DÉM. etc.
<re.Match object; span=(3, 14), match='PLAT. Crat.'> {'citation': None, 'refauteur': 'PLAT. ', 'refouvr1': 'Crat.', 'refouvr2': None}
<re.Match object; span=(22, 26), match='Rsp.'> {'citation': None, 'refauteur': None, 'refouvr1': None, 'refouvr2': 'Rsp.'}
<re.Match object; span=(41, 46), match='XÉN. '> {'citation': None, 'refauteur': 'XÉN. ', 'refouvr1': None, 'refouvr2': None}
<re.Match object; span=(46, 51), match='DÉM. '> {'citation': None, 'refauteur': 'DÉM. ', 'refouvr1': None, 'refouvr2': None}

arbiel · Jan-08-2026, 07:15 PM

Great.

However, I will replace S|AR?|R with S|A|R, as I don't mind separating the A from the R. This will save me from having both refouvr1 and refouvr2.

I will also add F (S|A|R|F), for French sentences, to avoid getting a separate match for every single Latin character.
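A rough sketch of what such an S|A|R|F filter might look like follows. All four sub-patterns are simplified assumptions (ASCII-only for A, R and F, and the F pattern in particular is a guess at what "French text" could mean here); m.lastgroup is used to tell which alternative matched.

```python
import re

# Hypothetical, simplified sub-patterns; none of them is the thread's exact pattern.
S = r"[\u0370-\u03FF\u1F00-\u1FFE-]+"   # Greek word, hyphen allowed
A = r"[A-Z]{2,}[.]"                     # author abbreviation, e.g. PLAT.
R = r"[A-Z][a-z]+[.]"                   # work abbreviation, e.g. Crat.
F = r"[A-Za-z]+"                        # any other run of Latin letters (a guess)

# A and R come before F so that abbreviations are not swallowed as plain words.
# Every alternative consumes at least one character, so no empty matches occur.
TOKEN = re.compile(rf"(?P<S>{S})|(?P<A>{A})|(?P<R>{R})|(?P<F>{F})")

line = "f. PLAT. Crat. 384 a, Rsp. 378 b, etc."
kinds = [(m.lastgroup, m.group()) for m in TOKEN.finditer(line)]
for kind, value in kinds:
    print(kind, value)
```

With one named group per alternative, each match has exactly one non-None entry, so there is no need for numbered variants such as refouvr1 and refouvr2.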

I did not know about the "f" prefix (f-strings); I understand that it replaces the variables enclosed in curly braces with their values.
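For reference, here is a small sketch of how f-strings combine with regular expressions. The sub-patterns are shortened from the ones in the thread and the variable names are only illustrative. One point to watch: a regex quantifier such as {2,} written directly inside an f-string has to double its braces, whereas braces inside an interpolated variable are left alone.

```python
import re

# Plain raw strings: {2,} here is an ordinary regex quantifier.
author = r"[A-ZÀÁÉÈÆŒ]{2,}[.]"
work = r"[A-Z][a-z]+[.]"

# rf"..." is a raw f-string: {author} and {work} are replaced by the variables' values.
pattern = rf"(?P<refauteur>{author})\s*(?P<refouvr>{work})?"

# Braces typed directly in the f-string must be doubled to stay literal.
two_or_more = rf"(?:{author}\s*){{2,}}"

m = re.match(pattern, "PLAT. Crat. 384 a")
print(pattern)
print(two_or_more)
print(m.groupdict())
```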

Thanks a lot Gribouillis

Have a nice day.

Arbiel

***snippsat*** · Jan-09-2026, 06:28 PM

If you use the third-party regex module, it supports Unicode properties like \p{Greek}, which makes this easier.
Then you can tokenize more cleanly:

import regex as re

TOK = re.compile(r"""
(?P<GREEK>\p{Greek}+)
|(?P<AUTHOR>(?:[A-Z]\.\s+[A-ZÀÁÉÈÆŒ]{2,}\.|[A-ZÀÁÉÈÆŒ]+\.) )
|(?P<WORK>(?:[A-Z]\.\s*[A-Z]\.|[A-Z][a-zàáéèæœ]+\.) )
|(?P<SPACE>\s+)
|(?P<OTHER>.)
""", re.X)

author_map = {"PLAT.": "Platon", "XÉN.": "Xénophon"}  # etc.

text = """τελέω-ῶ (Impf. ἐτέλουν, f. τελέσω, att. τελῶ ; ao. ἐτέλεσα, pf. τετέλεκα ; pass. f. τελεσθήσομαι, ao. ἐτελέσθην, pf. τετέλεσμαι) ;
f. PLAT. Crat. 384 a, Rsp. 378 b, etc. ; XÉN. DÉM. etc.
"""

out = []
for m in TOK.finditer(text):
    kind = m.lastgroup
    val = m.group()
    if kind == "AUTHOR":
        out.append(author_map.get(val.strip(), val))
    else:
        out.append(val)
result = "".join(out)
print(result)

Output:
τελέω-ῶ (Impf. ἐτέλουν, f. τελέσω, att. τελῶ ; ao. ἐτέλεσα, pf. τετέλεκα ; pass. f. τελεσθήσομαι, ao. ἐτελέσθην, pf. τετέλεσμαι) ;
f. Platon Crat. 384 a, Rsp. 378 b, etc. ; Xénophon DÉM. etc.

Pedroski55 · Jan-11-2026, 02:57 AM

You don't need a regex to replace the author abbreviations with their full names:

text = """τελέω-ῶ (Impf. ἐτέλουν, f. τελέσω, att. τελῶ ; ao. ἐτέλεσα, pf. τετέλεκα; pass. f. τελεσθήσομαι, ao. ἐτελέσθην, pf. τετέλεσμαι) ;
f. PLAT. Crat. 384 a, Rsp. 378 b, etc.; XÉN.  Vat.  385 c, Rsq.  379 d, etc.; DÉM.  Bat.  386 d, Rsq.  380 e; 
"""
# split() breaks on any whitespace, so line breaks and repeated spaces are lost
text_list = text.split()
author_map = {"PLAT.": "Platon", "XÉN.": "Xénophon", "DÉM.": "Démosthene"}
# swap each token that is a known abbreviation; leave every other token as it is
text_list = [author_map.get(token, token) for token in text_list]

res = ' '.join(text_list)

Gives res:

'τελέω-ῶ (Impf. ἐτέλουν, f. τελέσω, att. τελῶ ; ao. ἐτέλεσα, pf. τετέλεκα; pass. f. τελεσθήσομαι, ao. ἐτελέσθην, pf. τετέλεσμαι) ; f. Platon Crat. 384 a, Rsp. 378 b, etc.; Xénophon Vat. 385 c, Rsq. 379 d, etc.; Démosthene Bat. 386 d, Rsq. 380 e;'
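Note that the split/join approach collapses line breaks and repeated spaces, and it misses an abbreviation that has punctuation attached to it, such as "(XÉN.". If that matters, one possible alternative is a single re.sub over the known abbreviations. This is only a sketch: the helper name expand_authors and the sample string are made up for illustration.

```python
import re

# Same mapping as in the post above; extend as needed.
author_map = {"PLAT.": "Platon", "XÉN.": "Xénophon", "DÉM.": "Démosthene"}

# One alternation of the escaped abbreviations; (?<!\w) prevents a match
# that would start in the middle of a longer word.
abbrev = re.compile(r"(?<!\w)(?:" + "|".join(map(re.escape, author_map)) + r")")

def expand_authors(text: str) -> str:
    """Replace known author abbreviations and leave every other character untouched."""
    return abbrev.sub(lambda m: author_map[m.group()], text)

sample = "f. PLAT. Crat. 384 a,\n(XÉN.  Vat.  385 c)"
expanded = expand_authors(sample)
print(expanded)
```

Because only the matched abbreviations are rewritten, the original spacing and line breaks survive.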

If I knew exactly what a citation should look like, we could find that too! How do we know where a citation starts and ends?

How to identify a citation?
