Posts: 86
Threads: 35
Joined: Feb 2020
Jan-07-2026, 05:38 PM
(This post was last modified: Jan-07-2026, 07:36 PM by Gribouillis.)
Hi
I'm currently writing a Python script that creates HTML pages from entries extracted from a Greek-to-French dictionary. These entries contain short references to authors and their works, which I want to replace with their meaning.
I'm using re.finditer to find the Greek examples and the references I'm looking for. It returns an empty string where I expect None.
Here is an example of what is going on.
#!/usr/bin/env python3.8
# -*- coding: utf-8 -*-
#
import sys
import os
from PyQt5 import QtWidgets
from PyQt5.QtWidgets import QApplication
import re

class globl () :
    def __init__(gbl):
        gbl.filtreCitation = filtrecitation()
        gbl.corps = ""
        gbl.auteur = None
        gbl.refauteur = None
        gbl.refouvr = None

def filtrecitation () :
    # filtre=r"((?P<citation>([Ͱ-Ͽἀ-῾]+([',;:]? ?)?)+)|(?P<refauteur>([A-ZÀÁÉÈÆŒ]+[.]( [A-ZÀÁÉÈÆŒ]{2,}[.])?))(\s+?P<refouvr>([A-ZÀÁÉÆÈŒ][a-zàáéèæœ]*[.]){1,2}(\s+[A-ZÀÁÉÆÈŒ][a-zàáéèæœ]+[.])?)?)"
    filtre=r"""((?P<citation>(
        \s*[Ͱ-Ͽἀ-῾]+ # a word made of Greek letters
        ([',;:]?)? # optionally ended by one of these characters, and possibly spaces
        )+ # repeated to form
        ) # a sentence
        | # or a reference; every reference ends with <.>
        (?P<refauteur> # in one or several parts
        ([A-Z][.]\s+[A-ZÀÁÉÈÆŒ]{2,}[.]) # a two-part reference: one letter, then at least 2 letters
        | # or a one-part reference
        ([A-ZÀÁÉÈÆŒ]+[.])?
        )\s* # the author reference may be absent
        # followed by the reference to a work
        (?P<refouvr>(
        [A-Z][.]\s*[A-Z][.]|[A-Z][a-zàáéèæœ]+[.]) # a capital followed by a capital, or by several lowercase letters
        )?)
        """
    return re.compile(filtre,re.X)

def balisePara (ligne):
    print(ligne)
    for m in re.finditer(gbl.filtreCitation,ligne) :
        print(m,m.groupdict())
    return()

def yml_vocab(leTexte):
    for ligne in leTexte.split('\n') :
        paraHtml=balisePara(ligne)
    return()

grec="""τελέω-ῶ (Impf. ἐτέλουν, f. τελέσω, att. τελῶ ; ao. ἐτέλεσα, pf. τετέλεκα ; pass. f. τελεσθήσομαι, ao. ἐτελέσθην, pf. τετέλεσμαι) ;
f. PLAT. Crat. 384 a, Rsp. 378 b, etc. ; XÉN. DÉM. etc.
"""

def principale() :
    global gbl
    gbl=globl()
    yml_vocab(grec)
    return

if __name__ == '__main__':
    principale()
Here are the first few lines of the output of this example script:
Output: arbiel@arbiel-NJ5x-NJ7xLU:~$ python '/home/arbiel/Bureau/Grec/étude_grec/ap_test_filtrecitation.py'
τελέω-ῶ (Impf. ἐτέλουν, f. τελέσω, att. τελῶ ; ao. ἐτέλεσα, pf. τετέλεκα ; pass. f. τελεσθήσομαι, ao. ἐτελέσθην, pf. τετέλεσμαι) ;
<re.Match object; span=(0, 5), match='τελέω'> {'citation': 'τελέω', 'refauteur': None, 'refouvr': None}
<re.Match object; span=(5, 5), match=''> {'citation': None, 'refauteur': '', 'refouvr': None}
<re.Match object; span=(6, 7), match='ῶ'> {'citation': 'ῶ', 'refauteur': None, 'refouvr': None}
<re.Match object; span=(7, 8), match=' '> {'citation': None, 'refauteur': '', 'refouvr': None}
<re.Match object; span=(8, 8), match=''> {'citation': None, 'refauteur': '', 'refouvr': None}
I do not understand.
Arbiel
using Ubuntu 22.04.5 LTS, Python 3.10.12
having substituted «https://www.lilo.org/fr/» to google, «https://protonmail.com/» to any other unsafe mail service and bépo to azerty (french keyboard layouts)
Posts: 4,904
Threads: 79
Joined: Jan 2018
Jan-07-2026, 07:55 PM
(This post was last modified: Jan-07-2026, 07:56 PM by Gribouillis.)
Hi?
Independently of the current implementation, which citations, authors and references should the program detect in the text
"""τελέω-ῶ (Impf. ἐτέλουν, f. τελέσω, att. τελῶ ; ao. ἐτέλεσα, pf. τετέλεκα ; pass. f. τελεσθήσομαι, ao. ἐτελέσθην, pf. τετέλεσμαι) ;
f. PLAT. Crat. 384 a, Rsp. 378 b, etc. ; XÉN. DÉM. etc.
""" ?
Also, what exactly don't you understand in the program's current output?
« We can solve any problem by introducing an extra level of indirection »
Posts: 2
Threads: 0
Joined: Dec 2025
Thank you for the Information
Posts: 2
Threads: 0
Joined: Dec 2025
The text contains standard classical Greek lexicon references. The program should detect authors such as Plato (PLAT.), Xenophon (XÉN.), and Demosthenes (DÉM.), along with works like Plato's Cratylus (Crat.) and Republic (Rsp.). It should also recognize citation locations such as 384 a and 378 b, which are Stephanus references. Grammatical abbreviations (e.g., Impf., ao., pf., pass.) are not citations and should be ignored.
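As a minimal sketch of how the citation locations could be picked out with the standard re module: the pattern below assumes a Stephanus reference is a page number, an optional space, and a single section letter from a to e. The name STEPHANUS is hypothetical and does not come from the script under discussion.

```python
import re

# Assumption: a Stephanus reference is a page number, an optional space,
# and one section letter a-e that is not part of a longer word.
STEPHANUS = re.compile(r"\b(?P<page>\d+)\s?(?P<section>[a-e])\b")

line = "f. PLAT. Crat. 384 a, Rsp. 378 b, etc. ; XÉN. DÉM. etc."

# Collect (page, section) pairs in the order they appear in the line.
locations = [(m["page"], m["section"]) for m in STEPHANUS.finditer(line)]
print(locations)
```

The word boundaries keep the pattern from picking up a letter that merely starts a following word.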
Posts: 86
Threads: 35
Joined: Feb 2020
Hi Gribouillis
I beg your pardon, I was a little short of time when I posted my message.
As stellacaroline5 points out, the script is intended to recognise references to authors, such as PLAT. for Platon, XÉN. for Xénophon and DÉM. for Démosthene. It must also recognise Greek words. This is supposed to be done by the regular expression:
filtre=r"""((?P<citation>(
    \s*[Ͱ-Ͽἀ-῾]+ # a word made of Greek letters
    ([',;:]?)? # optionally ended by one of these characters, and possibly spaces
    )+ # repeated to form
    ) # a sentence
    | # or a reference; every reference ends with <.>
    (?P<refauteur> # in one or several parts
    ([A-Z][.]\s+[A-ZÀÁÉÈÆŒ]{2,}[.]) # a two-part reference: one letter, then at least 2 letters
    | # or a one-part reference
    ([A-ZÀÁÉÈÆŒ]+[.])?
    )\s* # the author reference may be absent
    # followed by the reference to a work
    (?P<refouvr>(
    [A-Z][.]\s*[A-Z][.]|[A-Z][a-zàáéèæœ]+[.]) # a capital followed by a capital, or by several lowercase letters
    )?)
    """
This filter is meant to match either a succession of Greek words, "(?P<citation>(\s*[Ͱ-Ͽἀ-῾]+([',;:]?)?)+)", or the abbreviation of a Greek author, written in two parts (a single Latin capital letter, a dot, then two or more Latin capital letters and a dot) or in one part (several Latin capital letters followed by a dot), "(?P<refauteur>([A-Z][.]\s+[A-ZÀÁÉÈÆŒ]{2,}[.])|([A-ZÀÁÉÈÆŒ]+[.])?)\s*", possibly followed by a reference to a work of this author.
The result of applying the filter is a dictionary whose keys are "citation", "refauteur" and "refouvr". When the analysed text does not match a part of the filter, I expect None for that key, as in
Output: <re.Match object; span=(6, 7), match='ῶ'> {'citation': 'ῶ', 'refauteur': None, 'refouvr': None}
So I don't understand why, when the analysed text is either a "-" or a space, I get outputs such as:
Output: <re.Match object; span=(5, 5), match=''> {'citation': None, 'refauteur': '', 'refouvr': None}
<re.Match object; span=(7, 8), match=' '> {'citation': None, 'refauteur': '', 'refouvr': None}
<re.Match object; span=(8, 8), match=''> {'citation': None, 'refauteur': '', 'refouvr': None}
where refauteur is an empty string rather than None, as citation and refouvr are.
I don't see what is wrong in my regular expression.
However, I'm going to include "-" among the possible Greek letters. In "τελέω-ῶ", this "-" means that the actual verb "τελῶ" results from the contraction of έω into ῶ. But the issue remains for single spaces.
Arbiel
Posts: 4,904
Threads: 79
Joined: Jan 2018
Jan-08-2026, 06:10 PM
(This post was last modified: Jan-08-2026, 06:16 PM by Gribouillis.)
(Jan-08-2026, 03:56 PM)arbiel Wrote: I don't see what is wrong in my regular expression.
To summarize the problem as I understand it, let's call S a Greek sentence, A an author and R a book reference. Your regular expression means S|A?R?, that is to say, a Greek sentence, or an optional author followed by an optional reference. The problem is that A?R? can match the empty string, so finditer reports an empty match wherever neither a sentence nor a reference starts. I suggest replacing it with a pattern such as S|AR?|R, which eliminates all empty matches. The only drawback is that you get two 'refouvr' groups instead of one.
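The difference between '' and None comes from how re reports groups: a group that took part in the match but consumed nothing yields '', while a group that did not take part at all yields None. In the original filter the content of refauteur is optional but the group itself is not, so it always takes part in the second alternative. Here is a small sketch with deliberately simplified stand-ins for S, A and R (these three sub-patterns are assumptions for illustration, not the real ones):

```python
import re

greek = r"[α-ω]+"          # simplified stand-in for S (a Greek word)
author = r"[A-Z]+[.]"      # simplified stand-in for A (author abbreviation)
work = r"[A-Z][a-z]+[.]"   # simplified stand-in for R (work abbreviation)

# S | A? R? : the second alternative can succeed without an author or a work.
loose = re.compile(
    rf"(?P<citation>{greek})|(?P<refauteur>(?:{author})?)\s*(?P<refouvr>{work})?"
)
# S | A R? | R : every alternative must consume at least one character.
strict = re.compile(
    rf"(?P<citation>{greek})|(?P<refauteur>{author})\s*(?P<refouvr1>{work})?"
    rf"|(?P<refouvr2>{work})"
)

text = "τελ-ω PLAT. Crat."

# Spans where refauteur took part in the match but matched the empty string.
empty_loose = [m.span() for m in loose.finditer(text) if m["refauteur"] == ""]
# Spans of matches that are entirely empty with the stricter pattern.
empty_strict = [m.span() for m in strict.finditer(text) if m.group() == ""]

print(empty_loose)
print(empty_strict)
```

With the loose pattern the list is not empty, which mirrors the refauteur '' entries in the output above; with the strict pattern no empty match can occur.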
Here is a modified implementation
#!/usr/bin/env python3.8
# -*- coding: utf-8 -*-
#
import sys
import os
from PyQt5 import QtWidgets
from PyQt5.QtWidgets import QApplication
import re

class globl () :
    def __init__(gbl):
        gbl.filtreCitation = filtrecitation()
        gbl.corps = ""
        gbl.auteur = None
        gbl.refauteur = None
        gbl.refouvr = None

def filtrecitation () :
    sentence = r"(?:(?:\s*[Ͱ-Ͽἀ-῾-]+[',;:]?)+)"
    author = r"(?:(?:[A-Z][.]\s+[A-ZÀÁÉÈÆŒ]{2,}[.])|(?:[A-ZÀÁÉÈÆŒ]+[.])\s*)"
    refouvr = r"(?:[A-Z][.]\s*[A-Z][.]|[A-Z][a-zàáéèæœ]+[.])"
    filtre = f"(?P<citation>{sentence})|(?P<refauteur>{author})(?P<refouvr1>{refouvr})?|(?P<refouvr2>{refouvr})"
    return re.compile(filtre,re.X)

def balisePara (ligne):
    print(ligne)
    for m in re.finditer(gbl.filtreCitation,ligne) :
        print(m,m.groupdict())
    return()

def yml_vocab(leTexte):
    for ligne in leTexte.split('\n') :
        paraHtml=balisePara(ligne)
    return()

grec="""τελέω-ῶ (Impf. ἐτέλουν, f. τελέσω, att. τελῶ ; ao. ἐτέλεσα, pf. τετέλεκα ; pass. f. τελεσθήσομαι, ao. ἐτελέσθην, pf. τετέλεσμαι) ;
f. PLAT. Crat. 384 a, Rsp. 378 b, etc. ; XÉN. DÉM. etc.
"""

def principale() :
    global gbl
    gbl=globl()
    yml_vocab(grec)
    return

if __name__ == '__main__':
    principale()
And the result
Output: λ python paillasse/pf/arbiel.py
τελέω-ῶ (Impf. ἐτέλουν, f. τελέσω, att. τελῶ ; ao. ἐτέλεσα, pf. τετέλεκα ; pass. f. τελεσθήσομαι, ao. ἐτελέσθην, pf. τετέλεσμαι) ;
<re.Match object; span=(0, 7), match='τελέω-ῶ'> {'citation': 'τελέω-ῶ', 'refauteur': None, 'refouvr1': None, 'refouvr2': None}
<re.Match object; span=(9, 14), match='Impf.'> {'citation': None, 'refauteur': None, 'refouvr1': None, 'refouvr2': 'Impf.'}
<re.Match object; span=(14, 23), match=' ἐτέλουν,'> {'citation': ' ἐτέλουν,', 'refauteur': None, 'refouvr1': None, 'refouvr2': None}
<re.Match object; span=(26, 34), match=' τελέσω,'> {'citation': ' τελέσω,', 'refauteur': None, 'refouvr1': None, 'refouvr2': None}
<re.Match object; span=(39, 44), match=' τελῶ'> {'citation': ' τελῶ', 'refauteur': None, 'refouvr1': None, 'refouvr2': None}
<re.Match object; span=(50, 59), match=' ἐτέλεσα,'> {'citation': ' ἐτέλεσα,', 'refauteur': None, 'refouvr1': None, 'refouvr2': None}
<re.Match object; span=(63, 72), match=' τετέλεκα'> {'citation': ' τετέλεκα', 'refauteur': None, 'refouvr1': None, 'refouvr2': None}
<re.Match object; span=(83, 97), match=' τελεσθήσομαι,'> {'citation': ' τελεσθήσομαι,', 'refauteur': None, 'refouvr1': None, 'refouvr2': None}
<re.Match object; span=(101, 112), match=' ἐτελέσθην,'> {'citation': ' ἐτελέσθην,', 'refauteur': None, 'refouvr1': None, 'refouvr2': None}
<re.Match object; span=(116, 127), match=' τετέλεσμαι'> {'citation': ' τετέλεσμαι', 'refauteur': None, 'refouvr1': None, 'refouvr2': None}
f. PLAT. Crat. 384 a, Rsp. 378 b, etc. ; XÉN. DÉM. etc.
<re.Match object; span=(3, 14), match='PLAT. Crat.'> {'citation': None, 'refauteur': 'PLAT. ', 'refouvr1': 'Crat.', 'refouvr2': None}
<re.Match object; span=(22, 26), match='Rsp.'> {'citation': None, 'refauteur': None, 'refouvr1': None, 'refouvr2': 'Rsp.'}
<re.Match object; span=(41, 46), match='XÉN. '> {'citation': None, 'refauteur': 'XÉN. ', 'refouvr1': None, 'refouvr2': None}
<re.Match object; span=(46, 51), match='DÉM. '> {'citation': None, 'refauteur': 'DÉM. ', 'refouvr1': None, 'refouvr2': None}
Posts: 86
Threads: 35
Joined: Feb 2020
Great.
However, I will replace S|AR?|R with S|A|R, as I don't mind separating A from R. This will spare me having both refouvr1 and refouvr2.
I will also add F (S|A|R|F), for French sentences, to avoid getting one match per single Latin character.
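A rough sketch of what such an S|A|R|F filter might look like, using m.lastgroup to tell which alternative matched. The sub-patterns are simplified, and the F pattern in particular is only a guess at what "French words" could mean here; all four names are assumptions for illustration.

```python
import re

greek = r"[Ͱ-Ͽἀ-῾-]+"                       # S: a run of Greek letters (and "-")
author = r"[A-ZÀÁÉÈÆŒ]{2,}[.]"              # A: capitals followed by a dot
work = r"[A-Z][a-zàáéèæœ]+[.]"              # R: capital, lowercase letters, dot
french = r"[A-Za-zàâçéèêëîïôùûüæœ]+[.]?"    # F: hypothetical, any other Latin word

# Alternatives are tried left to right, so A and R must come before F,
# otherwise F would swallow the author and work abbreviations.
token = re.compile(
    rf"(?P<S>{greek})|(?P<A>{author})|(?P<R>{work})|(?P<F>{french})"
)

line = "f. PLAT. Crat. 384 a, Rsp. 378 b, etc. ; XÉN. DÉM. etc."

# lastgroup is the name of the alternative that matched.
tokens = [(m.lastgroup, m.group()) for m in token.finditer(line)]
for kind, value in tokens:
    print(kind, repr(value))
```

Since every alternative has to consume at least one character, no empty matches can appear, and each match belongs to exactly one of the four groups.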
I did not know the "f" prefix (f-strings); I understand that it replaces the expressions enclosed in curly braces with their values.
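One thing to watch when building regular expressions this way: inside an f-string, literal braces must be doubled, so a quantifier such as {2,} has to be written {{2,}} if it sits in the f-string itself. In the modified implementation this does not arise, because {2,} lives in a plain raw string that is only interpolated. A short sketch (the variable names are made up for the example):

```python
import re

work = r"[A-Z][a-z]+[.]"   # plain raw string: braces would need no doubling here

# In the f-string itself, {work} is interpolated and {{2,}} becomes a literal {2,}.
author_then_work = rf"(?P<refauteur>[A-Z]{{2,}}[.])\s*(?P<refouvr>{work})?"

m = re.match(author_then_work, "PLAT. Crat.")
print(author_then_work)
print(m.groupdict())
```

Combining the r and f prefixes (rf"...") keeps backslashes such as \s untouched while still allowing interpolation.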
Thanks a lot Gribouillis
Have a nice day.
Arbiel
Posts: 7,431
Threads: 125
Joined: Sep 2016
If you use the third-party regex module, it supports Unicode properties such as \p{Greek}, which makes this easier.
You can then tokenize more cleanly:
import regex as re

TOK = re.compile(r"""
    (?P<GREEK>\p{Greek}+)
    |(?P<AUTHOR>(?:[A-Z]\.\s+[A-ZÀÁÉÈÆŒ]{2,}\.|[A-ZÀÁÉÈÆŒ]+\.) )
    |(?P<WORK>(?:[A-Z]\.\s*[A-Z]\.|[A-Z][a-zàáéèæœ]+\.) )
    |(?P<SPACE>\s+)
    |(?P<OTHER>.)
""", re.X)

author_map = {"PLAT.": "Platon", "XÉN.": "Xénophon"} # etc.

text = """τελέω-ῶ (Impf. ἐτέλουν, f. τελέσω, att. τελῶ ; ao. ἐτέλεσα, pf. τετέλεκα ; pass. f. τελεσθήσομαι, ao. ἐτελέσθην, pf. τετέλεσμαι) ;
f. PLAT. Crat. 384 a, Rsp. 378 b, etc. ; XÉN. DÉM. etc.
"""

out = []
for m in TOK.finditer(text):
    kind = m.lastgroup
    val = m.group()
    if kind == "AUTHOR":
        out.append(author_map.get(val.strip(), val))
    else:
        out.append(val)
result = "".join(out)
print(result)
Output: τελέω-ῶ (Impf. ἐτέλουν, f. τελέσω, att. τελῶ ; ao. ἐτέλεσα, pf. τετέλεκα ; pass. f. τελεσθήσομαι, ao. ἐτελέσθην, pf. τετέλεσμαι) ;
f. Platon Crat. 384 a, Rsp. 378 b, etc. ; Xénophon DÉM. etc.
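If installing regex is not an option, a rough approximation of \p{Greek} can be had from the standard library. The sketch below assumes that a character counts as Greek when its Unicode name starts with "GREEK", which is not exactly the same set as the Greek script property; the helper names are hypothetical.

```python
import unicodedata

def is_greek(ch):
    # Assumption: treat a character as Greek when its Unicode name starts
    # with "GREEK"; unnamed characters fall back to "" and are rejected.
    return unicodedata.name(ch, "").startswith("GREEK")

def greek_runs(text):
    """Return the maximal runs of Greek characters found in text."""
    runs, current = [], []
    # NFC composes letter + combining accent into a single Greek code point.
    for ch in unicodedata.normalize("NFC", text):
        if is_greek(ch):
            current.append(ch)
        elif current:
            runs.append("".join(current))
            current = []
    if current:
        runs.append("".join(current))
    return runs

print(greek_runs("att. τελῶ ; ao. ἐτέλεσα, pf."))
```

This only finds the Greek words; the author and work abbreviations would still need patterns like the ones above.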
Posts: 1,300
Threads: 151
Joined: Jul 2017
You don't need a regex to replace the author abbreviations with their full names:
text = """τελέω-ῶ (Impf. ἐτέλουν, f. τελέσω, att. τελῶ ; ao. ἐτέλεσα, pf. τετέλεκα; pass. f. τελεσθήσομαι, ao. ἐτελέσθην, pf. τετέλεσμαι) ;
f. PLAT. Crat. 384 a, Rsp. 378 b, etc.; XÉN. Vat. 385 c, Rsq. 379 d, etc.; DÉM. Bat. 386 d, Rsq. 380 e;
"""
text_list = text.split()
author_map = {"PLAT.": "Platon", "XÉN.": "Xénophon", "DÉM.": "Démosthene"}
# Replace each token that is a known abbreviation; leave the others unchanged.
text_list = [author_map.get(word, word) for word in text_list]
res = ' '.join(text_list)
Gives res:
'τελέω-ῶ (Impf. ἐτέλουν, f. τελέσω, att. τελῶ ; ao. ἐτέλεσα, pf. τετέλεκα; pass. f. τελεσθήσομαι, ao. ἐτελέσθην, pf. τετέλεσμαι) ; f. Platon Crat. 384 a, Rsp. 378 b, etc.; Xénophon Vat. 385 c, Rsq. 379 d, etc.; Démosthene Bat. 386 d, Rsq. 380 e;'
Note that split() followed by join() collapses the line break into a single space.
If I knew exactly what a citation should look like, we could find that too! How do we know where a citation starts and ends?
How do we identify a citation?