File Name Parsing

millpond · (This post was last modified: Aug-25-2020, 07:28 AM by millpond.)

After months of putzing around with toy code and simple example scripts, I decided to get started on some real code I will actually use. I am extending my file name filter collection, but in Python instead of Perl.

#!c:\python38\python38.exe


"""
This is designed to be a general filter for renaming files. 
All lines in FILTER section are meant to be replaceable for the situation at hand.
DANGER! DANGER!!! WILL ROBINSON!!!!!
This script is designed to be run ONLY in an isolated directory with select files. 
It *WILL* likely kill a system directory. 
"""



import os
import pathlib
import time 
import re                                           
from pathlib import Path                            


path = '.'

 
for file in os.listdir(path):
    dir = [os.path.join(path, file)]    # Directory Listing 
    for filename in dir:                # Each file - must be in list format else will parse as chars
        newname = filename              # Keep original name
        if Path(newname).is_file():     # Only if file
            if re.search(r"\.py\Z", newname): break  # don't do .py files (re.match only matches at the start of the string)
            base = os.path.splitext(newname)   # split name[0] and extension[1]
            extension = base[1]  #extension
            if  extension ==  '.py' : break #dont do .py files
            newfile = base[0]    #file name. Remove .extensions for now. 

###################################### FILTERS 
            newfile = re.sub("\\-", " ", newfile )  #substitutions here, with mutation
            newfile = re.sub(r'\d\d\d\d\d+', r' ',newfile)   # Kill numbers greater than date
            newfile = re.sub(r'_', r' ',newfile)
            newfile = re.sub("(?!^)\\.", r" ",newfile) # Kill dots. The negative lookahead (?!^) skips a dot at the very start of the string.
            newfile = re.sub("\s+"," ",newfile)  # Kill extra spaces

###################################################################################################


            newfilext = newfile+extension


# Inelegant way to make sure equal strings match, as newfilext cannot equal filename with .(space)\ at start
            a = re.sub(' ', '', newfilext).strip()   # strip() returns a new string, so keep the result
            b = re.sub(' ', '', filename).strip()
#            print(a, b) # For testing
            if a != b : print(a,b) 

#            print(newfilext+filename)   # For testing
            if a != b :  # dont overwrite existing 
                print(f"{filename} is being moved to {newfilext}" )
                os.rename(filename,newfilext)                             # rename old to new 
#                time.sleep(3)  # For testing

The code works, albeit with some rough edges that need to be ironed out as this expands to a couple of hundred lines.

Python, I see, has a few gotchas, like being finicky about the backslash and requiring extras where I am not used to them. \. doesn't seem to work to escape a dot! I need \\. And '.'+foo seemed to want to produce '.(space)/foo' for file operations.

One potential issue here is that these scripts are disastrous if used in the wrong directories.
In Perl I can force the scripts to run at a minimum of 3-4 levels deep in the directory structure. I don't know how to do this in Python (or how to count the bloody backslashes!). I can limit them to a specific directory, but that is not practical here, as these are run from a bunch of directories.

Is re the best library for regexes? In normal practice I often use them in place of split, but I don't see a simple way of doing that with re.sub.

Any pointers on fixing and cleaning up the code would be appreciated, as would non-trivial links on doing some heavy lifting with real-world regexes. The basic tutorials are often so simplistic as to be confusing.

Until I get up to speed, I will be over-commenting with my scripts. A screwup can and has caused the loss of hundreds of files here.

I am over the 'hump' now. Python has passed the *acid test* and I do love the simplicity of file and directory access here. And parsing arrays like strings!

ndc85430 · Aug-25-2020, 07:32 AM

(Aug-25-2020, 07:28 AM)millpond Wrote: Until I get up to speed, I will be over-commenting with my scripts. A screwup can and has caused the loss of hundreds of files here.

Learn to write automated tests using, e.g. the unittest module or third-party libraries like pytest. That will allow you to have confidence that your software works as intended and continues to work as you make changes to it.
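
As a minimal sketch of what such a test could look like: the filter steps are pulled into a pure function so they can be checked without touching any files. The name `clean_stem` is hypothetical (it is not in the original script), and the expected strings are assumptions about what the filters are meant to produce. The test functions use plain `assert`, which pytest would collect.

```python
import re


def clean_stem(stem):
    """Apply the FILTER substitutions to a file name without its extension.

    Hypothetical helper: the same substitutions as the script above,
    written with raw strings and operating on the bare name only.
    """
    stem = re.sub(r"-", " ", stem)          # dashes -> spaces
    stem = re.sub(r"\d{5,}", " ", stem)     # runs of 5+ digits -> space
    stem = re.sub(r"_", " ", stem)          # underscores -> spaces
    stem = re.sub(r"(?!^)\.", " ", stem)    # dots -> spaces, except a leading dot
    stem = re.sub(r"\s+", " ", stem)        # collapse whitespace runs
    return stem


def test_separators_become_spaces():
    assert clean_stem("some_file-name.v2") == "some file name v2"


def test_long_numbers_removed_but_years_kept():
    # Note the trailing space: the filters do not trim the result.
    assert clean_stem("report_2020_1234567") == "report 2020 "


def test_leading_dot_preserved():
    assert clean_stem(".hidden.file") == ".hidden file"
```

A test like the second one is also a cheap way to spot unintended consequences, such as the trailing space left behind when a number is removed from the end of a name.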

bowlofred · Aug-25-2020, 08:41 AM

(Aug-25-2020, 07:28 AM)millpond Wrote: Python, I see, has a few gotchas, like being finicky about the backslash and requiring extras where I am not used to them. \. doesn't seem to work to escape a dot! I need \\. And '.'+foo seemed to want to produce '.(space)/foo' for file operations.

In a regular string, backslashes are special. As an example, "\n" is a newline character, not a backslash followed by an n. When you've got regular expressions with several backslashes, this is inconvenient. You can instead use "r-strings" where the backslash character is not special. r"\." should behave as you expect. You seem to be using r-strings for some, but not all, of your regular expressions above.
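
A small illustration of the difference, using the sample string that comes up later in the thread. The variable names are only for illustration.

```python
import re

# In a regular string the backslash starts an escape sequence.
assert len("\n") == 1        # one character: a newline
assert len(r"\n") == 2       # two characters: a backslash and an "n"

# Two spellings of the same two-character regex (backslash, dot).
assert "\\." == r"\."

a = "fee...fi....fo.....fum"
dots_replaced = re.sub(r"\.", "_", a)   # only literal periods are replaced
print(dots_replaced)                     # fee___fi____fo_____fum
```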

I'm not exactly sure what you're saying about the "space" thing. Can you make an example?

Quote:In Perl I can force the scripts to run at a minimum of 3-4 levels deep in the directory structure. I don't know how to do this in Python (or how to count the bloody backslashes!). I can limit them to a specific directory, but that is not practical here, as these are run from a bunch of directories.

I'm not sure I follow. os.walk could be used to descend into a directory, but I don't see it used here. You're right there's no simple flag to prune at a depth. You'd have to count the depth you've reached and prune yourself.
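
A rough sketch of both readings of the question, assuming the goal is either to refuse to run too close to the filesystem root or to stop os.walk below a given depth. `path_depth`, `require_min_depth` and `walk_limited` are hypothetical helper names, not standard library functions.

```python
import os
from pathlib import Path


def path_depth(path):
    """Number of directory levels below the drive or filesystem root."""
    resolved = Path(path).resolve()
    return len(resolved.parts) - 1   # parts[0] is the root/anchor, e.g. "/" or "C:\\"


def require_min_depth(path, min_depth=3):
    """Raise if `path` sits fewer than `min_depth` levels below the root."""
    depth = path_depth(path)
    if depth < min_depth:
        raise RuntimeError(f"{path!r} is only {depth} level(s) deep; refusing to run")


def walk_limited(top, max_depth):
    """Like os.walk, but do not descend more than `max_depth` levels below `top`."""
    top = os.path.abspath(top)
    base = top.rstrip(os.sep).count(os.sep)
    for dirpath, dirnames, filenames in os.walk(top):
        level = dirpath.rstrip(os.sep).count(os.sep) - base
        if level >= max_depth:
            dirnames[:] = []   # pruning in place stops os.walk from descending
        yield dirpath, dirnames, filenames
```

Calling `require_min_depth('.')` at the top of a rename script would abort before any file is touched if the current directory is too shallow. Using `Path.parts` avoids counting backslashes by hand.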

But in your code I just see os.listdir(), which doesn't descend further. (And you should probably consider using os.scandir() instead. It's faster if you need any information about the files other than the name).
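
For illustration, a minimal sketch of listing only regular files with os.scandir(). The function name `regular_files` is made up for this example.

```python
import os


def regular_files(path="."):
    """Return the names of regular files directly inside `path`, sorted."""
    with os.scandir(path) as entries:
        # DirEntry.is_file() can usually answer from data cached during the
        # directory scan, without a separate stat call per file.
        return sorted(entry.name for entry in entries if entry.is_file())
```

This replaces the os.listdir() plus Path(...).is_file() combination and does not descend into subdirectories.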

Quote:Is re the best library for regexes? In normal practice I often use them in place of split, but I don't see a simple way of doing that with re.sub.

Please give an example. Perl and python regexes are very similar.

millpond · Aug-26-2020, 04:57 AM

(Aug-25-2020, 07:32 AM)ndc85430 Wrote: Learn to write automated tests using, e.g. the unittest module or third-party libraries like pytest. That will allow you to have confidence that your software works as intended and continues to work as you make changes to it.

The problem is that the time I would need to spend programming an adequate test suite is more than the time I would spend on normal troubleshooting.

For example, converting 'ln,fn - Title - Series - date' to 'Title (Series) date - fn,ln' for nonfiction, and 'fn,ln - Title (date) _ Series' for fiction. I would need AI with Wikipedia API access to do that calibre of testing.
My greatest problems are unintended consequences of perfectly functioning code!

That's the problem with algos. They cannot anticipate the black cyber-swan.

(Think 2007 and quants...)

millpond · Aug-26-2020, 06:26 AM

@bowlofred

With regex, at least the PCRE I am used to, \. is always a period, and . means 'any character'.
In python it seems r'.' means any character.
Let's do regex strings:
a = "fee...fi....fo.....fum"
b = re.sub("\.","\-",a)
-> fee\-\-\-fi\-\-\-\- (&etc)
and I would need to give a bare "_" for it to work as
-> fee___fi____fo_____fum
b = re.sub(r'.',r'_',a)
-> _____________________
The entire string is wiped out.

r'.' is NOT a raw character, at least with the re class.
And escaping '-' (\-) is not working as expected in regex mode.

In Perl I would typically use something like:
x =~ s/^.*(fee).+(fi).+(fo).+(fum).*$/$1,$3,$2,$4/ -> ...fee...fo...fi...fum
I do see that the basic syntax seems OK at:
https://regex101.com/

Though re apparently uses \1 instead of the ancient \$1 format.

I see there is a python-pcre module, but it seems even more obfuscated than a Perl poem.

I started with raw mode, and switched to regex ("foo") mode when I ran into problems.
Not a problem really, all languages have their peculiarities.

By the space problem I mean (dot)(space)(backslash). If I write it as is, MyBB will kill the space. Concatenating '.'+'filename' was giving me that space, which was screwing up the matching. Omitting the step wound up fixing the problem. os.path apparently supplied its own . for the var.

bowlofred · (This post was last modified: Aug-26-2020, 08:04 AM by bowlofred.)

(Aug-26-2020, 06:26 AM)millpond Wrote: With regex, at least the PCRE I am used to, \. is always a period, and . means 'any character'.

The regex in python is the same here. "." is any character, and "\." is only the period.

>>> re.sub(".", "X", "hi.")
'XXX'
>>> re.sub(r"\.", "X", "hi.")
'hiX'

Quote:In python it seems r'.' means any character.

Either '.' or r'.' is a method of creating a string with a single period. The regex engine on receiving it interprets it as "any character".

Quote:Let's do regex strings:
a = "fee...fi....fo.....fum"
b = re.sub("\.","\-",a)

Perl had a rule that all valid backslash sequences in the regex engine (like \n being a newline character) were alphabetic characters. Therefore, you could always add a backslash before a symbol and it would be interpreted as just the raw symbol. Both - and \- would be interpreted as a dash (when outside of character set context).

Python doesn't have that rule. As \- isn't a valid escape sequence, it's interpreted as both characters during the replacement. From the re.sub documentation:

DOCUMENTATION Wrote:repl can be a string or a function; if it is a string, any backslash escapes in it are processed. That is, \n is converted to a single newline character, \r is converted to a carriage return, and so forth. Unknown escapes of ASCII letters are reserved for future use and treated as errors. Other unknown escapes such as \& are left alone.
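
A short demonstration of the three cases the documentation describes, reusing the sample string from the question. The `\q` escape is an arbitrary example of an unknown letter escape.

```python
import re

a = "fee...fi....fo.....fum"

# "\-" is not a recognised escape in a replacement, so both characters are kept.
kept = re.sub(r"\.", r"\-", a)

# A bare "-" is all that is needed.
plain = re.sub(r"\.", "-", a)

# An unknown escape made of an ASCII letter (here "\q") is rejected outright.
try:
    re.sub(r"\.", r"\q", a)
    letter_escape_error = None
except re.error as exc:
    letter_escape_error = exc

print(kept)                  # fee\-\-\-fi\-\-\-\-fo\-\-\-\-\-fum
print(plain)                 # fee---fi----fo-----fum
print(letter_escape_error)   # a "bad escape" error message
```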

Quote:b = re.sub(r'.',r'_',a)
-> _____________________
The entire string is wiped out.

r'.' is NOT a raw character, at least with the re class.

There's no such thing as a raw character. A "r-string" formulation does only one thing. It stops python from interpreting the backslash before handing it off to the regex engine. As "." has no backslashes, there is no difference between "." and r".". The "r-string" doesn't have anything to do with the regex engine. It's just setting a slightly different rule for how python strings are constructed.

>>> print("-->\t<--")  #This string has a tab character inside.
-->	<--
>>> print(r"-->\t<--") #This string has the two character sequence of a backslash and a letter t inside.
-->\t<--

Quote:And escaping '-' (\-) is not working as expected in regex mode.

- doesn't need escaping. It has no special meaning in a replacement string. It has no special meaning in a regex outside a character class, and you can't supply a character class in a replacement. Both python and perl will behave the same when - is used there, but only perl lets you also use \-
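
To make the character class distinction concrete, here is a small sketch on the pattern side (as opposed to the replacement side). The sample strings are invented for the example.

```python
import re

# Outside a character class "-" is an ordinary character in a pattern.
spaced = re.sub("-", " ", "some-file-name")
print(spaced)                                   # some file name

# Inside a character class "-" denotes a range...
as_range = re.findall(r"[a-c]", "a-b-c")
print(as_range)                                 # ['a', 'b', 'c']

# ...unless it is escaped or placed last, in which case it is a literal dash.
escaped = re.findall(r"[a\-c]", "a-b-c")
placed_last = re.findall(r"[ac-]", "a-b-c")
print(escaped)                                  # ['a', '-', '-', 'c']
print(placed_last)                              # ['a', '-', '-', 'c']
```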

Quote:In Perl I would typically use something like:
x =~ s/^.*(fee).+(fi).+(fo).+(fum).*$/$1,$3,$2,$4/ -> ...fee...fo...fi...fum

Though re apparently uses \1 instead of the ancient \$1 format.

Seems about the same in python (although neither perl nor python will print the periods in the replaced string).

>>> a = "fee...fi....fo.....fum"
>>> re.sub("^.*(fee).+(fi).+(fo).+(fum).*$", r"\1,\3,\2,\4", a)
'fee,fo,fi,fum'
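
The same backreference approach could be applied to the rename pattern mentioned earlier in the thread ('ln,fn - Title - Series - date' to 'Title (Series) date - fn,ln'). The following is only a sketch: the function names, the named groups and the sample file name are invented, and it assumes the fields are separated by ' - ' and never contain that separator themselves. re.split is included because it covers the "regex in place of split" use case directly.

```python
import re


def to_nonfiction(name):
    """Rearrange 'ln,fn - Title - Series - date' into 'Title (Series) date - fn,ln'."""
    pattern = (
        r"^(?P<ln>[^,]+),(?P<fn>.+?) - (?P<title>.+?)"
        r" - (?P<series>.+?) - (?P<date>.+)$"
    )
    # \g<name> in the replacement refers to a named group from the pattern.
    return re.sub(pattern, r"\g<title> (\g<series>) \g<date> - \g<fn>,\g<ln>", name)


def split_fields(name):
    """Split on dash separators, tolerating extra whitespace around the dash."""
    return re.split(r"\s+-\s+", name)


sample = "Doe,Jane - Some Title - Some Series - 2020"
print(to_nonfiction(sample))   # Some Title (Some Series) 2020 - Jane,Doe
print(split_fields(sample))    # ['Doe,Jane', 'Some Title', 'Some Series', '2020']
```

If the pattern does not match, re.sub returns the name unchanged, which is a reasonably safe default for a rename script.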
