Python Forum
[solved] re.split issue
#1
Hi,
It is a basic question, but I'm stuck:
  • one backslash is ignored and I do not understand why (see output)
  • it is the backslash followed by a number (\15 here)
As you can imagine, paths can come from any OS and have been recorded using os.walk.

Thanks for your contribution

import re

line = 'C:\XXXX\YYYY/./OOOOO\mlmlml\15xxx\TTTT'
reg = re.split('/|\\\\', line) 
print(reg)
Output:
['C:', 'XXXX', 'YYYY', '.', 'OOOOO', 'mlmlml\rxxx', 'TTTT']
#2
You can try a raw string literal:

import re

line = r'C:\XXXX\YYYY/./OOOOO\mlmlml\15xxx\TTTT'
reg = re.split('/|\\\\', line) 
print(reg)
Output:
['C:', 'XXXX', 'YYYY', '.', 'OOOOO', 'mlmlml', '15xxx', 'TTTT']
#3
Thanks Axel

However, since I'm using a list, how do you proceed with the following?

import re

line = 'C:\XXXX\YYYY/./OOOOO\mlmlml\15xxx\TTTT'
myList = [line, line]
for arg in myList:
    arg = r'%s' % arg  # attempt to "make it raw"; arg is unchanged
    reg = re.split('/|\\\\', arg)
    print(reg)
Output:
['C:', 'XXXX', 'YYYY', '.', 'OOOOO', 'mlmlml\rxxx', 'TTTT']
['C:', 'XXXX', 'YYYY', '.', 'OOOOO', 'mlmlml\rxxx', 'TTTT']
#4
Hi,

General advice: os.walk is the older, string-based API; you may want to take a look at the newer pathlib module, which is Python's standard object-oriented way of dealing with directory paths and filenames. pathlib has more and better options, especially for handling Windows and Unix paths and for splitting full paths into their components.
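As a small illustration of the splitting part, here is a sketch using PureWindowsPath, which parses Windows-style paths on any OS and accepts both / and \ as separators. The path is the test string from this thread, written as a raw literal; note that pathlib drops the lone "." component, unlike the re.split version.

```python
from pathlib import PureWindowsPath

# Raw literal so that \15 stays a backslash followed by "15".
p = PureWindowsPath(r'C:\XXXX\YYYY/./OOOOO\mlmlml\15xxx\TTTT')

# .parts gives the components; the drive and root come back together as 'C:\\'.
print(p.parts)
# ('C:\\', 'XXXX', 'YYYY', 'OOOOO', 'mlmlml', '15xxx', 'TTTT')

print(p.name)         # TTTT
print(p.parent.name)  # 15xxx
```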

Regards, noisefloor
#5
@noisefloor: OK, thanks for the feedback. After some investigation and tests, the following snippet does exactly what I'm looking for, and no regex is needed.

from pathlib import Path

# workingDir and workingsFile are defined elsewhere; Path.walk() requires Python 3.12+
tree = []
for pathObj, dirnames, filenames in Path(workingDir).walk():
    for file in filenames:
        if file == workingsFile:
            tree.append(pathObj.parts)
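For anyone on a Python version older than 3.12, where Path.walk() is not available, roughly the same result can be sketched with os.walk plus Path(...).parts. The function name find_parts, the file name target.txt and the temporary directory layout below are made up for the demonstration; they are not from the original post.

```python
import os
import tempfile
from pathlib import Path


def find_parts(working_dir, target_name):
    """Return the .parts tuple of every directory under working_dir
    that directly contains a file named target_name."""
    tree = []
    for dirpath, dirnames, filenames in os.walk(working_dir):
        if target_name in filenames:
            # dirpath is an ordinary str; Path splits it for the current OS.
            tree.append(Path(dirpath).parts)
    return tree


# Throwaway directory tree to try the function on (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    sub = Path(tmp) / "mlmlml" / "15xxx"
    sub.mkdir(parents=True)
    (sub / "target.txt").write_text("demo")
    found = find_parts(tmp, "target.txt")

print(found[0][-2:])  # ('mlmlml', '15xxx')
```

The folder named 15xxx causes no trouble here because its name never passes through a string literal with a single backslash in front of it.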
#6
(Nov-13-2025, 02:01 PM)paul18fr Wrote: Thanks Axel

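Define line as a raw string literal (r'...'). The r'{}'.format(arg) call below is harmless but has no effect, because arg is already a str; it is the r prefix on line that fixes the output: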

import re

line = r'C:\XXXX\YYYY/./OOOOO\mlmlml\15xxx\TTTT'
myList = [line, line]
for arg in myList:
    reg = re.split('/|\\\\', r'{}'.format(arg))
    print(reg)
Output:
['C:', 'XXXX', 'YYYY', '.', 'OOOOO', 'mlmlml', '15xxx', 'TTTT']
['C:', 'XXXX', 'YYYY', '.', 'OOOOO', 'mlmlml', '15xxx', 'TTTT']
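A quick check of what r'{}'.format() actually does may help here. The sketch below uses a shortened version of the test path (the variable names are only for illustration) and suggests that formatting cannot bring back a backslash that the literal already lost; only the r prefix on the original literal makes the difference.

```python
# Normal literal: \15 has already been turned into a carriage return.
broken = 'mlmlml\15xxx'
# Raw literal: the backslash and the digits are kept as written.
intact = r'mlmlml\15xxx'

# The r prefix applies to the two-character literal '{}' only,
# so format() hands back the argument unchanged.
formatted = r'{}'.format(broken)

print(formatted == broken)   # True
print('\\' in formatted)     # False, the backslash is still gone
print('\\' in intact)        # True
```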
#7
Python does not have "raw" strings; Python programs may have raw string literals. When Python parses:
line = "C:\XXXX\YYYY/./OOOOO\mlmlml\15xxx\TTTT"
It creates a str with any escape sequences in the string literal converted to characters. In your example, \15 is interpreted as an octal escape sequence, so it is replaced with ASCII character 13 (octal 15), the non-printable Carriage Return. It shows up as \r in your output only because printing a list shows the repr() of each element. There is no backslash and no "r" in line, just a single Carriage Return character that repr() displays as "\r".
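To see this for yourself, a short sketch along these lines compares the normal and the raw literal for the problematic folder name (the variable names plain and raw are just for the example):

```python
# In a normal literal, \15 is an octal escape: one character, code point 13.
plain = 'mlmlml\15xxx'
# With the r prefix the backslash, the 1 and the 5 are three separate characters.
raw = r'mlmlml\15xxx'

print(len(plain), len(raw))            # 10 12
print(ord(plain[6]), plain[6] == '\r') # 13 True
print(repr(plain))                     # 'mlmlml\rxxx'
print(repr(raw))                       # 'mlmlml\\15xxx'
```

The doubled backslash in the last line is again only how repr() displays a single backslash.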

The only time you need to worry about raw is when typing string literals into a program. A backslash in a file path returned by os.walk is just a backslash. To hopefully make this clear, I modified your program to create a str that contains single backslashes.
import re

line = "C:\\XXXX\\YYYY/./OOOOO\\mlmlml\\15xxx\\TTTT"
print(line)
reg = re.split("/|\\\\", line)
print(reg)
Output:
C:\XXXX\YYYY/./OOOOO\mlmlml\15xxx\TTTT
['C:', 'XXXX', 'YYYY', '.', 'OOOOO', 'mlmlml', '15xxx', 'TTTT']
You can see in the output that the '\\' in the string literal are converted to '\' in line. Split works as expected, splitting line at each / or \.
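The same holds for any string that arrives at runtime (from os.walk, a file, user input), since no literal parsing is involved. As a rough sketch, the path below is assembled with chr(92) so that the source contains no backslash-bearing path literal at all; the pattern is written as a raw literal, r'[/\\]', which is an equivalent and slightly easier to read form of '/|\\\\'.

```python
import re

# chr(92) is a single backslash; joining with it mimics a path received at runtime.
sep = chr(92)
path = sep.join(['C:', 'XXXX', 'YYYY/./OOOOO', 'mlmlml', '15xxx', 'TTTT'])
print(path)   # C:\XXXX\YYYY/./OOOOO\mlmlml\15xxx\TTTT

# Character class matching either a forward slash or a backslash.
parts = re.split(r'[/\\]', path)
print(parts)
# ['C:', 'XXXX', 'YYYY', '.', 'OOOOO', 'mlmlml', '15xxx', 'TTTT']
```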
#8
Moral of the story is: Do not create paths that have escape sequences in them. That is asking for trouble!

import re

line = 'C:\XXXX\YYYY/./OOOOO\mlmlml\15xxx\TTTT'  # \15 is parsed as an octal escape (carriage return)
# Manually doubling the backslashes works, as deanhystad showed, but you don't want to do that for many paths.
# We want to alter the string automatically, but that is très difficile!
line = "C:\\XXXX\\YYYY/./OOOOO\\mlmlml\\15xxx\\TTTT"

# Confucius say: "Wise geek not make path with escape sequence in."
# Put anything in [] that might be found in a file or folder name.
f = re.compile(r'[\w\.-]+')
# line_good = 'C:\XXXX\YYYY/./OOOOO\mlmlml\xxx15\TTTT'  # still a problem: \x starts a hex escape
# SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 27-28: truncated \xXX escape
line_better = 'C:\XXXX\YYYY/./OOOOO\mlmlml\yxx15\TTTT'  # no problem: \y is not an escape sequence

res = f.findall(line_better)
print(res)
Gives:

Output:
['C', 'XXXX', 'YYYY', '.', 'OOOOO', 'mlmlml', 'yxx15', 'TTTT']
#9
Thanks all for the feedback and the explanations that helped me figure out the escape problem.

However, keep in mind that the code I provided is a (very basic) test case I made to reproduce the issue; as I said, the paths come from os.walk and I have no control over them. The best solution is to use pathlib, as mentioned by noisefloor.
#10
I don't know why my post is gone, but you want to use pathlib.
Often regex is not the best choice to solve a problem. Sometimes there are already well-tested, mature functions or classes that do the same job.