Python Forum
[SOLVED] Open file, and insert space in string?
#1
Hello,

Before I look deeper, is there a simple way to loop through each line of a text file and insert a space inside a string?

I need to perform that task, so that URLs are no longer "glued" to the preceding word.

Thank you.

import re
from glob import glob

# Sample of the text to fix: blah<a href
pattern = re.compile(r'(\w)<a href')
for file in glob("*.html"):
	with open(file, 'r', encoding="utf-8") as f:
		for line in f:
			m = pattern.search(line)
			if m:
				#print("Found:",m.group(1))
				print(line)
---
Edit: Apparently, you can't edit a file in memory and just write it back to disk. So I guess the solution is to loop through each line, find all the occurrences with re.finditer(), build a new line with an extra space where needed, and write the output to a new file.

pattern = re.compile("([A-Za-z,])<a href")
with open(file, 'r', encoding="utf-8") as f:
	for line in f:
		for match in pattern.finditer(line):
			print(match.start(), match.end(), match.group())
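One way to finish the idea described in the edit above, as a rough sketch only: take the positions reported by finditer(), rebuild each line with a space inserted before every glued <a href, and write the result to a separate file. The names insert_spaces and fix_file are made up for illustration.

```python
import re

# Same character class as above; group 1 is the character glued to the link.
pattern = re.compile(r"([A-Za-z,])<a href")

def insert_spaces(line):
    """Return line with a space inserted before each glued '<a href'."""
    pieces = []
    last = 0
    for match in pattern.finditer(line):
        cut = match.end(1)  # position right after the glued character
        pieces.append(line[last:cut])
        pieces.append(" ")
        last = cut
    pieces.append(line[last:])
    return "".join(pieces)

def fix_file(src, dst):
    # Hypothetical helper: read src line by line, write the fixed lines to dst.
    with open(src, "r", encoding="utf-8") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(insert_spaces(line))
```

Lines without a match pass through unchanged, so the output file keeps every line of the input.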
---
Edit #KISS

#apparently, doesn't support \w to match any alphabetic character
cat input.html | sed -r "s@([A-Za-z,])<a href@\1 <a href@g" > output.html
#2
Hope I understood you correctly!

Maybe you have more than 1 example of XXX<a href in a line of text, so this should cater for that.

This replaces all instances of \w+<a href with whatever \w+ is + space + <a href

I think it can be done better, but I am still thinking about that!

import re

html_file = '/home/pedro/temp/ahref.html'
savename = '/home/pedro/temp/spaced_ahref.html'
e = re.compile(r'(\w+)(<a href)')

# 'w' rather than 'a', so running the script twice does not append duplicate output
with open(html_file, 'r', encoding='utf-8') as html, \
     open(savename, 'w', encoding='utf-8') as outfile:
    for line in html:
        # sub() replaces every match in the line; each line is written exactly
        # once, including lines that contain no match
        outfile.write(e.sub(r'\1 \2', line))
savename looks like:

Output:
blabla <a href="https://www.w3schools.com">Visit W3Schools</a> blabla <a href="https://www.w3schools.com"><img border="0" alt="W3Schools" src="logo_w3s.gif" width="100" height="100"></a> blabla <a href="mailto:[email protected]">Send email</a> blabla <a href="tel:+4733378901">+47 333 78 901</a> blabla <a href="sms:+4733378901?body=Please%20contact%20me.">Send a SMS</a> blabla <a href="#section2">Go to Section 2</a> blabla <a href="javascript:alert('Hello World!');">Execute JavaScript</a>
#3
You could perhaps avoid potato code by reading the whole HTML file into memory and using re.sub()
from pathlib import Path
import re

def prepend_space(match):
    return ' ' + match.group(0)

pattern = re.compile("(?<=[A-Za-z,])<a href")
s = Path(file).read_text()
Path(file).write_text(pattern.sub(prepend_space, s))
« We can solve any problem by introducing an extra level of indirection »
#4
Thanks much!

#BAD: "from<a href" turned into "fro m<a href"
pattern = re.compile("([A-Za-z,])<a href")
#GOOD
#?<= is "positive lookbehind"
pattern = re.compile("(?<=[A-Za-z,])<a href")

file = "test.html" 

#UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 79083: character maps to <undefined>
#s = Path(file).read_text()
s = Path(file).read_text(encoding='UTF-8')
Path(file).write_text(pattern.sub(prepend_space, s))
I have a couple of questions:
1. Without the "positive lookbehind", I get "fro m<a href" instead of "from <a href": Why is that?
2. Why does prepend_space() work even though it's called without the "match" parameter? How does Python know? Is it because re.sub() looks for "match" by default?
#5
(May-28-2025, 05:36 AM)Winfried Wrote: 1. Without the "positive lookbehind", I get "fro m<a href" instead of "from <a href": Why is that?
Suppose that the string is spam<a href ...
  • Without the lookbehind assertion, the substring matching the pattern is m<a href ...
  • With the lookbehind assertion, the substring matching the pattern is <a href ...
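The difference can be checked directly. This is a small sketch using the "from<a href" case from the question and the prepend_space() function from post #3; the sample string and variable names are only for illustration.

```python
import re

def prepend_space(match):
    # group(0) is the whole matched substring
    return " " + match.group(0)

text = 'from<a href="page.html">'

capturing = re.compile(r"([A-Za-z,])<a href")
lookbehind = re.compile(r"(?<=[A-Za-z,])<a href")

# The capturing version consumes the letter, so it is part of group(0)...
matched_with_group = capturing.search(text).group(0)
# ...while the lookbehind only checks the letter without consuming it.
matched_with_lookbehind = lookbehind.search(text).group(0)

result_with_group = capturing.sub(prepend_space, text)
result_with_lookbehind = lookbehind.sub(prepend_space, text)

print(repr(matched_with_group))       # 'm<a href'
print(repr(matched_with_lookbehind))  # '<a href'
print(result_with_group)              # fro m<a href="page.html">
print(result_with_lookbehind)         # from <a href="page.html">
```

Since the space is prepended to the whole match, the capturing pattern puts it in front of the "m". With that pattern, a replacement string such as r"\1 <a href" would be needed instead.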
(May-28-2025, 05:36 AM)Winfried Wrote: s = Path(file).read_text(encoding='UTF-8')
You could perhaps use the encoding='UTF-8' also in write_text(...)
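A likely explanation for the UnicodeDecodeError, offered as an assumption rather than a diagnosis: without an encoding argument, read_text() falls back to the platform default, which on Windows is often cp1252, and byte 0x9d has no mapping there. That byte occurs, for instance, in the UTF-8 encoding of the right curly quote. The sketch below reproduces the failure on bytes in memory, then does a round trip through a temporary file with the encoding stated on both sides.

```python
import tempfile
from pathlib import Path

curly = "\u201d"             # right double quotation mark
raw = curly.encode("utf-8")  # b'\xe2\x80\x9d'

try:
    raw.decode("cp1252")     # assumption: cp1252 was the default encoding
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True     # 0x9d is undefined in cp1252

# Passing the same encoding to both calls avoids relying on the platform default.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "test.html"
    path.write_text("say " + curly, encoding="utf-8")
    round_trip = path.read_text(encoding="utf-8")
```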
(May-28-2025, 05:36 AM)Winfried Wrote: Why does prepend_space() work even though it's called without the "match" parameter?
It should not. It could be because you have a global variable named match and the function uses this global (or nonlocal) variable because it does not have a local match variable. Can you post a complete example?
#6
Ah, it's because the parentheses in the lookbehind aren't there to capture the character; the syntax simply requires them. Makes sense.

Here's the code, where I don't understand how prepend_space() finds the match variable even though it's not called with that parameter:

from pathlib import Path
import re
 
def prepend_space(match):
    return ' ' + match.group(0)

#BAD: "from<a href" turned into "fro m<a href"
pattern = re.compile("([A-Za-z,])<a href")
#GOOD
#?<= is "positive lookbehind"
pattern = re.compile("(?<=[A-Za-z,])<a href")

file = "test.html" 

s = Path(file).read_text(encoding='UTF-8')
#HERE: Is prepend_space called with the match silently by re, hence the lack of parameter?
Path(file).write_text(pattern.sub(prepend_space, s),encoding='UTF-8')
--
Edit: Yup, it looks like re implicitly calls the function with "match" even if it's not mentioned in the call:
"If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single Match argument, and returns the replacement string." (source: the re.sub() entry in the Python documentation)
#7
(May-28-2025, 07:06 AM)Winfried Wrote: I don't understand how prepend_space() finds the match variable even though it's not called with that parameter
It is called internally. For every match of the pattern in the string, the .sub() method calls prepend_space() with the match object as its argument. You can check this by adding a print() call inside prepend_space(): you will see output each time it is called.
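A minimal way to see this in action, assuming a made-up sample string: the callback below records each Match object it receives, so the calls made by .sub() become visible. The calls list is only there for illustration.

```python
import re

calls = []  # records what sub() passes to the callback

def prepend_space(match):
    # sub() supplies the Match object; the script never calls this directly
    calls.append((match.start(), match.group(0)))
    return " " + match.group(0)

pattern = re.compile(r"(?<=[A-Za-z,])<a href")
text = 'one<a href="1">x</a> two,<a href="2">y</a>'

# The function object itself is passed, without parentheses
result = pattern.sub(prepend_space, text)

print(calls)   # one entry per match found by sub()
print(result)
```

Passing prepend_space without parentheses hands the function itself to .sub(), which then calls it once per match.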
#8
Some kind of magic :-)

Thanks very much.