Posts: 265
Threads: 117
Joined: Aug 2018
May-27-2025, 04:45 PM
(This post was last modified: May-27-2025, 04:45 PM by Winfried.)
Hello,
Before I look deeper, is there a simple way to loop through each line of a text file and insert a space inside a string?
I need to perform that task, so that URLs are no longer "glued" to the preceding word.
Thank you.
import re
from glob import glob

#blah<a href
pattern = re.compile(r'(\w)<a href')
for file in glob("*.html"):
    with open(file, 'r', encoding="utf-8") as f:
        for line in f:
            m = pattern.search(line)
            if m:
                #print("Found:",m.group(1))
                print(line)
---
Edit: Apparently, you can't edit a file in memory and just write it back to disk. So I guess the solution is to loop through each line, find all the occurrences through re.finditer(), create a new line with an extra space where it fits, and write the output to a new file.
pattern = re.compile("([A-Za-z,])<a href")
with open(file, 'r', encoding="utf-8") as f:
    for line in f:
        for match in pattern.finditer(line):
            print(match.start(), match.end(), match.group())
---
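The line-by-line approach described in the edit above could be sketched roughly as follows. This is only an illustration under assumptions: the helper names add_spaces and copy_with_spaces are made up for this example, and the pattern is the one from the post.

```python
import re

# Same pattern as in the post: a letter or comma glued to "<a href"
pattern = re.compile(r"([A-Za-z,])<a href")


def add_spaces(line: str) -> str:
    """Rebuild the line, inserting a space before each glued '<a href'.

    Hypothetical helper, not part of the original post.
    """
    parts = []
    last = 0
    for match in pattern.finditer(line):
        # match.end(1) is the index just after the captured character,
        # which is exactly where "<a href" begins
        cut = match.end(1)
        parts.append(line[last:cut])
        parts.append(" ")
        last = cut
    parts.append(line[last:])
    return "".join(parts)


def copy_with_spaces(src: str, dst: str) -> None:
    """Read src line by line and write the corrected lines to dst."""
    with open(src, "r", encoding="utf-8") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(add_spaces(line))


sample = 'from<a href="x">x</a> and,<a href="y">y</a>'
fixed = add_spaces(sample)
print(fixed)  # from <a href="x">x</a> and, <a href="y">y</a>
```

Lines without a match pass through unchanged, so the output file keeps every line of the input.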
Edit #KISS
#apparently, doesn't support \w to match any alphabetic character
cat input.html | sed -r "s@([A-Za-z,])<a href@\1 <a href@g" > output.html
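As a side note on the two character classes used so far: in Python's re module, \w is broader than "any alphabetic character". For str patterns it also matches digits, the underscore, and non-ASCII letters, but not the comma. A small check, using arbitrarily chosen sample characters:

```python
import re

word_char = re.compile(r"\w")
letters_or_comma = re.compile(r"[A-Za-z,]")

# Compare what each class accepts for a few sample characters
for ch in ["m", "7", "_", "\u00e9", ",", ">"]:
    print(ascii(ch),
          bool(word_char.fullmatch(ch)),
          bool(letters_or_comma.fullmatch(ch)))
```

So the two patterns in this thread are not interchangeable: [A-Za-z,] catches a comma glued to a link, while \w catches digits and accented letters.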
Posts: 1,301
Threads: 151
Joined: Jul 2017
Hope I understood you correctly!
Maybe you have more than 1 example of XXX<a href in a line of text, so this should cater for that.
This replaces all instances of \w+<a href with whatever \w+ is + space + <a href
I think it can be done better, but I am still thinking about that!
import regex

html_file = '/home/pedro/temp/ahref.html'
savename = '/home/pedro/temp/spaced_ahref.html'
e = regex.compile(r'(\w+)(<a href)')
# 'w' rather than 'a', so that rerunning the script does not append duplicates
with open(html_file, 'r') as html, open(savename, 'w') as outfile:
    for line in html:
        # Put a space between the word and every "<a href" on this line;
        # lines without a match are written unchanged
        outfile.write(e.sub(r'\1 \2', line))

savename looks like:
Output:
blabla <a href="https://www.w3schools.com">Visit W3Schools</a>
blabla <a href="https://www.w3schools.com"><img border="0" alt="W3Schools" src="logo_w3s.gif" width="100" height="100"></a>
blabla <a href="mailto:[email protected]">Send email</a>
blabla <a href="tel:+4733378901">+47 333 78 901</a>
blabla <a href="sms:+4733378901?body=Please%20contact%20me.">Send a SMS</a>
blabla <a href="#section2">Go to Section 2</a>
blabla <a href="javascript:alert('Hello World!');">Execute JavaScript</a>
Posts: 4,904
Threads: 79
Joined: Jan 2018
You could perhaps avoid potato code by reading the whole html file in memory and using re.sub()
from pathlib import Path
import re

def prepend_space(match):
    return ' ' + match.group(0)

pattern = re.compile("(?<=[A-Za-z,])<a href")
# file is the path of the HTML file to fix
s = Path(file).read_text()
Path(file).write_text(pattern.sub(prepend_space, s))
« We can solve any problem by introducing an extra level of indirection »
Posts: 265
Threads: 117
Joined: Aug 2018
May-28-2025, 05:36 AM
(This post was last modified: May-28-2025, 05:36 AM by Winfried.)
Thanks much!
#BAD: "from<a href" turned into "fro m<a href"
pattern = re.compile("([A-Za-z,])<a href")
#GOOD
#?<= is "positive lookbehind"
pattern = re.compile("(?<=[A-Za-z,])<a href")
file = "test.html"
#UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 79083: character maps to <undefined>
#s = Path(file).read_text()
s = Path(file).read_text(encoding='UTF-8')
Path(file).write_text(pattern.sub(prepend_space, s))
I have a couple of questions:
1. Without the "positive lookbehind", I get "fro m<a href" instead of "from <a href": Why is that?
2. Why does prepend_space() work even though it's called without the "match" parameter? How does Python know? Is it because re.sub() looks for "match" by default?
Posts: 4,904
Threads: 79
Joined: Jan 2018
May-28-2025, 06:23 AM
(This post was last modified: May-28-2025, 06:25 AM by Gribouillis.)
(May-28-2025, 05:36 AM)Winfried Wrote: 1. Without the "positive lookbehind", I get "fro m<a href" instead of "from <a href": Why is that?
Suppose that the string is spam<a href ...
- Without the lookbehind assertion, the substring matching the pattern is
m<a href ...
so the space is prepended to the m and the word gets split.
- With the lookbehind assertion, the substring matching the pattern is
<a href ...
so the space lands right before the tag.
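The difference can be checked directly on the spam example. This is a small sketch reusing the two patterns and the prepend_space() function from the thread; the sample text and variable names are made up for the demonstration.

```python
import re


def prepend_space(match):
    # group(0) is the whole matched substring, not just the captured group
    return " " + match.group(0)


text = 'spam<a href="x">'

capturing = re.compile(r"([A-Za-z,])<a href")
lookbehind = re.compile(r"(?<=[A-Za-z,])<a href")

# What each pattern actually matches
matched_capturing = capturing.search(text).group(0)
matched_lookbehind = lookbehind.search(text).group(0)
print(matched_capturing)   # m<a href
print(matched_lookbehind)  # <a href

# What the substitution does with that match
result_capturing = capturing.sub(prepend_space, text)
result_lookbehind = lookbehind.sub(prepend_space, text)
print(result_capturing)   # spa m<a href="x">
print(result_lookbehind)  # spam <a href="x">

# Alternative that keeps the capturing group: put the letter back with \1
result_backref = capturing.sub(r"\1 <a href", text)
print(result_backref)  # spam <a href="x">
```

The last line shows that the capturing pattern also works, as long as the replacement reinserts the captured character before the space.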
(May-28-2025, 05:36 AM)Winfried Wrote: s = Path(file).read_text(encoding='UTF-8')
You could perhaps also pass encoding='UTF-8' to write_text(...)
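A minimal round-trip sketch with the encoding given on both sides, assuming a throwaway file in a temporary directory (the path and sample text are invented for the example). The sample contains a right double quotation mark (U+201D), whose UTF-8 encoding ends in byte 0x9d; that is one plausible source of the 'charmap' error quoted above, though the actual file may differ.

```python
import re
import tempfile
from pathlib import Path

pattern = re.compile(r"(?<=[A-Za-z,])<a href")

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "test.html"  # hypothetical throwaway file
    original = 'caf\u00e9 \u201dquote\u201d from<a href="x">x</a>\n'
    path.write_text(original, encoding="utf-8")

    # Read and write with the same explicit encoding, so the result does
    # not depend on the platform's default codec
    s = path.read_text(encoding="utf-8")
    path.write_text(pattern.sub(" <a href", s), encoding="utf-8")
    result = path.read_text(encoding="utf-8")

print(ascii(result))
```

Since the lookbehind keeps the letter out of the match, a plain replacement string is enough here and no callback is needed.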
(May-28-2025, 05:36 AM)Winfried Wrote: Why does prepend_space() work even though it's called without the "match" parameter?
It should not. It could be because you have a global variable named match and the function uses this global (or nonlocal) variable because it does not have a local match variable. Can you post a complete example?
Posts: 265
Threads: 117
Joined: Aug 2018
May-28-2025, 07:06 AM
(This post was last modified: May-28-2025, 07:55 AM by Winfried.)
Ah, it's because the brackets in the pattern aren't used to keep the character in memory but simply because the syntax requires them. Makes sense.
Here's the code, where I don't understand how prepend_space() finds the match variable even though it's not called with that parameter:
from pathlib import Path
import re

def prepend_space(match):
    return ' ' + match.group(0)

#BAD: "from<a href" turned into "fro m<a href"
pattern = re.compile("([A-Za-z,])<a href")
#GOOD
#?<= is "positive lookbehind"
pattern = re.compile("(?<=[A-Za-z,])<a href")
file = "test.html"
s = Path(file).read_text(encoding='UTF-8')
#HERE: Is prepend_space called with the match silently by re, hence the lack of parameter?
Path(file).write_text(pattern.sub(prepend_space, s),encoding='UTF-8')
--
Edit: Yup, it looks like re implicitly calls the function with "match" even if it's not mentioned in the call:
If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single Match argument, and returns the replacement string. (source)
Posts: 4,904
Threads: 79
Joined: Jan 2018
(May-28-2025, 07:06 AM)Winfried Wrote: I don't understand how prepend_space() finds the match variable even though it's not called with that parameter
It is called internally. For every match of the pattern in the string, the .sub() method calls prepend_space() with this parameter. You could check this by adding a print() call in prepend_space(): you would see a line of output every time it is called.
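A quick way to watch this happen, as a sketch with a made-up sample string and a calls list added only for tracing:

```python
import re

calls = []  # records every call made by .sub(), for illustration only


def prepend_space(match):
    # .sub() calls this function itself, passing one Match object
    # per occurrence of the pattern
    calls.append((match.start(), match.group(0)))
    print("called with", match)
    return " " + match.group(0)


pattern = re.compile(r"(?<=[A-Za-z,])<a href")
text = 'one<a href="1">1</a>, two<a href="2">2</a>'

# The function object is passed as-is: no parentheses, no argument.
# Writing prepend_space() here would call it immediately and fail.
result = pattern.sub(prepend_space, text)

print(result)      # one <a href="1">1</a>, two <a href="2">2</a>
print(len(calls))  # 2
```

Passing prepend_space without parentheses hands the function itself to .sub(), which is why no argument appears in the user's code.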
Posts: 265
Threads: 117
Joined: Aug 2018
Some kind of magic :-)
Thanks very much.