Posts: 1,300
Threads: 151
Joined: Jul 2017
Recently there was a thread about opening enormous files (> 300GB) and processing them line by line. People often want to do this.
My question is: when we open a file in write mode or append mode, is the whole file loaded into memory, or do we just get a pointer to the last line of the file?
If I do this, the whole file will be loaded:
import sys
test_file = '/home/peterr/temp/bits&bytes.txt'
with open(test_file, 'r') as infile:
    print(f'infile size = {sys.getsizeof(infile)}')
    data = infile.read()
    print(f'data size = {sys.getsizeof(data)}')
If you open the file with a generator, you get 1 line at a time, and the generator uses hardly any memory, 232 bytes.
I think, if you open a file in append mode, what you actually get is a pointer to the last line of the file. I think you don't load the whole file into memory.
Not many people have northwards of 350GB RAM in their laptop. You may not even have 350GB of free drive space. It could be that the file has only 1 big line. I know the file can be loaded in chunks and we can process the chunks.
I tried the following and neither infile nor outfile are very big, just 216 bytes.
import sys
test_file = '/home/peterr/temp/bits&bytes.txt'
# open a file append and add lines, 100 at a time
for j in range(5):
    # outfile size is always 216
    with open(test_file, 'a') as outfile:
        print(f'outfile size = {sys.getsizeof(outfile)}')
        for i in range(1, 101):
            outfile.write(f'Line {i}: Hello Binary World!\n')
        print(f'outfile size = {sys.getsizeof(outfile)}')
    # infile size is always 216
    with open(test_file, 'r') as infile:
        print(f'infile size = {sys.getsizeof(infile)}')
        # data size obviously gets bigger each time around
        data = infile.read()
        print(f'data size = {sys.getsizeof(data)}')
So I think, if I did this, I would never overload RAM. Assuming I have enough drive space to accommodate another 350GB file, will this work without overloading RAM? Later on delete the original file if it is no longer needed.
import sys

def gen_loader(test_file):
    with open(test_file, 'r') as infile:
        for line in infile:
            yield line

test_file = '/home/peterr/temp/bits&bytes.txt'  # big daddy of a file
result_file = '/home/peterr/temp/result_file.txt'  # to begin with result_file does not exist
with open(result_file, 'a') as res:
    for line in gen_loader(test_file):
        # process line somehow
        newline = 'XYZ' + line
        res.write(newline)
    print(f'res size = {sys.getsizeof(res)}')  # 216
Have I understood the technicalities of file input and output correctly?
Posts: 1,602
Threads: 3
Joined: Mar 2020
(Jan-11-2026, 12:40 AM)Pedroski55 Wrote: My question is: when we open a file in write mode or append mode, is the whole file loaded into memory, or do we just get a pointer to the last line of the file?
Simply open()ing a file doesn't load any part of the data of the file into memory, whether it's opened in read, write, or append mode. You have to somehow cause a read() before the data is read.
If you don't read the whole thing into memory, then it's not stored in memory. You could iterate over the file line-by-line with a generator or several kinds of loops and the total size of the file doesn't matter to RAM.
Likewise, if you're just writing line-by-line in a loop (and the lines are reasonably sized), the whole file is not stored in RAM. Constantly closing and re-opening the file in append mode doesn't help.
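As a rough illustration of the point above, here is a minimal sketch that opens the input and the output once each and streams one line at a time. The names `transform` and `stream_copy` are made up for this example, and the 'XYZ' prefix simply mirrors the processing step from the original question.

```python
import os
import tempfile


def transform(line: str) -> str:
    """Hypothetical per-line processing step; replace with real logic."""
    return "XYZ" + line


def stream_copy(src_path: str, dst_path: str) -> int:
    """Copy src to dst line by line and return the number of lines written."""
    count = 0
    # Both files are opened exactly once. Only the current line (plus the
    # small internal I/O buffers) is held in memory at any moment.
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(transform(line))
            count += 1
    return count


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "input.txt")
        dst = os.path.join(tmp, "result.txt")
        with open(src, "w", encoding="utf-8") as f:
            f.writelines(f"Line {i}\n" for i in range(1, 1001))
        print(stream_copy(src, dst))  # number of lines copied
```

Memory use here depends on the length of the longest line, not on the size of the file, which is why the "reasonably sized lines" caveat matters.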
(Jan-11-2026, 12:40 AM)Pedroski55 Wrote: I tried the following and neither infile nor outfile are very big, just 216 bytes.
infile and outfile are simply filehandle objects. They will not increase in size. You can't estimate the amount of RAM that is being used by looking at those variables.
If instead you did:
infile = open(file)
data = infile.read()
Then infile will always be the same size, but data could be huge if the file is large.
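To make the difference between the file handle and the data concrete, here is a small sketch that compares `sys.getsizeof` for both on a small and a larger file. The helper name `handle_and_data_sizes` and the sample file names are invented for this example, and the exact byte counts depend on the Python version and platform.

```python
import os
import sys
import tempfile


def handle_and_data_sizes(path: str) -> tuple[int, int]:
    """Return (size of the file object, size of the fully read contents)."""
    with open(path, "r", encoding="utf-8") as infile:
        handle_size = sys.getsizeof(infile)  # the wrapper object only
        data = infile.read()                 # this is what costs memory
    return handle_size, sys.getsizeof(data)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for n_lines in (10, 10_000):
            path = os.path.join(tmp, f"sample_{n_lines}.txt")
            with open(path, "w", encoding="utf-8") as f:
                f.writelines("Hello Binary World!\n" for _ in range(n_lines))
            # The first number stays the same, the second grows with the file.
            print(n_lines, handle_and_data_sizes(path))
```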
Posts: 2,198
Threads: 12
Joined: May 2017
def gen_loader(test_file):
    with open(test_file, 'r') as infile:
        for line in infile:
            yield line
The generator function only needs RAM for one line per iteration.
My first thought with something like this is always how it can be exploited.
To provoke such a problem, you can create a 100 GiB file that is empty. I have not tested this on Windows. On Linux the disk space is not actually occupied; the file is sparse, so only its size is recorded. That is, no bytes are written, but when you read the file you get nothing but null bytes.
I once wrote a test program and used ulimit to limit Python to 4 GiB of RAM.
This is the result:
Error:
[deadeye@nexus ~]$ ulimit -Sv 4194304 ; python mm.py
empty.bin created
empty.bin deleted
Traceback (most recent call last):
  File "/home/deadeye/mm.py", line 34, in <module>
    for line in line_printer():
                ~~~~~~~~~~~~^^
  File "/home/deadeye/mm.py", line 29, in line_printer
    for line in fd:  # <- MemoryError is raised here
                ^^
MemoryError
Code:
import os
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def create_file() -> Path:
    file = Path("empty.bin")
    file.unlink(missing_ok=True)
    file.touch()
    # 100 GiB of 0-bytes
    # only tested on Linux
    os.truncate(file, 100 * 1024**3)
    print(f"{file} created")
    try:
        yield file
    finally:
        file.unlink()
        print(f"{file} deleted")


def line_printer():
    with create_file() as file:
        with file.open("rb") as fd:
            for line in fd:  # <- MemoryError is raised here
                yield line


if __name__ == "__main__":
    for line in line_printer():
        pass
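The MemoryError arises because a file of null bytes contains no newline, so the first "line" is the entire file. One way around that, sketched here on a much smaller file, is to read fixed-size chunks instead of lines. The helper `read_in_chunks` and the 1 MiB default chunk size are assumptions for this example, not part of the test program above.

```python
import os
import tempfile
from collections.abc import Iterator


def read_in_chunks(path: str, chunk_size: int = 1024 * 1024) -> Iterator[bytes]:
    """Yield successive blocks of at most chunk_size bytes from a binary file."""
    with open(path, "rb") as fd:
        # iter(callable, sentinel) keeps calling fd.read(chunk_size)
        # until it returns b"" at end of file.
        yield from iter(lambda: fd.read(chunk_size), b"")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "empty.bin")
        with open(path, "wb"):
            pass
        os.truncate(path, 8 * 1024 * 1024)  # 8 MiB of null bytes, no newline
        total = sum(len(chunk) for chunk in read_in_chunks(path))
        print(total)  # total number of bytes read, one chunk at a time
```

With this approach the memory needed per step is bounded by `chunk_size`, regardless of whether the file contains any line breaks.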
Posts: 232
Threads: 0
Joined: Jun 2019
Hi,
open(filename) does exactly what the function name implies: it opens the file and returns a file object. See the Python documentation on the open function. No file content is read at this point.
The file object offers various methods for working with the file, such as reading from it. file_object.read() and file_object.readlines() read the complete file into memory; file_object.readline() reads a single line, that is, it reads until a newline character is reached; for line in file_object: iterates over the file line by line; file_object.read(size) reads a portion of the file defined by size. According to Python's documentation, trying to read files twice as big as the available memory can cause problems (for whatever reason - feel free to investigate yourself :-) ). For details and more methods of file objects, see Python's documentation.
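The methods listed above can be compared side by side on a tiny file. This is only a sketch: the function name `demo_read_methods` and the sample content are invented for illustration.

```python
import os
import tempfile


def demo_read_methods(path: str) -> dict[str, object]:
    """Collect what each read method returns for the same small file."""
    results: dict[str, object] = {}
    with open(path, "r", encoding="utf-8") as f:
        results["read(4)"] = f.read(4)               # at most 4 characters
        results["readline()"] = f.readline()         # rest of the current line
        results["iteration"] = [line for line in f]  # remaining lines, one by one
    with open(path, "r", encoding="utf-8") as f:
        results["readlines()"] = f.readlines()       # every line, all in memory
    with open(path, "r", encoding="utf-8") as f:
        results["read()"] = f.read()                 # whole file as one string
    return results


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write("alpha\nbeta\ngamma\n")
        for name, value in demo_read_methods(path).items():
            print(f"{name:>12}: {value!r}")
```

Only `read()` and `readlines()` without a size argument scale with the file size; the other calls are bounded by the requested size or by the length of a line.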
Regards, noisefloor
Posts: 1,300
Threads: 151
Joined: Jul 2017
Thanks for the tips.
Quote:The generator function only needs RAM for one line per iteration.
My first thought with something like this is always how it can be exploited.
Yes, exactly. What is needed is a kind of reverse generator that then writes the processed lines one by one to the output file. I think that is what append does.
The generator is not a memory problem, as long as the lines are not enormously long. We could use readlines() in the generator.
Advice from linuxquestions.org:
Quote:We used to open text file in append mode in C language like this:
Quote:file_ptr = fopen("example.txt", "a");
Quote:Opening a file in append mode will not load the whole file into memory.
What I am wondering is, how much memory will appending a line to a file use? As I understand it, appending does not load the entire file into memory. I think it just adds a line at the end of the file. If that is correct, then reading a very big file line by line and saving that line to another file will never overtax the computer's RAM. That could fill up the computer's hard drive, if the files are >300GB as recently mentioned, but that will not use much RAM. So don't try that unless you have a lot of spare hard drive! (I think we should keep the old file, at least until we are happy with the outcome.)
But I'm thinking, if we read a big file with a generator by lines, process each line, then append that line to an output file, then close the output file, at least RAM will never be overtaxed.
However, I don't know anything about the technical side of computer memory and write() operations! And I don't have 300GB spare to try this!
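You don't need 300GB to try this: the effect can be observed on a scaled-down file. The following sketch uses the standard `tracemalloc` module to compare the peak allocation of a line-by-line copy with that of reading everything at once. The function names are made up for this example, and `tracemalloc` only sees memory allocated through Python's own allocators, not the operating system's file cache, so treat the numbers as indicative only.

```python
import os
import tempfile
import tracemalloc


def peak_bytes_streaming(src_path: str, dst_path: str) -> int:
    """Peak Python-level allocation while copying line by line in append mode."""
    tracemalloc.start()
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "a", encoding="utf-8") as dst:
        for line in src:
            dst.write("XYZ" + line)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak


def peak_bytes_read_all(src_path: str) -> int:
    """Peak Python-level allocation while reading the whole file at once."""
    tracemalloc.start()
    with open(src_path, "r", encoding="utf-8") as src:
        data = src.read()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    del data
    return peak


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "big.txt")
        dst = os.path.join(tmp, "result.txt")
        with open(src, "w", encoding="utf-8") as f:
            f.writelines(f"Line {i}: Hello Binary World!\n" for i in range(50_000))
        print("file size      :", os.path.getsize(src))
        print("streaming peak :", peak_bytes_streaming(src, dst))
        print("read-all peak  :", peak_bytes_read_all(src))
```

If the streaming peak stays roughly constant while you increase the number of lines, that supports the idea that file size alone does not drive RAM usage.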
Posts: 1,602
Threads: 3
Joined: Mar 2020
(Jan-12-2026, 12:43 AM)Pedroski55 Wrote: As I understand it, appending does not load the entire file into memory.
You're correct. Neither does writing a file (whether in append mode or not).
What eats up the memory is storing the contents of the entire file (either accidentally or on purpose).
Posts: 1,300
Threads: 151
Joined: Jul 2017
@ bowlofred
If I open the file mode='w', a new file is created. Continuous writing will eventually overload RAM, because the file is still open. That is not good here. So open mode='a'.
What I don't know is: if I open a file mode='a', I believe I get a pointer to the end of the file. (This is CPython)
Now I imagine what happens when I call:
file.write(text)
At the end of the file is an EOF character, whatever that looks like. The operating system somehow writes my text to the end of the file and then a new EOF marker. The actual contents of the file are never loaded into RAM, as I understand it.
Is that what happens?
If that is the case, then dealing with any size file by opening mode='a' will never overload RAM. Maybe it will overload your hard drive!
Posts: 232
Threads: 0
Joined: Jun 2019
Hi,
your guess about what a and w do is wrong. Again, it's all written in the Python documentation: w opens a file for writing. If the file had any content before, it is deleted and the new content is written. a appends new content to the end of the file. If the file does not exist (and only then), a new file is created. For a, Python sets a pointer to the end of the file so it knows where to start writing. Python's file objects have a tell method that returns the current position in the file; seek jumps to a position in the file. These methods are hardly used in "normal" programming unless you need to do a deep dive into the weeds of low-level operations on a file's content.
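The behaviour of w and a, and of tell and seek, can be checked with a few lines. This sketch uses binary mode so that tell() returns a plain byte offset; the function name `mode_demo` and the sample content are invented for this example.

```python
import os
import tempfile


def mode_demo(path: str) -> dict[str, object]:
    """Record file positions and contents after opening in 'w' and 'a' mode."""
    out: dict[str, object] = {}
    with open(path, "wb") as f:
        f.write(b"first\n")
    with open(path, "ab") as f:
        out["append_start"] = f.tell()  # position right after opening for append
        f.write(b"second\n")
    with open(path, "rb") as f:
        out["after_append"] = f.read()
        f.seek(0)                       # jump back to the start of the file
        out["first_line_again"] = f.readline()
    with open(path, "wb") as f:         # 'w' truncates the existing content
        out["write_start"] = f.tell()
        f.write(b"third\n")
    with open(path, "rb") as f:
        out["after_write"] = f.read()
    return out


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for key, value in mode_demo(os.path.join(tmp, "modes.bin")).items():
            print(f"{key:>17}: {value!r}")
```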
When the content is actually written to the file is probably more of an implementation detail and, in addition, depends on the operating system and possibly the file system. Linux, for example, uses a kernel-side write buffer (which can of course be disabled), so even when a program thinks the content is written, it may still be in Linux's write buffer, scheduled to be flushed to disk at a later point. As far as I know, Windows by default does not use write caching for removable drives (which is one of the reasons why it is fairly safe to pull a USB pen drive from Windows at any time when it is not being actively accessed, while it may result in data loss on Linux if the write buffer wasn't flushed by the kernel, by unmounting the device, or by flushing manually from the command line).
Regards, noisefloor
Posts: 6,981
Threads: 22
Joined: Feb 2020
Try using the buffering argument in open.
with open(test_file, 'r', buffering=X) as infile:
Set buffering to 1 for line buffering in text mode. Set buffering to 0 for unbuffered mode (this is only allowed in binary mode). Buffering defaults to -1, which picks a buffering mode based on the file mode and environment. What is your environment? Maybe your default is fully buffered.
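To see what the buffering argument actually changes, this sketch opens the same file with different settings and reports the type of object that open() returns. The helper name `buffering_modes` is made up for this example; the class names come from the standard `io` module.

```python
import os
import tempfile


def buffering_modes(path: str) -> dict[str, str]:
    """Describe the object open() returns for several buffering settings."""
    out: dict[str, str] = {}
    with open(path, "rb", buffering=0) as f:
        out["binary, buffering=0"] = type(f).__name__  # raw, unbuffered
    with open(path, "rb") as f:
        out["binary, default"] = type(f).__name__      # buffered reader
    with open(path, "r", buffering=1, encoding="utf-8") as f:
        out["text, buffering=1"] = (
            f"{type(f).__name__}, line_buffering={f.line_buffering}"
        )
    try:
        # Unbuffered I/O is not available in text mode.
        open(path, "r", buffering=0, encoding="utf-8")
    except ValueError as exc:
        out["text, buffering=0"] = f"ValueError: {exc}"
    return out


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write("Hello Binary World!\n")
        for setting, result in buffering_modes(path).items():
            print(f"{setting:>20}: {result}")
```

None of these settings loads the whole file; they only change how much is fetched from the operating system per request.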
Posts: 1,602
Threads: 3
Joined: Mar 2020
(Jan-13-2026, 01:08 AM)Pedroski55 Wrote: If I open the file mode='w', a new file is created. Continuous writing will eventually overload RAM, because the file is still open. That is not good here. So open mode='a'.
This is incorrect. The contents of the file are not stored in RAM. Each individual write is, but the buffer is reused. So writing GB of data to a file in a loop does not consume GB of RAM.
If you do the equivalent of
with open("mybigfile.txt", mode="w") as f:
    for i in range(1_000_000_000):
        f.write("some data to write")
It takes an identical amount of RAM as if you'd opened the file in append mode (or if you repeatedly re-opened the file in append mode).
(Jan-13-2026, 01:08 AM)Pedroski55 Wrote: The actual contents of the file are never loaded into RAM, as I understand it.
That is always true. RAM stores the contents of individual reads and writes, and it holds file/stream buffers to make writing more efficient. Usually these buffers are not huge compared to the size of a machine. The contents of a file are only held in RAM if you take some special step to do so like
f = open("mybigfile")
all_the_data = f.read()
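The role of the write buffer can be made visible by checking the size of the file on disk before and after a flush. This is a sketch under the assumption that the text written is smaller than Python's internal buffer and that the file is a regular file on disk; the function name `sizes_around_flush` is invented for this example.

```python
import os
import tempfile


def sizes_around_flush(path: str, text: str) -> tuple[int, int]:
    """On-disk size before and after flush() for one small write."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)                   # lands in Python's buffer first
        before = os.path.getsize(path)
        f.flush()                       # hand the buffered bytes to the OS
        after = os.path.getsize(path)
    return before, after


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "mybigfile.txt")
        print(sizes_around_flush(path, "some data to write"))
```

Once the buffer fills up it is written out and reused, which is why a long loop of small writes does not accumulate the file's contents in RAM.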