Python Forum
Possible bug found, please advise
#1
Over the last few days I have run into a strange issue: my Python script stops at random.

I made a script that sanitizes an extremely large text file (a couple of hundred GB). I had problems with programs using that large file, and I suspected it contained lines that were too long or that contained bad characters. To verify this, I wrote a script that copies all lines within a certain size range to a new file, removing all 0x00 characters along the way.

My script then aborted at random places, sometimes at less than 1% progress, sometimes halfway in. It simply stopped without any error message and returned me to my command prompt (I run it from the Windows 11 command line) without any hint. My code had try/except statements that print errors in the except block, so the except block was never triggered. The behavior was very random, but I never succeeded in processing the whole file. The randomness suggests that it is not tied to a particular point in my input file. In later iterations of my script I also got some weird exceptions, indicating that my time object was suddenly no longer bound to a real object, and later something similar: that a bool object was not callable. These exceptions also happened at random, on lines that had been executed thousands of times before (in the same run) without problems.

To me it looks like a memory problem. Somehow my Python script corrupts the system's memory, resulting in objects no longer being assigned and the script suddenly aborting without an error message.

It possibly even causes a complete system freeze sometimes, although that was much less reproducible.

Suspected code for this behavior:
1) My progress update:
sys.stdout.write(progress_line)
sys.stdout.flush()

2) My file handlers:
bufsize=1024*1024
with open(src, "rb", buffering=bufsize) as fin, open(dst, "wb", buffering=bufsize) as fout:
    while True:
        line = fin.readline()

I am a bit hesitant to share my complete script, as I am not sure how much of a possibly exploitable problem is at play here. Researching this further is very time-consuming, as each run can take several hours, and would only be worth it if it leads somewhere.

My question:
Is this a bug, or possibly even a vulnerability (referencing random memory), and should I report it somewhere?

Meanwhile I solved it by rewriting my script in .NET. It processed the file without problems (at half the speed) into a new file, and the new file did work, proving that my input file indeed had lines that were too long, lines with 0x00's in them, or both (long lines with many 0x00's). Reading the 'corrupt' file with a large-file reader also caused that reader to crash. I even tried reading it with Notepad++, which also crashed after a long time. So maybe it is even OS-related? I tried running from a different drive in the system, but that did not help either.

So my problem is 'solved' (although not with Python); I just need to know whether anyone is interested in digging further into this.
#2
If the file contains "bad characters" (whatever that means), you could perhaps read it in fixed-size chunks instead of using fin.readline().
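As a rough illustration of the chunked approach, here is a minimal sketch. The function name strip_nuls_chunked and the 1 MiB default chunk size are my own assumptions, not something from the original script. It only removes 0x00 bytes; filtering by line length would need extra handling for lines that span two chunks.

```python
import io

def strip_nuls_chunked(fin, fout, chunk_size=1024 * 1024):
    """Copy fin to fout in fixed-size chunks, dropping every 0x00 byte.

    Returns the number of NUL bytes that were removed.
    """
    removed = 0
    while True:
        chunk = fin.read(chunk_size)
        if not chunk:  # an empty bytes object means end of file
            break
        cleaned = chunk.replace(b"\x00", b"")
        removed += len(chunk) - len(cleaned)
        fout.write(cleaned)
    return removed

# Small in-memory demonstration; real use would pass files opened
# with open(src, "rb") and open(dst, "wb").
src = io.BytesIO(b"abc\x00def\n\x00\x00ghi\n")
dst = io.BytesIO()
removed = strip_nuls_chunked(src, dst, chunk_size=4)
result = dst.getvalue()
```

Because no line boundaries are involved, memory use stays bounded by chunk_size even if the file contains one enormous "line".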
« We can solve any problem by introducing an extra level of indirection »
#3
Hi,

well, the core problem here is that you use a lot of your own words to describe the problem, but that doesn't really help. Without the actual code and input data it is impossible to reproduce the problem and look for a possible solution.

Yes, it may indeed be certain (non-printable) characters in the data. Yes, it may indeed be OS-related. The latter could be verified fairly easily by running the same script with the same input file on Linux or macOS. It may also be hardware-related: the machine you run the script on may not have enough memory, or the memory may have defects that only come into play when heading towards 100% memory usage.

General recommendations: strip down the script as much as you can. Use print only instead of writing to stdout directly. Let Python choose the buffer size. Don't use try / except; let the script crash if it crashes, so that you get all error messages. Open the text file with the correct encoding so that Python does not guess or use the wrong one. Ensure the input file is always closed, even when the script crashes while working on the file. Iterate over the input file line by line; don't read it completely, and don't define your own chunks, to avoid missing or unexpected line breaks.
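A stripped-down version along these lines might look like the sketch below. It is only an illustration under my own assumptions: the parameter names min_len and max_len are invented, and it stays in binary mode (as the snippet in the first post does) rather than picking a text encoding.

```python
import os
import tempfile

def filter_lines(src, dst, min_len, max_len=255):
    """Copy lines whose byte length (without line ending) is in range."""
    kept = seen = 0
    # The with-statement closes both files even if an exception propagates.
    # No buffering argument: Python chooses the buffer size.
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        for line in fin:  # in binary mode, lines are split on b"\n"
            seen += 1
            line = line.replace(b"\x00", b"")
            if min_len <= len(line.rstrip(b"\r\n")) <= max_len:
                fout.write(line)
                kept += 1
    print(f"kept {kept:,} of {seen:,} lines")
    return kept, seen

# Tiny demonstration on a temporary file. There is no try/except here,
# so any failure would surface with a full traceback.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.txt")
    dst = os.path.join(tmp, "out.txt")
    with open(src, "wb") as f:
        f.write(b"ab\nabc\x00d\r\n\n" + b"x" * 300 + b"\n")
    kept, seen = filter_lines(src, dst, min_len=3)
    with open(dst, "rb") as f:
        output = f.read()
```

If this version runs to completion on the big file, the removed parts (progress output, timing, custom buffer size) can be added back one at a time.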

Regards, noisefloor
#4
Python should not suddenly at random stop a running script without an error message, right?
Python should not suddenly de-assign an object at random, right?

So there is a problem in the Python code that is not handling something in the correct way.

Note that I solved my problem with the same script in .NET, and I am already using my fixed file without problems. I'm happy to do more testing and pinpoint in more detail what is actually happening, but that is a lot of work and time. If the result is a more detailed bug report with steps to reproduce, what is going to happen with it? Will someone use it to actually make an improvement? If so, I am happy to participate. If not, I am not spending time on it. So my simple question is: is anyone interested in this?
#5
(Jan-05-2026, 08:58 PM)Hassher Wrote: Python should not suddenly at random stop a running script without an error message, right?
Python should not suddenly de-assign an object at random, right?

So there is a problem in the Python code that is not handling something in the correct way.
This is a bug in your script, not a bug in the Python language (with 99.99% probability). In the unlikely event that this is a bug in the Python language, we are not Python core developers here, so we won't use the bug report to improve the Python language. Read this official documentation page to see how to report bugs in the language.

If you post here a reasonable piece of code that fails, members of the forum may be able (and happy) to tell you where your code is faulty.
« We can solve any problem by introducing an extra level of indirection »
#6
There is no reason to use try/except in most scripts. Python does a good job of reporting exceptions when they are raised. The try/except might be hiding the fact that the program crashed. Use try/except only if the program can make a correction and continue.

It’s likely that the logic is wrong and your program terminates because it reached the end in a way you don’t see.
#7
@Hassher: as long as you do not provide any code and data, it's pointless to keep on talking. At this point, nobody except you knows what you are doing, so there is no point in asking for help without sharing the details.

The good news for you is that the actual problem is solvable, as you wrote a working solution in .NET. If .NET is the better tool for you and reliably gives you the result you are looking for, that's perfectly fine.

Regards, noisefloor
#8
@deanhystad, @noisefloor, @Gribouillis: there is no error in my script! There also cannot (should not) be errors that behave like this in Python (stopping at random without an error, and random de-assignment of objects).

For all of you in doubt, here is the script:

#!/usr/bin/env python3
import argparse, os, sys
from time import perf_counter  #previously used time, but tried perfcount to see if the suddenly disappearing time object could be solved by this. It did not solve it....
import traceback

def fmt_bytes(n):
    for unit in ("B","KB","MB","GB","TB"):
        if n < 1024 or unit == "TB": return f"{n:.2f} {unit}"
        n /= 1024

def fmt_eta(sec):
    sec = int(sec if sec >= 0 else 0)
    h, r = divmod(sec, 3600); m, s = divmod(r, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"


def filter_lines(src, dst, n, progress_every=0.5, bufsize=1024*1024, z=255): #, enc_errors="replace"):
    total = os.path.getsize(src)
    start = last = perf_counter() #mytime.time()
    last_tell = 0
    kept = seen = 0

    # NOTE: compares *byte* length of each line (fast, encoding-agnostic).
    with open(src, "rb", buffering=bufsize) as fin, open(dst, "wb", buffering=bufsize) as fout:
        while True:
            line = fin.readline()
            if not line: break
            seen += 1
            if b"\x00" in line:
                #sys.stderr.write("\nRemoving zero's\n")
                line = line.replace(b"\x00", b"")
            size = len(line.rstrip(b"\r\n"))
            if size >= n and size <= z:
                fout.write(line)
                kept += 1

            now = perf_counter()
            if now - last >= progress_every:
                done = fin.tell()
                elapsed = now - start
                speed = (done / elapsed) if elapsed > 0 else 0.0
                eta = (total - done) / speed if speed > 0 else 0
                if done != last_tell:  # avoid noisy zero-progress updates                    
                    progress_line = (
                        f"\r{done*100/total:6.2f}%  "
                        f"{fmt_bytes(done)}/{fmt_bytes(total)}  "
                        f"spd: {fmt_bytes(speed)}/s  "
                        f"ETA: {fmt_eta(eta)}  "
                        f"lines: kept {kept:,} / seen {seen:,} "                    
                    )
                    sys.stdout.write(progress_line)
                    sys.stdout.flush()         
                    last_tell = done
                last = now
    # final line
    sys.stdout.write(
        f"\r100.00%  {fmt_bytes(total)}/{fmt_bytes(total)}  "
        f"spd: --/s  ETA: 00:00:00  lines: kept {kept:,} / seen {seen:,}\n"
    )
    sys.stdout.flush()

def main():
    ap = argparse.ArgumentParser(
        description="Copy lines of length >= N from a huge .txt to a new file (fast, streaming, with progress & ETA)."
    )
    ap.add_argument("sourcefile")
    ap.add_argument("destfile")
    ap.add_argument("n", type=int, help="minimum line length (in BYTES)")
    ap.add_argument("--bufsize", type=int, default=1024*1024, help="I/O buffer size (bytes), default 1 MiB")
    ap.add_argument("--progress-every", type=float, default=1.0, help="progress update interval in seconds")
    ap.add_argument("--maxlen", type=int, default=255, help="maximum line length, default is 255")
    args = ap.parse_args()
    
    #tried this to try to get it to not catch an exception, but same result!
    #filter_lines(args.sourcefile, args.destfile, args.n, args.progress_every, args.bufsize, z=args.maxlen) #, args.enc_errors)
    
    try:
        if not os.path.exists(args.sourcefile):
            ap.error(f"Source not found: {args.sourcefile}")
        if os.path.abspath(args.sourcefile) == os.path.abspath(args.destfile):
            ap.error("Destination must be a different file.")
        filter_lines(args.sourcefile, args.destfile, args.n, args.progress_every, args.bufsize, z=args.maxlen) #, args.enc_errors)
    except KeyboardInterrupt:
        sys.stderr.write("\nInterrupted by user. Partial output preserved.\n")
        sys.exit(130)
    except Exception as e:        
        sys.stderr.write(f"\nERROR: {e}\n")
        traceback.print_exc()
        #print("mytime:", mytime, file=sys.stderr)
        #print("mytime.time:", mytime.time, type(mytime.time), file=sys.stderr)
            
        sys.exit(1)

if __name__ == "__main__":
    main()
To test it, create a file of about 350 GB with random ASCII lines varying in length between 1 and 255 characters.
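For anyone who wants to try this, a test file of that shape could be generated with something like the sketch below. The function name make_test_file and its parameters are hypothetical, the alphabet (ASCII letters and digits) is my own choice, and the demonstration uses a tiny target size; pass roughly 350 * 1024**3 as target_bytes for a file of the size described.

```python
import os
import random
import string
import tempfile

def make_test_file(path, target_bytes, min_len=1, max_len=255, seed=None):
    """Write random ASCII lines until at least target_bytes are written.

    Returns (number of lines, number of bytes written).
    """
    rng = random.Random(seed)
    alphabet = (string.ascii_letters + string.digits).encode("ascii")
    written = lines = 0
    with open(path, "wb") as fout:
        while written < target_bytes:
            n = rng.randint(min_len, max_len)  # both ends inclusive
            line = bytes(rng.choices(alphabet, k=n)) + b"\n"
            fout.write(line)
            written += len(line)
            lines += 1
    return lines, written

# Small demonstration (about 10 kB) in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.txt")
    lines, written = make_test_file(path, 10_000, seed=1)
    size_on_disk = os.path.getsize(path)
    with open(path, "rb") as f:
        lengths = [len(l.rstrip(b"\n")) for l in f]
```

The optional seed makes the generated file repeatable, which helps when comparing runs.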

Meanwhile I scanned my input file (with my .NET program) and can now confirm there are no long lines and no 0x00's in the file. Also, a specific point in the input file would not explain why Python crashes at random (at different positions in the input file). So it seems to happen with just a large file.
#9
Thanks for sharing the code.

As said: strip down the code to the bare minimum of what's necessary to perform what is supposed to be performed and see if the script still crashes.

Specifically:
* remove all performance measurement
* remove all output like progress status etc.
* don't specify any buffer size
* remove all error handling

If the script then works, the problem is in one of the parts you removed.

Apart from this:
* os.path is the older interface (it is not deprecated, but the newer pathlib is usually preferred). Path(filename_with_full_path).stat().st_size gives the file size.
* Your way of iterating over the lines is strange. More common is:
with open(file, 'rb') as in_file:
    for line in in_file:
        # do something with line
* Is there a specific reason for using sys.stdout and sys.stderr for output instead of simply using print() or, in case logging is required, the logging module? It's quite odd to do it this way.
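
Put together, the pathlib and iteration suggestions could look roughly like this sketch. The function name count_lines is just an illustrative choice, and the demonstration file is a small temporary one.

```python
import tempfile
from pathlib import Path

def count_lines(filename):
    """Return (line count, file size in bytes) for a file."""
    path = Path(filename)
    total = path.stat().st_size  # file size via pathlib
    lines = 0
    with path.open("rb") as in_file:
        for line in in_file:  # the file object yields one line at a time
            lines += 1
    print(f"{lines:,} lines, {total:,} bytes")
    return lines, total

# Demonstration on a small temporary file.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.txt"
    demo.write_bytes(b"first\nsecond\nthird without newline")
    lines, total = count_lines(demo)
```

Iterating over the file object reads the file through Python's own buffer, so there is no need to call readline() in a while loop.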

Regards, noisefloor
#10
Courtesy of the National Center for Atmospheric Research on GitHub (adapted)

Of course, don't bother printing anything if you are dealing with millions of lines!

test_file = '/home/peterr/temp/bits&bytes.txt' 

def gen_loader(test_file):
    # Yield the file's lines one at a time as bytes objects.
    with open(test_file, 'rb') as infile:
        for line in infile:
            yield line

for line in gen_loader(test_file):
    print(f' line = {line}')
    # b'\0' and b'\x00' are the same byte, so one replace() is enough.
    newline = line.replace(b'\x00', b'')
    print(f'newline = {newline}')
You get something like this:

Output:
 line = b'Hello binary world \x00 \n'
newline = b'Hello binary world \n'
 line = b'This line has no null byte!\n'
newline = b'This line has no null byte!\n'
 line = b'Hello binary world \x00 \n'
newline = b'Hello binary world \n'