Pigz inside python - Reading compressed .gz file much faster

jsmith7279 · Dec-21-2017, 07:21 PM

Hello Pythoners-

I am a Linux admin, and one of our users was wondering how to make the script below faster using pigz or any other multi-threading method. I have no Python experience. Can someone please share how to make the part below a little faster? She said it currently takes around 45 minutes to parse one compressed .gz file that is 1 GB in size.

import gzip

# open the input, compressed or plain
if infile.endswith(".gz"):
    data = gzip.open(infile, 'rb')
else:
    data = open(infile, "r")
outfile = infile.split(".txt")[0] + "_step1.gz"
outdata = gzip.open(outfile, "wb")

## take line by line
for line in data:
    line1 = line.rstrip()
    if line.startswith("@"):
        ....
        ....
        ....
        ....
        ....
outdata.close()
data.close()
print ">Output file: " + outfile # end of run

Thank you. This is not a homework task. This is a biology lab's problem.

**Larz60+** · (This post was last modified: Dec-21-2017, 08:01 PM by Larz60+.)

You can use python-magic.
Although this module is on PyPI, its name conflicts with other packages of the same name, so you have to download and install the wheel.
To do this:

Get the wheel from PyPI as follows:
Go to: https://pypi.python.org/pypi/python-magic/
Download the wheel file (current version): python_magic-0.4.15-py2.py3-none-any.whl
Change directory to the one containing the wheel.

from command line, install with:

pip install python_magic-0.4.15-py2.py3-none-any.whl

Once you have that package installed, use the following code to find file type:

import magic

def check_filetype(filename):
    # uncompress=True looks inside the compressed file to report the inner type
    f = magic.Magic(mime=True, uncompress=True)
    return f.from_file(filename)

This avoids having to load the entire compressed file.
It returns a string such as:

Output:
'text/plain'

See the documentation here: https://github.com/ahupp/python-magic

jsmith7279 · Dec-21-2017, 08:02 PM

Hey Larz60+

Thanks, friend. We are okay with memory. The bottleneck is the read and write speed, which is where the time is being wasted. Do you know if python-magic helps in those areas?

**Larz60+** · (This post was last modified: Dec-21-2017, 08:14 PM by Larz60+.)

No. Let me give you a sample for reading the files ... Be back soon.

Please answer this: what is the goal of reading a compressed file this way?
There may already be a package that does what you're trying to do in record time.

For example, the gzip module is built into Python; see: https://docs.python.org/3.6/library/gzip.html

jsmith7279 · Dec-21-2017, 08:53 PM

I honestly don't know. I was asked to help make it faster, so I decided to ask someone who knows Python inside and out. I have no clue, @Larz60+.

Thanks

**Larz60+** · Dec-22-2017, 01:44 AM

Hard to write something without knowing what the goal is.

**nilamo** · Dec-29-2017, 07:34 PM

http://aripollak.com/pythongzipbenchmarks/

Looks like the speed depends pretty heavily on which version of Python you're running. You might also gain some improvement by wrapping the gzip object in io.BufferedReader.
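
As a rough illustration of the io.BufferedReader idea, here is a minimal sketch written for Python 3. The function name iter_gzip_lines, the 1 MiB buffer size, and the UTF-8 decoding are assumptions made for the example, not part of the original script. Whether it helps at all probably depends on the Python version, so it is worth timing against a plain gzip.open loop.

```python
import gzip
import io


def iter_gzip_lines(path, buffer_size=1024 * 1024):
    """Yield text lines from a .gz file, read through io.BufferedReader.

    The name, buffer size, and UTF-8 decoding are illustrative assumptions.
    """
    # Wrap the GzipFile so that line iteration pulls large chunks
    # from the decompressor instead of many small reads.
    with io.BufferedReader(gzip.open(path, "rb"), buffer_size=buffer_size) as buffered:
        for raw_line in buffered:
            # Lines arrive as bytes; decode and strip the trailing newline.
            yield raw_line.decode("utf-8").rstrip("\n")
```

The loop in the original script could then iterate over iter_gzip_lines(infile) instead of over the gzip object directly.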

I wouldn't mind seeing more of your code, though, as 45 minutes for 1 GB sounds excessive. Depending on what you're doing (and the power of the computer it's running on), maybe we can create a process queue and take advantage of multiple cores/processors.
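
Since the question was about pigz specifically, one way to use a second core without restructuring the script is to let an external pigz process do the decompression and read its output through a pipe. The sketch below is only an outline under a few assumptions: it targets Python 3, the helper name open_gz_lines is made up for the example, the data is assumed to be UTF-8 text, and it falls back to the gzip module when pigz is not on the PATH. As far as I know, pigz does not parallelize decompression itself, so any gain on the read side would come from decompression running in a separate process from the Python parsing; the write side is where pigz can use several cores.

```python
import gzip
import shutil
import subprocess


def open_gz_lines(path):
    """Yield text lines from a .gz file, decompressing with pigz if installed.

    Hypothetical helper: falls back to the standard gzip module otherwise.
    """
    pigz = shutil.which("pigz")
    if pigz is None:
        # pigz not found: use the built-in gzip module in text mode.
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                yield line
        return

    # -d decompresses, -c writes to stdout; the result is read as text lines.
    proc = subprocess.Popen(
        [pigz, "-dc", path], stdout=subprocess.PIPE, encoding="utf-8"
    )
    try:
        for line in proc.stdout:
            yield line
    finally:
        proc.stdout.close()
        returncode = proc.wait()
    if returncode != 0:
        raise subprocess.CalledProcessError(returncode, proc.args)
```

Lines keep their trailing newline here, so the existing line.rstrip() call in the script would still apply.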
