Hello everyone,
Let me explain my situation; maybe someone can clear things up and fill a gap in my knowledge.
I had to recover photos from a broken HDD, so I copied the undamaged sectors to an .img file.
With Scalpel (on Linux Mint) I extracted more than 350 GB of images, many of which were useless because they were too small.
Seeing the amount of work involved, I realized I couldn't check 350 GB of images by hand...
So I very quickly wrote a script. It is written quite badly, but it works and does what I need.
In essence, I have 304 folders, each containing many jpg files.
The program looks in each folder for jpg images that are "large" and that open correctly.
Here is the first script:
import os, time, shutil
from PIL import Image

sext = []
print(time.strftime("%H|%M|%S"))
# ask the user which extensions to search for; I only care about jpg, so I enter: .jpg
# after choosing the extensions, type 0 and press Enter
while True:
    sext.append(input("search: "))
    if sext[-1] == "0":
        del sext[-1]
        break
sext = tuple(sext)
# walk the folders under "Files recuperati" and pick every file whose name ends with
# one of the extensions in sext (a tuple; in my case only ".jpg")
for parent, directories, filenames in os.walk("/media/mionomeutente/PENNA USB/Files recuperati"):
    for x in filenames:
        if x.endswith(sext):
            fileDaAnalizzare = os.path.join(parent, x)
            # open the file and check that the image is large enough
            try:
                with Image.open(fileDaAnalizzare) as im:
                    width, height = im.size
                if width > 350 and height > 350:
                    print(fileDaAnalizzare)
                    # simply copy the image that meets my needs into the "grandi" folder
                    shutil.copy(fileDaAnalizzare, '/media/mionomeutente/PENNA USB/grandi')
            except Exception:
                # unreadable or corrupted image: skip it
                pass
print(time.strftime("%H|%M|%S"))

I could have stopped there: the program is very basic, it works and does what I need. But then I asked myself: if several threads did this work instead of just one, would the time drop significantly?
(Remember that I have 350 GB to go through...)
So, again in a hurry (and obviously badly written), I wrote the same program, modified to use threads.
The code:
import os, threading, shutil, random
from PIL import Image

nThreadz = 0
threadz = []
# keep asking until a positive number of threads is entered
while nThreadz <= 0:
    nThreadz = int(input("number of threads: "))
sext = []
# ask which extensions to search for; type 0 and press Enter to finish
while True:
    sext.append(input("search: "))
    if sext[-1] == "0":
        del sext[-1]
        break
sext = tuple(sext)
# collect every folder under the recovered files (os.walk yields the root and each subfolder)
listMatch = []
for parent, directories, filenames in os.walk("/media/mionomeutente/PENNA USB/Files recuperati"):
    listMatch.append(parent)
print("currently %d folders" % len(listMatch))

class scan(threading.Thread):
    def __init__(self, group=None, target=None, name=None, args=(), kwargs=None, daemon=None):
        threading.Thread.__init__(self, group=group, target=target, name=name, daemon=daemon)
        # args holds the index of this thread (0 .. nThreadz-1)
        self.args = args
        self.kwargs = kwargs

    def run(self):
        # the number of folders is not necessarily divisible by the number of threads
        vadoDa = self.args * (len(listMatch) // nThreadz)
        if self.args != (nThreadz - 1):
            vadoA = (self.args + 1) * (len(listMatch) // nThreadz)
        else:
            # the last thread also takes the leftover folders; a slice end is exclusive,
            # so this must be len(listMatch) and not len(listMatch)-1
            vadoA = len(listMatch)
        # each thread handles its own share of the folders (304 in total)
        for percorso in listMatch[vadoDa:vadoA]:
            # listMatch already contains every subfolder, so look only at the entries
            # directly inside this folder instead of walking it again
            for x in os.listdir(percorso):
                if x.endswith(sext):
                    fileDaAnalizzare = os.path.join(percorso, x)
                    # open the file and check that the image is large enough
                    try:
                        with Image.open(fileDaAnalizzare) as im:
                            width, height = im.size
                        if width > 350 and height > 350:
                            if not os.path.exists('/media/mionomeutente/PENNA USB/grandi/' + x):
                                shutil.copy(fileDaAnalizzare, '/media/mionomeutente/PENNA USB/grandi')
                            else:
                                # a file with this name already exists: add a random prefix
                                nomeRandom = str(random.randint(0, 1000000000)) + x
                                shutil.copy(fileDaAnalizzare, '/media/mionomeutente/PENNA USB/grandi/' + nomeRandom)
                    except Exception:
                        # unreadable or corrupted image (or not a file at all): skip it
                        pass

for x in range(0, nThreadz):
    threadz.append(scan(args=x))
for x in range(0, nThreadz):
    threadz[x].start()

It works the same way as the previous script, except that I can choose how many threads "divide" the work... or so I thought.
In practice I got very strange results:
as the number of threads increased, the number of files sifted did not increase; it decreased:
1 thread = 3181 elements in 30 seconds
2 threads = 648 elements in 30 seconds
3 threads = 764 elements in 30 seconds
304 threads = 166 items in 30 seconds
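For the runs to be comparable, every thread count has to cover exactly the same folders. Here is a minimal sketch of one way to split the folder list into contiguous slices. The helper name `split_ranges` is hypothetical and is not part of the scripts above; it mirrors the `vadoDa`/`vadoA` arithmetic and lets the last slice absorb the remainder:

```python
def split_ranges(n_items, n_workers):
    """Return one (start, stop) index pair per worker, covering 0..n_items.

    Hypothetical helper: every worker gets n_items // n_workers items and
    the last worker also takes whatever is left over.
    """
    if n_workers <= 0:
        raise ValueError("n_workers must be positive")
    size = n_items // n_workers
    ranges = []
    for i in range(n_workers):
        start = i * size
        # The stop index is exclusive, so the last slice ends at n_items.
        stop = n_items if i == n_workers - 1 else (i + 1) * size
        ranges.append((start, stop))
    return ranges


if __name__ == "__main__":
    # Example: 304 folders shared among 3 threads.
    for start, stop in split_ranges(304, 3):
        print(start, stop)
```

Each pair can be used directly as a slice, for example `listMatch[start:stop]`, and together the slices touch every folder exactly once.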
I hope someone finds this topic interesting and can clear up this "mystery" for me ;)
(Of course, I don't rule out that I made some mistakes.)
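For anyone who wants to repeat the comparison, here is a minimal sketch that times a complete run instead of counting files after 30 seconds. It assumes a hypothetical `time_run` helper and a `process_folder` callable standing in for the per-folder scan; neither exists in the scripts above. Timing the whole job means every thread count does the same amount of work:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def time_run(folders, process_folder, n_workers):
    """Run process_folder over every folder using n_workers threads.

    Returns (total_items, elapsed_seconds). process_folder is a placeholder
    for the real scan and must return the number of files it handled.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        counts = list(pool.map(process_folder, folders))
    elapsed = time.perf_counter() - start
    return sum(counts), elapsed


if __name__ == "__main__":
    # Stand-in workload (an assumption): pretend each folder holds 10 files.
    def fake_scan(folder):
        time.sleep(0.01)
        return 10

    folders = ["folder%d" % i for i in range(20)]
    for n in (1, 2, 4):
        total, elapsed = time_run(folders, fake_scan, n)
        print(n, "threads:", total, "items in", round(elapsed, 2), "s")
```

With the real scan plugged in as `process_folder`, the elapsed times for 1, 2, 3 and 304 threads can be compared directly.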

![[Image: 1*wd0z1C75VsxD42QdKqCjpA.gif]](https://miro.medium.com/max/700/1*wd0z1C75VsxD42QdKqCjpA.gif)