Mar-24-2021, 02:10 PM
Hi, I have df like:
df = pd.DataFrame(np.array([[1, 10, 20], [1, 5, 8], [1, 5, 15], [1, 13, 14], [1, 18, 21],[2, 2, 2], [1, 21, 100], [1, 1, 50]]),
columns=['id', 'start', 'stop'])
df['valid'] = True
print(f"{df} \n")
>>> id start stop valid
0 1 10 20 True
1 1 5 8 True
2 1 5 15 True
3 1 13 14 True
4 1 18 21 True
5 2 2 2 True
6 1 21 100 True
7 1 1 50 TrueTo mark invalid enries, which have the same id and an overlap in time I came up with the idea:col2 = 'id'
col3 = 'start'
col4 = 'stop'
counter = 0
for index, row in df.iterrows():
id = df.at[index, col2]
start = df.at[index, col3]
stop = df.at[index, col4]
#3 cases for overlapping time
#case 1: index.start <= stop <= index.stop
#case 2: index.start <= start <= index.stop
#case 3: start <= index start and stop >= index.stop
df_temp = df.query(f"{col2} == '{id}' and (({start} <= {col4} <= {stop}) or ({start} <= {col3} <= {stop}) or ({col3} <= {start} and {col4} >= {stop}))")
#sort out same index
df_temp = df_temp.drop([index])
for index in df_temp.index:
if df.loc[index, 'valid'] == True:
df.loc[index, 'valid'] = False
counter += 1
print(f"Affected Rows: {counter}")My solution takes aprox 6s for 0.1% (326 rows), which results with linear interpolation in aprox 100min. Is there a way, to make this faster? I'd appreciate a hint. Thanks.
