Python Forum
MarkdownCommander
#1
I have tons of Markdown documents that I have collected over time. Whenever I needed to find one that I had put together sometime in the past, the time consumed in actually finding the document often exceeded the value of its contents. I decided a tool was needed that would allow quick access to any document, and perhaps display the results as well. So I came up with the idea for a Markdown Commander. I laid out the general plan, and decided I would use AI to help with the coding. My AI of choice is XGrok, and the contribution I received from Grok was tremendous. This turned out to be such a useful tool that I thought I'd share the code.

So here are the steps needed to build it on your system. I built this on Linux Mint 21.3 and haven't tried it on any other OS, but I believe it should run with little or no modification.

  1. Create a project directory and name it MarkdownCommander.
  2. cd to that directory.
  3. Create a virtual environment, using whatever tool you like. I use Python's venv like so: python -m venv venv
  4. Start the virtual environment: . ./venv/bin/activate
  5. Install wxPython using the following command: pip install -U -f https://extras.wxpython.org/wxPython4/extras/linux/gtk3/ubuntu-22.04 wxPython
    This command is for Ubuntu-based systems using GTK 3; Linux Mint 21.3 is based on Ubuntu 22.04. You may have to adjust it for your OS. wxPython can be fussy.
  6. Cleanup apt: sudo apt autoclean
  7. Update apt: sudo apt update
  8. Install wxwidgets using: sudo apt install libwxgtk3.0-dev
  9. Install mistune: pip install mistune
  10. Install sentence_transformers: pip install sentence_transformers
  11. Create a src directory, and cd to that directory:
    mkdir src
    cd src
  12. Create an empty __init__.py file: touch __init__.py
  13. Using your favorite text editor, add MarkdownCommander.py:
    import wx
    import wx.html2
    import mistune
    from pathlib import Path
    from sentence_transformers import SentenceTransformer, util
    import threading
    import torch
    from datetime import datetime
    
    
    # ====================== SAFE TORCH LOAD ======================
    torch.serialization.add_safe_globals([datetime])
    
    
    class SemanticIndexer:
        def __init__(self, model_name="all-MiniLM-L6-v2"):
            self.model = SentenceTransformer(model_name)
            self.embeddings = None
            self.file_paths = []
            self.file_names = []
            self.file_mtimes = []
            self.source_dir = None
            self.build_time = None
            self.last_update_time = None
            self.exclude_dirs = {'.git', 'node_modules', '__pycache__', 'venv', 'env', '.venv'}
    
        def _get_file_info(self, path: Path):
            try:
                text = path.read_text(encoding='utf-8', errors='ignore')[:12000]
                stat = path.stat()
                return text, stat.st_mtime
            except OSError:  # unreadable or vanished file: skip it
                return None, None
    
        def build_index(self, directory: str):
            directory = Path(directory)
            self.source_dir = directory
            texts = []
            self.file_paths.clear()
            self.file_names.clear()
            self.file_mtimes.clear()
    
            for p in directory.rglob("*.*"):
                if any(ex in p.parts for ex in self.exclude_dirs):
                    continue
                if p.suffix.lower() not in {'.md', '.txt', '.py', '.html', '.rst'}:
                    continue
                text, mtime = self._get_file_info(p)
                if text:
                    texts.append(text)
                    self.file_paths.append(str(p))
                    self.file_names.append(p.name)
                    self.file_mtimes.append(mtime)
    
            print(f"Encoding {len(texts)} documents...")
            self.embeddings = self.model.encode(texts, convert_to_tensor=True, show_progress_bar=True)
            self.build_time = datetime.now()
            self.last_update_time = datetime.now()
    
        def search(self, query: str, top_k=10):
            if self.embeddings is None:
                return []
            query_emb = self.model.encode(query, convert_to_tensor=True)
            hits = util.semantic_search(query_emb, self.embeddings, top_k=top_k)[0]
            return [(self.file_paths[h['corpus_id']], 
                     self.file_names[h['corpus_id']], 
                     h['score']) for h in hits]
    
        def check_and_update(self):
            if not self.source_dir or self.embeddings is None:
                return False, 0
    
            current_files = {}
            for p in self.source_dir.rglob("*.*"):
                if any(ex in p.parts for ex in self.exclude_dirs):
                    continue
                if p.suffix.lower() not in {'.md', '.txt', '.py', '.html', '.rst'}:
                    continue
                _, mtime = self._get_file_info(p)
                if mtime:
                    current_files[str(p)] = (p, mtime)
    
            changed_count = 0
            to_keep = [i for i, path in enumerate(self.file_paths) if path in current_files]
            if len(to_keep) != len(self.file_paths):
                changed_count += len(self.file_paths) - len(to_keep)
                self.file_paths = [self.file_paths[i] for i in to_keep]
                self.file_names = [self.file_names[i] for i in to_keep]
                self.file_mtimes = [self.file_mtimes[i] for i in to_keep]
                self.embeddings = self.embeddings[to_keep]
    
            existing_set = set(self.file_paths)
            to_add_texts = []
            to_add_paths = []
            to_add_names = []
            to_add_mtimes = []
    
            for path_str, (p, mtime) in current_files.items():
                if path_str not in existing_set:
                    text, _ = self._get_file_info(p)
                    if text:
                        to_add_texts.append(text)
                        to_add_paths.append(path_str)
                        to_add_names.append(p.name)
                        to_add_mtimes.append(mtime)
                        changed_count += 1
                else:
                    idx = self.file_paths.index(path_str)
                    if abs(mtime - self.file_mtimes[idx]) > 2.0:
                        text, _ = self._get_file_info(p)
                        if text:
                            to_add_texts.append(text)
                            to_add_paths.append(path_str)
                            to_add_names.append(p.name)
                            to_add_mtimes.append(mtime)
                            self.file_paths.pop(idx)
                            self.file_names.pop(idx)
                            self.file_mtimes.pop(idx)
                            self.embeddings = torch.cat([self.embeddings[:idx], self.embeddings[idx+1:]])
                            changed_count += 1
    
            if to_add_texts:
                new_emb = self.model.encode(to_add_texts, convert_to_tensor=True, show_progress_bar=False)
                self.embeddings = torch.cat([self.embeddings, new_emb]) if len(self.embeddings) > 0 else new_emb
                self.file_paths.extend(to_add_paths)
                self.file_names.extend(to_add_names)
                self.file_mtimes.extend(to_add_mtimes)
    
            if changed_count > 0:
                self.last_update_time = datetime.now()
    
            return changed_count > 0, changed_count
    
        def save(self, save_dir: Path):
            save_dir.mkdir(parents=True, exist_ok=True)
            torch.save({
                'embeddings': self.embeddings,
                'file_paths': self.file_paths,
                'file_names': self.file_names,
                'file_mtimes': self.file_mtimes,
                'source_dir': str(self.source_dir) if self.source_dir else None,
                'build_time': self.build_time,
                'last_update_time': self.last_update_time,
            }, save_dir / "index.pt")
    
        def load(self, save_dir: Path):
            data = torch.load(save_dir / "index.pt", weights_only=True, map_location='cpu')
            self.embeddings = data['embeddings']
            self.file_paths = data['file_paths']
            self.file_names = data['file_names']
            self.file_mtimes = data.get('file_mtimes', [])
            src = data.get('source_dir')
            self.source_dir = Path(src) if src else None
            self.build_time = data.get('build_time')
            self.last_update_time = data.get('last_update_time')
    
    
    # ====================== MAIN APP ======================
    class SemanticSearchApp(wx.Frame):
        INDEX_DIR = Path.home() / "semantic_search_index"
        DARK_BG = wx.Colour(30, 30, 30)
        DARK_FG = wx.Colour(230, 230, 230)
        LIGHT_BG = wx.Colour(255, 255, 255)
        LIGHT_FG = wx.Colour(0, 0, 0)
    
        def __init__(self):
            super().__init__(None, title="Markdown Commander", size=(1280, 820))
            self.indexer = None
            self.current_results = []
            self._update_lock = threading.Lock()
            self.is_dark_mode = False
            
            self._build_ui()
            self._try_auto_load_index()
            self.Show()
    
        def _build_ui(self):
            splitter = wx.SplitterWindow(self, style=wx.SP_LIVE_UPDATE | wx.SP_3D)
            left = wx.Panel(splitter)
            sizer = wx.BoxSizer(wx.VERTICAL)
    
            # Query area
            query_box = wx.BoxSizer(wx.HORIZONTAL)
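            # Add the label to the sizer; otherwise it is drawn at (0, 0) over other widgets
            query_box.Add(wx.StaticText(left, label="Query:"), 0, wx.ALIGN_CENTER_VERTICAL | wx.RIGHT, 8)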
            self.query_ctrl = wx.TextCtrl(left, style=wx.TE_PROCESS_ENTER, size=(-1, 40))
            self.query_ctrl.Bind(wx.EVT_TEXT_ENTER, self.on_search)
            query_box.Add(self.query_ctrl, 1, wx.EXPAND | wx.RIGHT, 8)
    
            self.search_btn = wx.Button(left, label="Search", size=(100, 40))
            self.search_btn.Bind(wx.EVT_BUTTON, self.on_search)
            query_box.Add(self.search_btn, 0, wx.ALIGN_CENTER_VERTICAL)
    
            sizer.Add(query_box, 0, wx.EXPAND | wx.ALL, 10)
    
            sizer.Add(wx.StaticText(left, label="Results (double-click to open):"), 0, wx.LEFT | wx.RIGHT, 10)
            self.results_list = wx.ListCtrl(left, style=wx.LC_REPORT | wx.LC_SINGLE_SEL | wx.LC_HRULES)
            self.results_list.InsertColumn(0, "Score", width=90)
            self.results_list.InsertColumn(1, "Document", width=580)
            self.results_list.Bind(wx.EVT_LIST_ITEM_ACTIVATED, self.on_result_click)
    
            sizer.Add(self.results_list, 1, wx.EXPAND | wx.ALL, 10)
            left.SetSizer(sizer)
    
            # ==================== RIGHT PANEL - HTML PREVIEW ====================
            right_panel = wx.Panel(splitter)
            right_sizer = wx.BoxSizer(wx.VERTICAL)
            
            toolbar = wx.BoxSizer(wx.HORIZONTAL)
            self.clear_btn = wx.Button(right_panel, label="Clear Preview")
            self.clear_btn.Bind(wx.EVT_BUTTON, self.on_clear_preview)
            toolbar.Add(self.clear_btn, 0, wx.ALL, 5)
            right_sizer.Add(toolbar, 0, wx.EXPAND | wx.LEFT | wx.RIGHT, 5)
    
            # self.preview = wx.html.HtmlWindow(right_panel, style=wx.SUNKEN_BORDER)
            self.preview = wx.html2.WebView.New(right_panel)
            # self.preview.SetStandardFonts(10)
            right_sizer.Add(self.preview, 1, wx.EXPAND | wx.ALL, 5)
            
            right_panel.SetSizer(right_sizer)
    
            splitter.SplitVertically(left, right_panel, 680)
            splitter.SetMinimumPaneSize(400)
    
            # Menu
            menu_bar = wx.MenuBar()
            file_menu = wx.Menu()
            file_menu.Append(101, "&Build/Rebuild Full Index...\tCtrl+B")
            file_menu.Append(102, "&Force Full Rebuild\tCtrl+R")
            file_menu.Append(103, "&Save Current Index\tCtrl+S")
            file_menu.AppendSeparator()
            file_menu.Append(105, "Toggle Dark/Light Mode\tCtrl+T")
            file_menu.AppendSeparator()
            file_menu.Append(104, "E&xit")
            menu_bar.Append(file_menu, "&File")
            self.SetMenuBar(menu_bar)
    
            self.Bind(wx.EVT_MENU, self.on_build_index, id=101)
            self.Bind(wx.EVT_MENU, self.on_force_rebuild, id=102)
            self.Bind(wx.EVT_MENU, self.on_save_index, id=103)
            self.Bind(wx.EVT_MENU, self.on_toggle_theme, id=105)
            self.Bind(wx.EVT_MENU, lambda e: self.Close(), id=104)
    
            self.SetStatusBar(wx.StatusBar(self))
            self.SetStatusText("Ready — Build an index to begin")
            self.apply_theme()
    
        def on_clear_preview(self, event):
            self.preview.SetPage("","")
            self.SetStatusText("Preview cleared")
    
        def apply_theme(self):
            bg = self.DARK_BG if self.is_dark_mode else self.LIGHT_BG
            fg = self.DARK_FG if self.is_dark_mode else self.LIGHT_FG
    
            self.SetBackgroundColour(bg)
            self.query_ctrl.SetBackgroundColour(bg)
            self.query_ctrl.SetForegroundColour(fg)
            self.results_list.SetBackgroundColour(bg)
            self.results_list.SetForegroundColour(fg)
            self.Refresh()
            self.Update()
    
        def on_toggle_theme(self, event):
            self.is_dark_mode = not self.is_dark_mode
            self.apply_theme()
            mode = "Dark" if self.is_dark_mode else "Light"
            self.SetStatusText(f"Switched to {mode} mode")
    
        # ==================== Remaining methods ====================
        def _try_auto_load_index(self):
            if not (self.INDEX_DIR / "index.pt").exists():
                return
            if wx.MessageBox("Saved index found. Load it now?", "Load Index", wx.YES_NO | wx.ICON_QUESTION) == wx.YES:
                try:
                    self.indexer = SemanticIndexer()
                    self.indexer.load(self.INDEX_DIR)
                    self._index_ready()
                except Exception as e:
                    wx.MessageBox(f"Load failed:\n{str(e)}", "Load Error", wx.OK | wx.ICON_ERROR)
    
        def on_build_index(self, event):
            dlg = wx.DirDialog(self, "Select folder with your documents")
            if dlg.ShowModal() == wx.ID_OK:
                path = dlg.GetPath()
                self.SetStatusText("Building full index...")
                threading.Thread(target=self._build_index_thread, args=(path,), daemon=True).start()
            dlg.Destroy()
    
        def _build_index_thread(self, folder):
            if self.indexer is None:
                self.indexer = SemanticIndexer()
            try:
                self.indexer.build_index(folder)
                wx.CallAfter(self._index_ready)
            except Exception as e:
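                # Pass the message as an argument: "e" is unbound once the except block ends,
                # so a lambda that reads it later would raise NameError.
                wx.CallAfter(wx.MessageBox, str(e), "Error")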
    
        def on_force_rebuild(self, event):
            if not self.indexer or not self.indexer.source_dir:
                wx.MessageBox("Please build an index first.", "Info")
                return
            if wx.MessageBox("Clear current index and rebuild everything?", "Force Rebuild", wx.YES_NO | wx.ICON_WARNING) == wx.YES:
                threading.Thread(target=self._build_index_thread, args=(str(self.indexer.source_dir),), daemon=True).start()
    
        def _index_ready(self):
            count = len(self.indexer.file_paths)
            ts = self.indexer.last_update_time.strftime("%Y-%m-%d %H:%M") if self.indexer.last_update_time else ""
            self.SetStatusText(f"Ready — {count} documents • Last updated: {ts}")
    
        def _check_for_updates(self):
            if not self.indexer:
                return
            with self._update_lock:
                updated, changed = self.indexer.check_and_update()
                if updated:
                    wx.CallAfter(self._index_ready)
                    wx.CallAfter(lambda: self.SetStatusText(f"Auto-updated: {changed} new/changed file{'s' if changed != 1 else ''}"))
    
        def on_search(self, event):
            if not self.indexer or self.indexer.embeddings is None:
                wx.MessageBox("Please build an index first.", "Info")
                return
    
            threading.Thread(target=self._check_for_updates, daemon=True).start()
    
            query = self.query_ctrl.GetValue().strip()
            if not query:
                return
    
            self.SetStatusText("Searching...")
            self.results_list.DeleteAllItems()
    
            results = self.indexer.search(query, top_k=12)
            self.current_results = results
    
            for i, (_, name, score) in enumerate(results):
                self.results_list.InsertItem(i, f"{score:.4f}")
                self.results_list.SetItem(i, 1, name)
    
            self.SetStatusText(f"Found {len(results)} results")
    
        def on_save_index(self, event):
            if not self.indexer or self.indexer.embeddings is None:
                wx.MessageBox("Nothing to save.", "Info")
                return
            try:
                self.indexer.save(self.INDEX_DIR)
                wx.MessageBox(f"Index saved to:\n{self.INDEX_DIR}", "Success")
            except Exception as e:
                wx.MessageBox(str(e), "Save Failed")
    
        def on_result_click(self, event):
            idx = event.GetIndex()
            full_path, name, score = self.current_results[idx]
            fpath = Path(full_path)
            
            try:
                content = fpath.read_text(encoding="utf-8", errors="ignore")
                html_content = mistune.html(content)            
    
                full_html = f"""
                <html>
                <head>
                    <style>
                        body {{ font-family: Arial, Helvetica, sans-serif; padding: 20px; line-height: 1.6; }}
                        pre {{ background: #f4f4f4; padding: 12px; border-radius: 4px; overflow: auto; }}
                        code {{ font-family: monospace; }}
                        h1, h2, h3 {{ color: #2c3e50; }}
                        img {{ max-width: 100%; height: auto; display: block; margin: 15px 0; }}
                    </style>
                </head>
                <body>
                    {html_content}
                </body>
                </html>
                """
    
                # Fix: Use base URL so local images load
                base_url = fpath.parent.resolve().as_uri() + "/"  # percent-encodes spaces etc.
                
                self.preview.SetPage(full_html, base_url)
                self.SetStatusText(f"Opened: {name}  (Score: {score:.4f})")
                
            except Exception as e:
                print(f"Preview error: {e}")  # for your console
                self.SetStatusText(f"Could not open file: {e}")
                self.preview.SetPage(f"<html><body><p>Error: {e}</p></body></html>", "")
            
    if __name__ == "__main__":
        app = wx.App(False)
        frame = SemanticSearchApp()
        app.MainLoop()
  14. Go back to the main directory: cd ..
  15. Create a data directory: mkdir data
  16. Create a markdown subdirectory in data: mkdir ./data/markdown
  17. And an images directory below the markdown directory: mkdir ./data/markdown/images

Gather your markdown files, and load them all into the markdown directory, and any associated images into the images directory.

sentence_transformers downloads its language model (all-MiniLM-L6-v2 here) from Hugging Face, so an internet connection is needed at least for the first run.
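
For anyone curious what happens to the embeddings at search time: as far as I know, util.semantic_search ranks documents by cosine similarity unless told otherwise. The sketch below shows that ranking idea in plain Python on made-up 3-dimensional vectors. The vectors, file names, and the helper names cosine and rank are hypothetical stand-ins; real embeddings from all-MiniLM-L6-v2 have 384 dimensions and come from the model, not from hand-written lists.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def rank(query_vec, doc_vecs, top_k=3):
    """Return (name, score) pairs sorted by descending similarity."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy "embeddings"; a real model produces these from text.
docs = {
    "backup_notes.md": [0.9, 0.1, 0.0],
    "recipes.md": [0.0, 0.2, 0.9],
    "rsync_howto.md": [0.8, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]

for name, score in rank(query, docs):
    print(f"{score:.4f}  {name}")
```

The two backup-related files come out ahead of the recipes file because their vectors point in nearly the same direction as the query.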

That should be all that's required to get started.
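
Before the first run, it may be worth confirming that the packages from the install steps are importable inside the virtual environment. This is a small sketch using only the standard library; the helper name missing_modules and the REQUIRED list are illustrative, based on the imports at the top of MarkdownCommander.py.

```python
import importlib.util

# Import names used by MarkdownCommander.py
# (wxPython is imported as "wx"; torch comes in with sentence_transformers).
REQUIRED = ["wx", "mistune", "sentence_transformers", "torch"]


def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]


missing = missing_modules(REQUIRED)
if missing:
    print("Not installed in this environment:", ", ".join(missing))
else:
    print("All required packages were found.")
```

Run it with the virtual environment activated; anything it lists still needs a pip install.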

From the project directory, run: python src/MarkdownCommander.py

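Once you have built and saved an index (File > Save Current Index), it can be reloaded at startup to speed up the process. A full rebuild should rarely be needed, because each search also checks the source folder for new, changed, or deleted files.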
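
The update check in check_and_update boils down to comparing stored modification times against the current ones, with a 2-second tolerance. Here is a simplified stand-alone sketch of that comparison, assuming snapshots are plain dicts of path to mtime; the function names snapshot and diff_snapshots are illustrative and not part of the program.

```python
from pathlib import Path


def snapshot(root, suffixes=(".md",)):
    """Map each matching file path (as str) to its modification time."""
    return {
        str(p): p.stat().st_mtime
        for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in suffixes
    }


def diff_snapshots(old, new, tolerance=2.0):
    """Compare two snapshots and return (added, changed, removed) path lists."""
    added = sorted(p for p in new if p not in old)
    removed = sorted(p for p in old if p not in new)
    # Small mtime differences are ignored, as in the posted code.
    changed = sorted(
        p for p in new if p in old and abs(new[p] - old[p]) > tolerance
    )
    return added, changed, removed


# Made-up timestamps for illustration.
old = {"a.md": 100.0, "b.md": 200.0, "c.md": 300.0}
new = {"a.md": 100.5, "b.md": 260.0, "d.md": 400.0}
print(diff_snapshots(old, new))
```

Only files in the added and changed lists need to be re-encoded, which is why an update is much cheaper than a full rebuild.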

Here's a screenshot of the program, and the display of a markdown document that contains images:

   

Again, I greatly appreciate the tremendous help that I received from XGrok, which is my choice for Python assistance.

If you find that I missed anything, please let me know.

Edited May 17 -- Clarified install steps
#2
Hi,

If I understand correctly, the program does not parse any of the Markdown syntax, right? If so, the program should work just as well with another purely text-based format, e.g. reST?

Regards, noisefloor
#3
This is actually a really useful idea — semantic search for personal markdown archives can save a ton of time, especially for large note collections. The UI and markdown preview integration make it feel practical for real daily use as well.

I’ve been wanting to build more utility tools like this myself once my PC bottleneck stops bullying my workflow 😅 Really nice project overall, and the setup instructions are detailed enough to follow easily.
buran write May-18-2026, 03:52 PM:
Spam link removed
#4
noisefloor Wrote: if I understand correctly, the program does not parse any syntax of the Markdown files, right?
You are correct. There's no reason why this can't be modified for many other types of files.
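
The posted indexer already accepts .txt, .py, .html and .rst besides .md, so supporting other formats is mostly a matter of changing the suffix set. One possible sketch of pulling the file filter out into a function with configurable suffixes; the name iter_documents and the .adoc example are hypothetical, not part of the program.

```python
import tempfile
from pathlib import Path

DEFAULT_SUFFIXES = {".md", ".txt", ".py", ".html", ".rst"}
EXCLUDE_DIRS = {".git", "node_modules", "__pycache__", "venv", "env", ".venv"}


def iter_documents(root, suffixes=DEFAULT_SUFFIXES, exclude_dirs=EXCLUDE_DIRS):
    """Yield files under root with a wanted suffix, skipping excluded dirs."""
    root = Path(root)
    for p in sorted(root.rglob("*")):
        if not p.is_file():
            continue
        # Look only at the part of the path below root, so a parent folder
        # that happens to be called "env" does not exclude everything.
        if any(part in exclude_dirs for part in p.relative_to(root).parts):
            continue
        if p.suffix.lower() in suffixes:
            yield p


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "a.md").write_text("# notes")
    (base / "b.adoc").write_text("= notes")
    (base / "c.log").write_text("ignored")
    (base / "venv").mkdir()
    (base / "venv" / "d.md").write_text("ignored too")
    found = [p.name for p in iter_documents(base, suffixes={".md", ".adoc"})]
    print(found)
```

Note that the preview pane would still run everything through mistune, so non-Markdown formats would be searchable but not necessarily rendered nicely.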
#5
Hello,
Quote:semantic search for personal markdown archives can save a ton of time, especially for large note collections.

Well, if somebody uses a Linux distro with GNOME as the desktop environment, GNOME's Tracker should do much the same thing: allow full-text search through documents which are indexed by Tracker. The same goes for Baloo on KDE Plasma or, more generally, any desktop search engine that integrates with your Linux system. Not sure what Windows and macOS offer, but I guess there's something available there, too.
I played around with Tracker a few years back for the German-language Ubuntu Wiki, but I never tried to search through a larger collection of documents with it. So I don't know how good (or bad) it is compared to the solution presented here.

Regards, noisefloor