Vector Search App MVP

Background

Last year, I wrote a news recommendation algorithm for WantToKnow.info. You can read about the project here and test out the recommendations by clicking on any article title in our archive. The recommendations are based on something called TF-IDF vector cosine similarity: each news story is converted into a vector of weighted term frequencies, and stories whose vectors point in similar directions are treated as related.
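
To make the idea concrete, here is a minimal sketch of TF-IDF cosine similarity using scikit-learn. The three summaries are invented for illustration and are not from the WantToKnow archive, and this is not necessarily how the original recommendation code was organized.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical article summaries, invented for illustration only.
summaries = [
    "Declassified documents reveal a secret government surveillance program.",
    "Newly declassified documents describe a government surveillance program.",
    "Local bakery wins an award for its sourdough bread.",
]

# Each summary becomes one row of TF-IDF weights over the shared vocabulary.
tfidf_matrix = TfidfVectorizer().fit_transform(summaries)

# Pairwise cosine similarity: entry [i, j] compares summary i with summary j.
similarity = cosine_similarity(tfidf_matrix)

# Recommend the most similar *other* summary for summary 0.
scores = similarity[0].copy()
scores[0] = -1.0  # exclude the summary itself
best_match = int(scores.argmax())
print("Closest to summary 0:", summaries[best_match])
```

Summaries that share many distinctive words score close to 1, and summaries with no words in common score 0.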

More recently I was inspired to expand the underlying tech to vector search. WantToKnow has good search already, but it's keyword-based. My thinking is that vectorizing search queries and then comparing query vectors with news article vectors could surface good stories in situations where keywords alone aren't cutting it.
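
A rough sketch of that query-versus-articles comparison is below. The articles and the query are made up, and fitting the vectorizer on the articles first and then transforming the query separately is an assumption for this example, not a description of the finished app.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical article summaries and query, for illustration only.
articles = [
    "Whistleblower exposes mass surveillance of phone records by intelligence agency.",
    "Pharmaceutical company hid drug trial data showing serious side effects.",
    "Bank fined for laundering money for drug cartels.",
]
query = (
    "Which intelligence agency programs collected phone records, "
    "and who was the whistleblower that exposed the surveillance?"
)

# Learn the vocabulary and IDF weights from the articles.
vectorizer = TfidfVectorizer()
article_vectors = vectorizer.fit_transform(articles)

# Project the query into the same vector space (unknown words are ignored).
query_vector = vectorizer.transform([query])

# One similarity score per article, then rank from best to worst.
scores = cosine_similarity(query_vector, article_vectors).flatten()
ranked = scores.argsort()[::-1]
print("Best match:", articles[ranked[0]])
```

A longer, more detailed query gives the vectorizer more terms to match on, which fits the advice on the search page to write five or six sentences.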

Success

Today I got a vector search app to the minimum viable product stage. I made a web page that takes any detailed question or description about any conspiracy-related topic as input and outputs a list of the 20 most relevant news article summaries. All of the logic is Python, glued to the HTML with PyScript, with a CSV file stored on IPFS instead of a database.
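
For anyone who wants to try the CSV-instead-of-a-database approach outside the browser, here is a small sketch of loading and cleaning a pipe-delimited file with pandas. The inline data is hypothetical and only borrows a few of the archive's column names. The real page reads the file from an IPFS URL instead.

```python
import io
import re

import pandas as pd

# Hypothetical pipe-delimited data standing in for the real archive file.
raw_csv = (
    "ArticleId|Title|Description|Priority\n"
    "1|First story|Some   text\twith  odd spacing|2\n"
    "2|First story|Duplicate title, dropped below|2\n"
    "3|Second story|No priority set|\n"
    "4|Third story|Clean text|1\n"
)

df = pd.read_csv(
    io.StringIO(raw_csv),
    sep="|",
    usecols=["ArticleId", "Title", "Description", "Priority"],
)

# Keep the first row for each title and drop rows with no priority.
df = df.drop_duplicates("Title")
df = df[df["Priority"].notna()].copy()

# Collapse any run of whitespace into a single space.
df["Description"] = df["Description"].apply(lambda text: re.sub(r"\s+", " ", str(text)))

df = df.reset_index(drop=True)
print(df[["Title", "Description"]])
```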

<!DOCTYPE html>
<html lang="en">
<head>
  <title>WantToKnow Archive Vector Search</title>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width,initial-scale=1">
  <link rel="stylesheet" href="https://pyscript.net/releases/2023.05.1/pyscript.css" />
  <script defer src="https://pyscript.net/releases/2023.05.1/pyscript.js"></script>
  <style>
    body { margin-left: 20%; margin-right: 20%; }
    #mainstory { color: white; background-color: black; padding: 10px }
    textarea { width: 100%; height: 150px; padding: 12px 20px; box-sizing: border-box; border: 5px solid black; background-color: #f8f8f8; font-size: 16px; resize: none; }
    button { width: 100%; color: white; background-color: black; font-size: 24px; text-align: center; padding: 12px; }
    button:hover { color: black; background-color: white; }
  </style>
</head>
<body>
<py-config>
packages = [ "pandas", "scikit-learn" ]
terminal = false
</py-config>
<h1>WantToKnow.info Archive Vector Search</h1>
<p>Find news article recommendations based on term frequency-inverse document frequency (TF-IDF) vector cosine similarities. A search returns the 20 most closely related summaries.</p>
<p><strong>Instructions:</strong> enter a question or statement. When it comes to conspiracies and cover-ups, what do you most want to know? Be as detailed as possible. Five or six sentences is optimal. Press the submit button only once and wait for the data to be crunched.</p>
<textarea id="askit">What do you want to know?</textarea>
<button id="submit-btn">Submit Query for Processing</button>
<div id="mainstory"></div>
<div id="relatedstories"></div>
<script type="pyscript">
import pandas as pd
import re
from js import console
from pyscript import when, display
from pyodide.http import open_url
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

@when('click', '#submit-btn')
def query():
    question = Element('askit').element.value
    Element('mainstory').write(question)
    url = 'the IPFS url of my csv file'
    df = pd.read_csv(open_url(url), sep='|', usecols=['ArticleId','Title','PublicationDate','Publication','Links','Description','Priority','url'])

    # Deduplication and NaN cleanup
    df = df.drop_duplicates('Title')
    df = df[df['Priority'].notna()]

    # Substituting multiple spaces with single space
    df['Description'] = df['Description'].apply(lambda x: re.sub(r'\s+', ' ', str(x)))

    # Remove double quotes
    df['Description'] = df['Description'].apply(lambda r: r.replace('\"\"', '\"'))

    # Remove paragraph styling
    df['Description'] = df['Description'].apply(lambda r: r.replace('', ''))
    df['Description'] = df['Description'].apply(lambda r: r.replace('', ''))
    df['Description'] = df['Description'].apply(lambda r: r.replace('', ''))
    df['Description'] = df['Description'].apply(lambda r: r.replace(' ', ''))

    query_row = pd.DataFrame({'ArticleId': '54321','Title': 'Search Terms','PublicationDate': '','Publication': '','Links': '','Description': 'Variable','Priority': '','url': ''}, index=[0])
    df = pd.concat([query_row, df]).reset_index(drop=True)
    df.at[0, 'Description'] = question

    # Compute TF-IDF vectors and cosine similarities
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(df['Description'])
    cosine_similarities = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix).flatten()

    # Find the 20 most similar articles
    similar_indices = cosine_similarities.argsort()[-21:-1][::-1]
    similar_items = df.iloc[similar_indices]

    # Display the results in the specified format
    result_html = ""
    for index, row in similar_items.iterrows():
        for col in df.columns:
            result_html += f"{col}: {row[col]} "
        result_html += " "
    display(result_html, target="relatedstories")
</script>
</body>
</html>
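
The slice used to pick the top matches in the page above is compact enough to be worth unpacking. Here is a small sketch with made-up scores and a hypothetical top_n of 3 instead of 20, where position 0 plays the role of the query row compared with itself.

```python
import numpy as np

# Hypothetical similarity scores; index 0 is the query compared with itself.
scores = np.array([1.0, 0.10, 0.75, 0.30, 0.05, 0.60])
top_n = 3

# argsort() orders indices from lowest to highest score, so the query
# (score 1.0) lands last. The slice [-(top_n + 1):-1] skips that last
# entry and keeps the next top_n, and [::-1] puts the best match first.
top_indices = scores.argsort()[-(top_n + 1):-1][::-1]
print(top_indices.tolist())  # [2, 5, 3]
```

One caveat: the slice assumes the query row sorts last. If an article scored exactly 1.0 as well, the tie could leave the query row inside the results.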

As of now, the results display needs work, but the thing is basically operational. Calling the main function with an event-listening decorator (PyScript's @when) still seems weird to me, but this was the only way I could get it to work. I ended up using GPT-4 to get the cosine similarities computed efficiently and was surprised by how much better GPT-4 is than GPT-3.5.
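
One possible direction for the display cleanup is to build a small HTML snippet per result rather than dumping every column. The sketch below is only an illustration: format_result and format_results are hypothetical helpers, the sample row is invented, and escaping the text is a design assumption that would need revisiting if the descriptions are meant to carry their own markup.

```python
from html import escape


def format_result(row):
    """Render one article (a dict-like row) as an HTML snippet."""
    title = escape(str(row["Title"]))
    url = escape(str(row["url"]), quote=True)
    source = escape(f'{row["Publication"]}, {row["PublicationDate"]}')
    summary = escape(str(row["Description"]))
    return (
        f'<h3><a href="{url}">{title}</a></h3>'
        f"<p><em>{source}</em></p>"
        f"<p>{summary}</p>"
    )


def format_results(rows):
    """Join the snippets for several rows into one HTML string."""
    return "\n".join(format_result(row) for row in rows)


# Invented sample row using column names from the archive CSV.
sample = {
    "Title": 'Report on "Project X" & its fallout',
    "url": "https://example.com/a?x=1&y=2",
    "Publication": "Example Times",
    "PublicationDate": "2023-05-01",
    "Description": "Summary with <i>leftover</i> markup.",
}
html_snippet = format_results([sample])
print(html_snippet)
```

Because a pandas row supports the same row["Title"] lookups, the rows yielded by similar_items.iterrows() could be passed in directly. In the page, one option may be assigning the resulting string to Element('relatedstories').element.innerHTML, though that has not been tested here.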

When I first started this project, my plan was to pre-compute the vectors to conserve browser resources. But storing the vectors in the CSV made its size balloon from 27 MB to 3.5 GB, so I went with browser-computed vectors instead, and it actually seems okay. A search takes well under a minute, and the results are highly relevant.
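
The post doesn't say how the vectors were written out, but one plausible reason for that kind of growth is that TF-IDF vectors are mostly zeros, and storing every zero is expensive. The sketch below uses a synthetic corpus (an assumption, not the real archive) to compare the in-memory footprint of a dense layout with the sparse matrix scikit-learn actually returns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Synthetic corpus: 50 made-up "summaries" of 5 unique words each, so every
# document uses only a small slice of the full vocabulary.
summaries = [
    " ".join(f"topic{doc}word{word}" for word in range(5))
    for doc in range(50)
]

tfidf_matrix = TfidfVectorizer().fit_transform(summaries)  # SciPy sparse matrix
n_docs, n_terms = tfidf_matrix.shape

# Dense layout: one float for every (document, term) pair, zeros included.
dense_bytes = n_docs * n_terms * tfidf_matrix.dtype.itemsize

# Sparse layout: only the non-zero values plus their index arrays.
sparse_bytes = (
    tfidf_matrix.data.nbytes
    + tfidf_matrix.indices.nbytes
    + tfidf_matrix.indptr.nbytes
)

print(f"{n_docs} docs x {n_terms} terms")
print(f"dense: {dense_bytes} bytes, sparse: {sparse_bytes} bytes")
```

If pre-computing ever becomes attractive again, a sparse on-disk format might be worth a look, with the caveat that the fitted vocabulary would also have to ship with it so queries can be vectorized the same way.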

As for next steps, after cleaning up the display, there are a few directions I could take the project in. I'd like to embed a Telegram group discussion in the page, but the available embed widget doesn't work, so I could try to do something with their API. I'm also looking at trying to send search results to GPT to generate a 500-word summary brief of the material. That might be pretty cool.
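
As a starting point for the summary brief idea, here is a sketch of how the search results might be packed into a prompt. build_brief_prompt is a hypothetical helper, the wording of the prompt is a guess, and the step of actually sending it to a model is left out because it depends on whichever API ends up being used.

```python
def build_brief_prompt(question, summaries, word_limit=500):
    """Combine the user's question and the top summaries into one prompt."""
    numbered = "\n".join(
        f"{i}. {text.strip()}" for i, text in enumerate(summaries, start=1)
    )
    return (
        f"Write a brief of at most {word_limit} words that answers the "
        "question below, using only the numbered news summaries that follow.\n\n"
        f"Question: {question.strip()}\n\n"
        f"Summaries:\n{numbered}\n"
    )


# Invented inputs for illustration.
prompt = build_brief_prompt(
    "What is known about government surveillance programs?",
    ["First example summary. ", " Second example summary."],
)
print(prompt)
```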
