The above image was made with Midjourney using the prompt 'a blue python slithering through computer coding numbers.'
Last year, I wrote a news recommendation algorithm for WantToKnow.info. You can read about the project here and test out the recommendations by clicking on any article title in our archive. The recommendations are based on something called TF-IDF vector cosine similarity: each news story is turned into a vector of weighted word frequencies, and stories whose vectors point in similar directions are treated as related.
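To make that concrete, here is a minimal from-scratch sketch of TF-IDF weighting and cosine similarity in plain Python. It is not the code behind the WantToKnow recommendations. The function names `tfidf_vectors` and `cosine`, the three sample headlines, and the whitespace tokenizer are all assumptions for illustration. The IDF formula is the smoothed variant that scikit-learn uses by default.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one L2-normalized {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed IDF: terms that appear in fewer documents get more weight
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two L2-normalized sparse vectors (a dot product)."""
    return sum(weight * b.get(term, 0.0) for term, weight in a.items())


# Hypothetical headlines, not taken from the archive
stories = [
    "government surveillance program collects phone records",
    "surveillance program collects phone records of citizens",
    "new study finds health benefits of meditation",
]
vecs = tfidf_vectors(stories)
print(cosine(vecs[0], vecs[1]))  # related stories share weighted terms
print(cosine(vecs[0], vecs[2]))  # no shared terms
```

The first two headlines share most of their words, so their score is well above zero, while the first and third share none and score exactly zero.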
More recently, I was inspired to expand the underlying tech to vector search. WantToKnow has good search already, but it's keyword-based. My thinking is that vectorizing search queries and then comparing query vectors with news article vectors could surface good stories in situations where keywords alone aren't cutting it.
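Here is a rough sketch of that idea using scikit-learn. It is one possible arrangement, not the page's actual code: it fits the vectorizer on the articles alone and then projects the query into that space with `transform`, whereas the page further down adds the query to the data as an extra row before vectorizing. The function name `vector_search` and the sample summaries are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def vector_search(query, summaries, top_n=3):
    """Rank summaries by TF-IDF cosine similarity to a free-text query."""
    vectorizer = TfidfVectorizer()
    # Learn the vocabulary and IDF weights from the articles only
    article_vectors = vectorizer.fit_transform(summaries)
    # Project the query into the same vector space
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, article_vectors).flatten()
    best = scores.argsort()[::-1][:top_n]
    return [(summaries[i], float(scores[i])) for i in best]


# Hypothetical summaries standing in for the real archive
summaries = [
    "Intelligence agencies collect phone records of millions of citizens",
    "Banks fined for manipulating interest rates",
    "Study links meditation to lower stress levels",
]
results = vector_search("Do intelligence agencies collect phone records?", summaries)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

A longer, more detailed query gives the vectorizer more terms to match on, which is why a few sentences tend to work better than a couple of keywords.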
Today I got a vector search app to the minimum viable product stage. I made a web page that takes any detailed question or description about any conspiracy-related topic as input and outputs a list of the 20 most relevant news article summaries. All of the logic is Python, glued to the HTML with PyScript, with a CSV file stored on IPFS instead of a database.
<!DOCTYPE html>
<html lang="en">
<head>
<title>WantToKnow Archive Vector Search</title>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width,initial-scale=1">
<link rel="stylesheet" href="https://pyscript.net/releases/2023.05.1/pyscript.css" />
<script defer src="https://pyscript.net/releases/2023.05.1/pyscript.js"></script>
<style>
  body {
    margin-left: 20%;
    margin-right: 20%;
  }
  #mainstory {
    color: white;
    background-color: black;
    padding: 10px;
  }
  textarea {
    width: 100%;
    height: 150px;
    padding: 12px 20px;
    box-sizing: border-box;
    border: 5px solid black;
    background-color: #f8f8f8;
    font-size: 16px;
    resize: none;
  }
  button {
    width: 100%;
    color: white;
    background-color: black;
    font-size: 24px;
    text-align: center;
    padding: 12px;
  }
  button:hover {
    color: black;
    background-color: white;
  }
</style>
</head>
<body>
<py-config>
packages = [
"pandas",
"scikit-learn"
]
terminal = false
</py-config>
<h1>WantToKnow.info Archive Vector Search</h1>
<p>Find news article recommendations based on term frequency-inverse document frequency (TF-IDF) vector cosine similarities. A search returns the 20 most closely related summaries.</p>
<p><strong>Instructions:</strong> enter a question or statement. When it comes to conspiracies and cover-ups, what do you most want to know? Be as detailed as possible. Five or six sentences is optimal. Press the submit button only once and wait for the data to be crunched.</p>
<textarea id="askit">What do you want to know?</textarea>
<button id="submit-btn">Submit Query for Processing</button>
<div id="mainstory"></div>
<div id="relatedstories"></div>
<script type="pyscript">
import pandas as pd
import re
from js import console
from pyscript import when, display
from pyodide.http import open_url
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
@when('click', '#submit-btn')
def query():
    # Read the search text and echo it at the top of the results
    question = Element('askit').element.value
    Element('mainstory').write(question)
    url = 'the IPFS url of my csv file'
    df = pd.read_csv(open_url(url), sep='|', usecols=['ArticleId','Title','PublicationDate','Publication','Links','Description','Priority','url'])
    # Deduplication and NaN cleanup
    df = df.drop_duplicates('Title')
    df = df[df['Priority'].notna()]
    # Substituting multiple spaces with single space
    df['Description'] = df['Description'].apply(lambda x: re.sub(r'\s+', ' ', str(x)))
    # Collapse doubled double quotes into a single double quote
    df['Description'] = df['Description'].apply(lambda r: r.replace('\"\"', '\"'))
    # Remove paragraph styling (tag strings omitted)
    df['Description'] = df['Description'].apply(lambda r: r.replace('', ''))
    df['Description'] = df['Description'].apply(lambda r: r.replace('', ''))
    df['Description'] = df['Description'].apply(lambda r: r.replace('', ''))
    df['Description'] = df['Description'].apply(lambda r: r.replace('', ''))
    # Insert the search text as row 0 so it is vectorized alongside the articles
    query_row = pd.DataFrame({'ArticleId': '54321','Title': 'Search Terms','PublicationDate': '','Publication': '','Links': '','Description': 'Variable','Priority': '','url': ''}, index=[0])
    df = pd.concat([query_row, df]).reset_index(drop=True)
    df.at[0, 'Description'] = question
    # Compute TF-IDF vectors and cosine similarities
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(df['Description'])
    cosine_similarities = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix).flatten()
    # Find the 20 most similar articles (the last sorted position is the query itself)
    similar_indices = cosine_similarities.argsort()[-21:-1][::-1]
    similar_items = df.iloc[similar_indices]
    # Display the results in the specified format
    result_html = ""
    for index, row in similar_items.iterrows():
        for col in df.columns:
            result_html += f"{col}: {row[col]}<br>"
        result_html += "<br>"
    display(result_html, target="relatedstories")
</script>
</body>
</html>
As of now, the results display needs work, but the thing is basically operational. Calling the main function with an event-listening decorator still seems weird to me, but this was the only way I could get it to work. I ended up using GPT-4 to get the cosine similarities computed efficiently and was surprised by how much better GPT-4 is than GPT-3.5.
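One detail of the ranking step is worth spelling out. The slice `argsort()[-21:-1][::-1]` in the script relies on the query row normally scoring 1.0 against itself, so it sorts into the last position and gets dropped. Below is a small sketch of an alternative that excludes row 0 explicitly instead of relying on sort order. The helper name `top_matches` and the sample scores are made up for illustration.

```python
import numpy as np


def top_matches(similarities, n=20):
    """Indices of the n best-scoring rows, best first, skipping row 0 (the query)."""
    article_scores = np.asarray(similarities)[1:]
    order = article_scores.argsort()[::-1][:n]
    return order + 1  # shift back to positions in the full array


# Hypothetical scores: row 0 is the query compared with itself
scores = np.array([1.0, 0.10, 0.75, 0.00, 0.40])
print(top_matches(scores, n=3))         # [2 4 1]
print(scores.argsort()[-4:-1][::-1])    # [2 4 1], same result via the slice
```

Both approaches agree whenever the query is its own best match. They can differ in edge cases, such as a query with no recognizable words, where the query row no longer sorts last.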
When I first started this project, my plan was to pre-compute the vectors to conserve browser resources. But storing the vectors in the CSV made its size balloon from 27 MB to 3.5 GB. So I went with browser-computed vectors instead, and it actually seems okay. A search takes well under a minute, and the relevance of the results is excellent.
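A likely contributor to that kind of blow-up is that TF-IDF vectors are almost entirely zeros: each article uses only a tiny fraction of the whole vocabulary. This is an assumption about the cause, since how the vectors were written to the file isn't covered here. The sketch below uses a made-up corpus to compare the memory needed for the sparse matrix scikit-learn returns with the same matrix expanded to a dense array.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: 50 summaries, each with 5 words used nowhere else
summaries = [" ".join(f"word{d}x{w}" for w in range(5)) for d in range(50)]

matrix = TfidfVectorizer().fit_transform(summaries)  # SciPy sparse matrix

# Sparse storage keeps only the non-zero weights plus their positions
sparse_bytes = matrix.data.nbytes + matrix.indices.nbytes + matrix.indptr.nbytes
# Dense storage keeps every cell, zeros included
dense_bytes = matrix.toarray().nbytes

print(matrix.shape, matrix.nnz)
print(f"sparse: {sparse_bytes} bytes, dense: {dense_bytes} bytes")
```

The gap widens as the vocabulary grows, and writing the numbers out as text in a CSV adds further overhead on top of the raw byte counts.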
As for next steps, after cleaning up the display, there are a few directions I could take the project in. I'd like to embed a Telegram group discussion in the page, but the available embed widget doesn't work, so I could try to do something with their API. I'm also looking at sending search results to GPT to generate a 500-word summary brief of the material. That might be pretty cool.