Full Site Search is Back, Baby!
I’ve relaunched the site’s main search tool. You can now conduct searches via the search box in the upper right.
This means that there are now five different ways to search for stuff on the site. You can type something into the box at the upper right and you’ll receive results from the articles and database (images and other types will be added soon). In the individual sections for the articles, images and database, you can do more advanced searches: by date for the article search; by publisher, platform and item type (game, peripheral, person, etc.) for the database search; and by size and proportion for the image search. You can also do in-page searches by highlighting a term while reading an article.
Look below for some technical details about my search system.
The full search tool took me some extra time because of two considerations:
- Coming up with a presentation that works for all types of content
- Coming up with a scheme for sorting the results intelligently
For the first problem, I used a Pinterest-style wall of results. When you scroll to the bottom, more results are loaded automatically.
I personally find this type of presentation unreadable, but I wanted to play around with different presentation styles, so I went with this route. I’m not sure if I will keep it in this form.
Like Pinterest, the individual item blurbs are of fixed width but varying height. This currently causes the bottom of the page to be uneven — sometimes extremely so, if some columns have lots of items without images. I can think of a couple of ways to fix this — adding new results to shorter columns first, for instance. I’ll implement those schemes later. I also currently don’t assign heights to my images, which causes the page to animate downward when new results are loaded in.
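The "shorter columns first" idea mentioned above could look something like the following. This is only a rough sketch of the placement logic, not the site's actual code; the function name `assign_to_columns` and the pixel heights are made up for illustration, and it assumes each item's height is known before it is placed (which would also require assigning heights to images).

```python
def assign_to_columns(item_heights, column_count):
    """Place each item into whichever column is currently shortest.

    item_heights: heights of the result blurbs, in display order.
    Returns (columns, column_heights), where columns holds item indexes.
    """
    columns = [[] for _ in range(column_count)]
    column_heights = [0] * column_count
    for index, height in enumerate(item_heights):
        # Pick the column with the smallest running height (ties go left).
        target = min(range(column_count), key=lambda c: column_heights[c])
        columns[target].append(index)
        column_heights[target] += height
    return columns, column_heights


# Hypothetical heights: two tall blurbs with images, four short ones without.
columns, heights = assign_to_columns([300, 120, 120, 300, 120, 120], 3)
```

The trade-off is that items no longer fill the columns strictly left to right, so the visual order drifts a little from the ranking order.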
For the second problem, the sorting scheme, Solr and Lucene give you plenty of power. I assign a score to individual documents at index time and give boosts to certain fields at query time. When grabbing the results, I make Solr sort by score.
To determine a document’s score at index time, I use two broad considerations: recency and popularity.
I compute the document’s recency score and popularity score, then add them together to get the document’s full score.
The meaning of recency and popularity depends on the type of document.
- For content and images, recency means when the content was posted. Newer content has a higher recency score. For database entries, recency means how close to the current date the item’s release date is. Items that were just released or are soon to be released have higher recency scores compared to older items and items that are farther off.
- For content, popularity includes page views, number of comments, and social media cues like “likes” and “retweets” and so forth. For database items, popularity includes social media cues and the number of recent images and articles. I currently don’t include the page views or social media cues, as my database at present can’t store such “computed” fields (this will change super soon).
For the actual computations, I’ve totally forgotten all my college maths (I’m the Berkeley math department’s most shameful outcome) but I found a few recommendations at this page.
Recency is computed using a reciprocal function: A / (M * X + B)
M, A and B are constants. For content, X is the difference between the current time stamp and the time stamp of the article. For database items, X is the absolute value of the difference between the current date and the release date (so games released three weeks ago get the same score as games coming out three weeks from now, which may not be a good idea).
Changing the constants changes the shape of the curve. You can make the curve really steep so that articles that you just posted have a super huge boost, making sure they will be at the top of results.
See page 7 of the above link for further details on the function’s use.
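To make the recency curve concrete, here is a small sketch of the reciprocal function. The constant values and the choice of days as the unit for X are placeholders I picked for illustration; the real constants aren't given here.

```python
from datetime import date


def recency_score(x, a=1.0, m=1.0, b=1.0):
    """Reciprocal recency curve A / (M * X + B), for an age x >= 0."""
    return a / (m * x + b)


def release_distance_days(today, release_date):
    """X for database items: absolute distance from the release date, in days."""
    return abs((today - release_date).days)


today = date(2012, 6, 1)  # hypothetical "current date"
three_weeks_ago = release_distance_days(today, date(2012, 5, 11))
three_weeks_ahead = release_distance_days(today, date(2012, 6, 22))

# Both distances are 21 days, so both items get the same recency score.
past_score = recency_score(three_weeks_ago, m=0.1)
future_score = recency_score(three_weeks_ahead, m=0.1)

# A larger M makes the curve steeper: old items drop off faster.
gentle = recency_score(30, m=0.01)
steep = recency_score(30, m=1.0)
```

With X = 0 the score is simply A / B, which is the maximum boost a brand-new article can get.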
Popularity is computed using this:
(0.6 * recent) + (0.4 * lastWeek) + (0.2 * lastMonth) …
The “recent”, “lastWeek”, etc. are figures that reflect the amount of popularity for the given time period. For content, I just have the number of comments for now, but will include page views soon. For database items, I use the number of images and articles added for the item over the period. The multipliers were obtained by trial and error.
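As a quick sketch, the popularity formula above and the "add them together" step might be written like this. Only the three terms actually shown in the formula are included (whatever follows the "…" is left out), and the function names and sample counts are my own inventions.

```python
def popularity_score(recent, last_week, last_month):
    """Weighted sum of activity counts, with recent activity weighted most."""
    return 0.6 * recent + 0.4 * last_week + 0.2 * last_month


def document_score(recency, popularity):
    """Full index-time score: recency plus popularity."""
    return recency + popularity


# Hypothetical article: 10 comments very recently, 5 last week, none last month.
popularity = popularity_score(recent=10, last_week=5, last_month=0)

# Hypothetical recency value of 0.5 from the reciprocal function.
score = document_score(0.5, popularity)
```

Since the two parts are simply added, the constants in the recency function decide how much a fresh but quiet article can outrank an older, busier one.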
Outside of the document score, I also have a query-time boost system in place when the user is doing a text search. Each document has three main text fields of decreasing importance. These three are boosted differently.
For content, the text of the headline and strapline go in the first text field, which gets the biggest boost. The text of the first couple of hundred words and the names of all related objects go in the second text field. The text of the rest of the document goes in the third text field.
For database items, the various names (English, Japanese, roman, etc.) all go in the first field. The second field has the names of related objects. For a game, this could be PlayStation 3 (the platform), Square Enix (the publisher), and the development staff. Including those related object fields means that entering Square Enix into the search box will return all games published by Square Enix. Square Enix the company (and a game with the words “Square Enix” in the title) should come out on top of all those games, though, because “Square Enix” is in the first text field.
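For the curious, per-field boosts like these can be expressed in Solr through the `qf` parameter of the dismax/edismax query parsers. The sketch below just builds the request parameters; using edismax is an assumption on my part for illustration, and the field names (`text_primary`, `text_secondary`, `text_rest`) and boost values are invented placeholders, not the real ones.

```python
from urllib.parse import urlencode


def build_search_params(user_query, boosts, rows=20):
    """Build Solr request parameters with per-field query-time boosts."""
    # qf takes space-separated "field^boost" pairs.
    qf = " ".join(f"{field}^{boost}" for field, boost in boosts.items())
    return {
        "q": user_query,
        "defType": "edismax",  # assumed query parser
        "qf": qf,
        "sort": "score desc",  # highest-scoring documents first
        "rows": rows,
    }


# Hypothetical fields, in decreasing order of importance.
params = build_search_params(
    "Square Enix",
    {"text_primary": 10, "text_secondary": 3, "text_rest": 1},
)
query_string = urlencode(params)  # ready to append to a Solr /select URL
```

A match in the first field counts for much more than a match in the second or third, which is why the company entry should beat the games it published.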
The actual numbers for all the boosts, the way you combine the different boosts, and the determination of what items go in what fields are part of the search engine’s “secret sauce” (to borrow a term used by people who probably need to be punched in the face). I’m sure there’s a lot of science to it, but I just played around with the figures until I was happy with the results.