3D printing a tabletop-size Martian landscape

During the winter of 2016, I was a Science Sandbox Fellow at New Lab, a futuristic and quirky workspace in the Brooklyn Navy Yard. When I saw the square-meter-sized BigRep large-format 3D printer there and was told I had access to it, I instantly knew I wanted to make a large 3D-printed alien landscape. Mars was an easy choice, mainly because I could imagine that in the near future someone might stand on Mars, seeing the same landscape I printed. There is also excellent very high-resolution terrain data available for some regions of Mars, meaning that I could accurately print a region that humans could relate to in scale.

My starting point was to look at the digital terrain models, DTMs for short, which are created and released by the HiRISE team. HiRISE is an extremely high-resolution camera run by the University of Arizona's Lunar and Planetary Laboratory together with NASA's Jet Propulsion Laboratory. Each DTM is built from high-resolution images taken from two different angles, which are combined with other data to make a highly detailed 3D surface map. The HiRISE camera is on the Mars Reconnaissance Orbiter spacecraft, which has been observing Mars from orbit since March 2006.


An image of the Mars Reconnaissance Orbiter aerobraking in the Martian atmosphere upon its arrival. NASA/JPL/Corby Waste

I searched through the several hundred DTMs available at the time looking for particularly interesting terrain to print, and settled on this fascinating layered landscape in the giant Martian canyon Valles Marineris. The canyon measures one fifth the total circumference of Mars, stretching the distance from New York to L.A., and is four times as deep as the Grand Canyon. The printed area covers a tiny fraction of Valles Marineris, located in the southwest part of the sub-region Candor Chasma. This area of Mars is not only visually striking but also interesting from a scientific perspective: the layered hills here are thought to have been formed by deposits from extremely salty water early in Mars' history. More information about the geologic processes that shaped this area can be found in this paper and presentation.

 

To import the terrain data, I used the HiRISE DTM plugin for the free 3D modeling program Blender, which made it very easy to get the terrain into a mesh format that can be manipulated (see the plugin tutorial). The mesh is so detailed that keeping it at full resolution will likely crash the program, and it is also more detailed than our 3D printer can handle. If we were printing at the full resolution of the HiRISE data, we could resolve shockingly small features, only ~80 cm in size!
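For anyone who wants to try this, thinning the mesh can be scripted from Blender's Python console. Here is a minimal sketch, assuming the imported terrain is the active object; the decimate ratio is illustrative, not the value I used.

```python
import bpy

# The imported HiRISE terrain, assumed to be the active object
# (select the mesh first if it isn't).
terrain = bpy.context.active_object

# Add a Decimate modifier and keep ~10% of the faces so the mesh
# stays light enough to edit and slice.
mod = terrain.modifiers.new(name="Decimate", type='DECIMATE')
mod.ratio = 0.1

# Bake the simplified geometry into the mesh data.
bpy.ops.object.modifier_apply(modifier=mod.name)
```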

Due to the combination of the printer resolution and the area the model covers, about 4 km x 4 km, the actual resolution of the 3D print is much lower. After importing the mesh, I extruded the landscape and cut the bottom. Margaret Hewitt of BigRep was very helpful with cleaning the 3D model using Rhino and preparing it to print. During test printing we found that the print came out much smoother and less “terraced” when printed vertically, standing on a thin edge. The model was printed on a BigRep One as a set of 16″ x 16″ tiles to help cut down on warping, making a total area of 32″ x 32″. It was printed with heat-tolerant, silver-colored PRO HT filament since I thought I might be vacuum forming over the tiles.
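As a quick sanity check on that resolution claim, here is the arithmetic; the 1 mm extrusion width is my assumption for a large-format nozzle, not a measured spec from the build.

```python
# ~4 km of terrain mapped onto a 32-inch print.
terrain_m = 4000.0
print_m = 32 * 0.0254            # 32 inches is about 0.81 m

scale = terrain_m / print_m      # ~4900, i.e. roughly 1:5000
extrusion_m = 0.001              # assumed 1 mm extrusion width

print(f"1:{scale:.0f}, smallest feature ~{extrusion_m * scale:.1f} m")
# -> each millimeter of plastic spans ~5 m of Mars, far coarser than
#    the ~80 cm the raw HiRISE DTM could support.
```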

I wanted the surface to resemble Mars as much as possible, so I gave the print a coat of fine stone-texture spray paint, and then layered on other paints using the color HiRISE images as a guide. I sealed it with many layers of clear matte finish so that the surface could be touched.

The Marscape is currently being used in astronomy classes at CUNY’s LaGuardia Community College, and will be used in the future for programs for the visually impaired at the American Museum of Natural History and other educational programs. If you have any ideas about how the print can be used, or if you’re in the NYC area and would like to borrow it for educational purposes, please let me know.

Finally I’d like to thank the folks at New Lab, Science Sandbox, BigRep, and the Simons Foundation for supporting the project and generally doing amazing things, particularly Greg Boustead, Hannah Max, Mari Kussman, Alex Susse, and Margaret Hewitt.


Colorfully painting one of my test prints, because why not?

An evening of insect delights

I didn’t expect to spend my Thursday night eating insects, but I ended up at Alchemist’s Kitchen with a panel of proponents of entomophagy. I’m usually a vegetarian, but broke with that rule to try some insect-based offerings. I’m not sure whether insects have consciousness, and I’d want more study on that before supporting entomophagy on a large scale, but I think it is both possible and beneficial to offset some of our dietary protein with insect protein. There are several argued benefits. One is environmental: insect production takes about three hundred times less water per unit protein than meat production. Another is ethical. While it could be that the larger number of individuals per unit protein outweighs everything else, most people would say a single cricket is less capable of suffering than an individual cow.

It’s one thing to say it’s likely to be beneficial, another to argue that western society will accept eating insects when there is such a strong aesthetic prejudice against it. I felt a little grossed out when eating the insects, but no more than I felt while eating particularly meat-like substitutes like the Impossible Burger.

The appetizers at the event were prepared by chef Joseph Yoon, and were delicious. Some of the bites had obvious crickets and wax worms on top, lending them some shock value. Others were subtle, with cricket flour (read: ground cricket) mixed in. The crickets tasted earthy or nutty, but I couldn’t taste the wax worms. I think it was just a coincidence, but I did end the evening with an upset stomach.

The event was hosted by Paul Miller, who pointed out the importance of less water- and energy-intensive farming methods in the context of global warming. One of the panelists, Robyn Shapiro, sells snack balls made with cricket flour through her company Seek. I tried her coconut-flavored cricket balls and liked them; they reminded me a bit of Lara Bars. Mitchell Joachim presented his work with Terreform ONE, an architecture firm based at New Lab. They designed a futuristic-looking cricket shelter, a modular insect farm designed to both house the crickets and encourage mating with acoustic effects.

There are some specific benefits of eating insects I personally find interesting. I’ve been anemic since before I went vegetarian, and I struggle to get enough iron in my diet. Cricket flour contains over three times as much iron as the same quantity of spinach, one of the richest vegetarian sources of iron. I can’t see entomophagy catching on without some major cultural shifts, but as the panelists noted, lobster was once considered an undesirable food as well, so tastes change.


Terreform ONE’s futuristic cricket farm design

I grew a plant pot out of mushrooms the other day

These days I’ve been spending time at New Lab, a shared office space filled with futuristic start-ups working in 3D printing, robotics, design, and A.I., as part of the Science Sandbox fellowship. Being there sometimes feels a bit like I’m in a mash-up of Star Trek and The Fifth Element, which is exactly where I’d most like to be in the universe. So I suppose I shouldn’t have been surprised when someone walked into my office and said, “Want to go to a mushroom workshop?” Let me think…

The workshop was held by Danielle Trofe, a designer who is headquartered in Industry City. To make our pots, we used the same mycelium-infused material she uses in her MushLume lighting collection, which looks a bit like a hobbit’s version of high design. The same material was used to make a series of unusual benches at New Lab.


Sitting on a fungal bench, nbd.

We started by sanitizing the surfaces of a plastic takeout container and a plastic drinking cup that together would serve as the mold for our planters. The grow-it-yourself material, which is brown and looks like mulch, is made of corn husks and other agricultural byproducts that have been infused with liquid mycelium (you can buy it yourself at www.ecovativedesign.com).

We filled the containers and watched as, over the course of four days, the brown material slowly became whiter as the mycelium grew, binding the pot together. I removed the pot from the mold and enclosed it in a puffed-up plastic bag so that it could grow a nice, soft white layer on its outside surfaces without drying out. The final step made me feel strangely guilty: you kill the growth by baking it in the oven at 200 degrees for an hour. The final product is light and has give, and feels like you grew a pot out of mushrooms. It even has that lovely earthy fungal aroma.


The mycelium pot and its succulent tenant.

Dupe Snoop: Identify duplicate questions on Quora

This post describes a project I completed as an Insight Data Science fellow, a program which helps PhD academics transition to careers in Data Science.

Introduction

Quora is a website where users from the general public can ask questions and crowdsource answers. Posted questions can have multiple answers, and users upvote answers according to their usefulness. As Quora has grown, so has its problem with duplicate questions. Quora’s management has identified duplicate questions as a major threat to the quality of their product. As well as being annoying, repeated questions divide useful answers and their upvotes among the duplicates, making it challenging for users to find the most useful answers. To address this problem, Quora allows duplicate questions to be merged. Even with this feature, the site is still struggling with a proliferation of duplicates.

A screenshot of a Quora question asking why there are so many duplicate questions on Quora, which itself has been merged with a duplicate of itself. Meta.

Data

On January 30th, 2017, Quora released a dataset of over 400 thousand question pairs, some of which were asking the same underlying question and others which were not. The non-duplicate pairs were supplemented with questions asking about the same topic, representing topically related but semantically distinct questions. The question pairs were labelled by human taggers as true duplicates or non-duplicates, although this labelling is not perfect. (Update: I started on the dataset the day it was released, but it has since gained an associated Kaggle competition where you can find examples of a variety of approaches.)

Two examples from the dataset: one pair which is asking the same underlying question, and one which is topically related and superficially similar but semantically distinct.
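For anyone who wants to poke at the data, here is a minimal loading sketch with pandas; the filename and column layout are what I recall the TSV shipping with, so verify against the download.

```python
import pandas as pd

# Columns: id, qid1, qid2, question1, question2, is_duplicate
pairs = pd.read_csv("quora_duplicate_questions.tsv", sep="\t")

print(len(pairs))                    # ~400k question pairs
print(pairs["is_duplicate"].mean())  # fraction labelled as duplicates
print(pairs[["question1", "question2", "is_duplicate"]].head())
```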

Method

For my Insight Data Science project, I used natural language processing (NLP) methods to classify these question pairs as either likely duplicates or distinct questions. This is a challenging NLP problem because the questions are very short, meaning there are few semantic features to use to determine similarity. Subtle changes can also lead to large differences in meaning. As an example, take the questions “What do women want from men?” and “What do men want from women?” To a simple bag-of-words (BOW) model, which discards grammar and word order information, these questions look exactly the same. When I applied BOW models to the Quora duplicate problem, this was a common failure mode. This suggested to me that despite the additional complexity and computational investment, the best choice for the problem would be an NLP approach that takes syntactic information into account.
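To see that failure mode concretely, here is a quick check with scikit-learn; this is my illustration, not part of the original pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

questions = ["What do women want from men?",
             "What do men want from women?"]

# Count word occurrences with grammar and word order discarded.
bow = CountVectorizer().fit_transform(questions).toarray()
print(np.array_equal(bow[0], bow[1]))  # True: BOW cannot tell them apart
```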

Of the approaches I tested, the one with the highest accuracy is based on skip-thoughts. I was drawn to this model after seeing its success on a task measuring semantic similarity between short sentence pairs.

From Kiros et al. 2015. Each sentence pair was evaluated by a human for similarity, listed in this figure as GT (ground truth) and scored from 1-5, with 5 being the most similar. Skip-thoughts performed favorably on these metrics compared to other state-of-the-art NLP methods.

Skip-thoughts provides a method for encoding whole sentences as vectors, as opposed to the common NLP approach of building ad-hoc sentence encodings by combining vector embeddings for individual words. The model I used was trained with a recurrent neural network over a period of several weeks. The input for the unsupervised training was a large corpus of unpublished books containing over 74 million sentences. The resulting sentence vectors are embedded in a 4800-dimensional space, and questions which have similar vector representations under the model should be semantically and syntactically similar.

Visual description of the question comparison process.

I used the skip-thought model to encode each of my questions into a vector representation, and then computed the cosine similarity, which scores how closely the two vectors point in the same direction. The output from running this method on a subset of the question pairs shows that pairs with high cosine similarity are more likely to be duplicate questions.

Distributions of both duplicate and non-duplicate question pairs as a function of cosine similarity.
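A minimal sketch of the encode-and-compare step is below. The `skipthoughts` calls follow the interface of the original release (ryankiros/skip-thoughts) as best I recall it, so treat them as assumptions and check the repo.

```python
import numpy as np
import skipthoughts  # github.com/ryankiros/skip-thoughts

model = skipthoughts.load_model()
encoder = skipthoughts.Encoder(model)

def cosine_similarity(u, v):
    """Cosine of the angle between two sentence vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Each question becomes a 4800-dimensional skip-thought vector.
vecs = encoder.encode(["How do I learn Python?",
                       "What is the best way to learn Python?"])
print(cosine_similarity(vecs[0], vecs[1]))  # high for likely duplicates
```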

The accuracy of the skip-thought classification varies with the cosine similarity threshold, and tops out at around 67%. The variation in accuracy seen at high cosine similarity thresholds is likely due to noise.

Accuracy of question pair classification for the skip-thought method as a function of the cosine similarity threshold.
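The sweep behind that curve is straightforward to reproduce. Here is a sketch with stand-in numbers; in the project the inputs were the full arrays of cosine similarities and human labels.

```python
import numpy as np

def accuracy_by_threshold(sims, labels, thresholds):
    """Accuracy of 'predict duplicate when similarity > t' for each t."""
    sims, labels = np.asarray(sims), np.asarray(labels, dtype=bool)
    return np.array([np.mean((sims > t) == labels) for t in thresholds])

# Stand-in values, not real project data.
sims = [0.92, 0.31, 0.78, 0.55, 0.12, 0.88]
is_duplicate = [1, 0, 1, 1, 0, 1]

thresholds = np.linspace(0.0, 1.0, 101)
accs = accuracy_by_threshold(sims, is_duplicate, thresholds)
print(thresholds[accs.argmax()])  # on the real data the peak is ~67%
```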

The receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate, is commonly used to evaluate the performance of classification tasks. It shows that my Quora question pair classification does much better than random, which would be represented by the blue dotted line. The area under the ROC curve (known as AUC) is 0.72.

Receiver operating characteristic curve, showing classification is much better than random, with an area under the curve (AUC) of 0.72.
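Scikit-learn makes this computation a two-liner; the arrays here are the same stand-ins as above, while the real question pairs give the 0.72 quoted.

```python
from sklearn.metrics import roc_curve, roc_auc_score

sims = [0.92, 0.31, 0.78, 0.55, 0.12, 0.88]  # stand-in values
is_duplicate = [1, 0, 1, 1, 0, 1]

fpr, tpr, _ = roc_curve(is_duplicate, sims)  # points on the curve
print(roc_auc_score(is_duplicate, sims))     # ~0.72 on the real data
```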

Below a certain threshold in cosine similarity, the algorithm identifies non-duplicate pairs with high confidence, which could eliminate a large population of questions from the pool slated for manual review. For those questions with high cosine similarity, however, the false positive rate is high. This method would therefore be best suited to augmenting human duplicate identification and making that process more efficient.

If I had time to invest further work in improving the accuracy of the method, I would apply linguistic parsers or part-of-speech taggers to engineer relevant features, which could provide additional clues for question discrimination or be used in concert with other machine learning algorithms suited to classification problems.

Appendix: Other approaches

Since this was my first foray into NLP, I initially attacked the problem with a simple method known as term frequency – inverse document frequency (tf-idf). Term frequency refers to the number of times a term occurs in a document, logarithmically scaled. The inverse document frequency is the inverse of the fraction of documents which contain that term, also logarithmically scaled. This has the effect of down-weighting the importance of common terms and up-weighting the importance of rare terms when comparing documents. The tf-idf is then calculated by multiplying the term frequency by the inverse document frequency. The resulting matrix represents weighted word frequencies and can be used to compare each question pair.
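Here is a minimal tf-idf baseline in that spirit, using scikit-learn; this is my re-sketch, not the project's original code, and a real run would fit the idf weights on the whole question corpus rather than a single pair.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

q1 = "How do I learn to cook?"
q2 = "What is the best way to learn cooking?"

# Fit the vocabulary and idf weights, then vectorize both questions.
X = TfidfVectorizer().fit_transform([q1, q2])

# Weighted word overlap for the pair, from 0 (disjoint) to 1 (identical).
print(cosine_similarity(X[0], X[1])[0, 0])
```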

A bumper sticker for bag of words model enthusiasts, drawing attention to the fact that bag of words models disregard both grammar and word order. Modified from a design by Othar Hansson.

Word2vec from the gensim package is commonly used for Python-implemented NLP tasks, and it enables training of models that encode words as vectors. Word2vec-based models can contain impressively deep information about the relationships between words. In an oft-cited example for a popular word2vec model, if vector(“Madrid”) represents the vector for Madrid, then vector(“Madrid”) – vector(“Spain”) + vector(“France”) is closest to vector(“Paris”), despite the model never being explicitly told about country/capital city relationships.
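That analogy is easy to reproduce with gensim, assuming the publicly released Google News vectors (a multi-gigabyte download) are on disk.

```python
from gensim.models import KeyedVectors

# 3 million words, 300 features each.
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# vector("Madrid") - vector("Spain") + vector("France") ~ vector("Paris")
print(w2v.most_similar(positive=["Madrid", "France"],
                       negative=["Spain"], topn=1))
```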

In addition to skip-thoughts, I also applied several word2vec-based methods. I trained a custom 300-feature skip-gram word2vec model on all words found in the training questions, which represented 80% of the total set of question pairs. Another approach was to use a word2vec model pre-trained on a large corpus of Google News articles, yielding a vocabulary of 3 million words, each encoded by a vector with 300 features. Questions were compared to one another using cosine similarity. A third word2vec method used the same pre-trained Google News model but compared the question pairs using Word Mover’s Distance, an extension of the earth mover’s distance problem. In each case I removed common words called stopwords, as in the sketch below.
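Here is a sketch of the Word Mover’s Distance comparison, reusing the `w2v` vectors loaded in the previous snippet; the stopword list is illustrative rather than the standard list I actually used.

```python
STOPWORDS = {"what", "is", "the", "a", "an", "of", "to", "do", "i", "how"}

def tokens(question):
    """Lowercase, strip the trailing '?', split, and drop stopwords."""
    words = question.lower().rstrip("?").split()
    return [w for w in words if w not in STOPWORDS]

# Lower distance means the questions are closer in meaning.
print(w2v.wmdistance(tokens("How do I learn to cook?"),
                     tokens("What is the best way to learn cooking?")))
```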

I tried several approaches over the course of the project, some of which came in slightly below the performance of skip-thoughts. The approaches had AUCs of 0.63, 0.69, 0.70, and 0.70 for tf-idf, the self-trained word2vec model, the pre-trained Google News word2vec model, and the pre-trained model with Word Mover’s Distance, respectively.

Acknowledgements

I would like to thank my fellow Insight Data Science fellows and coordinators for their invaluable feedback. I would especially like to thank Andrej Ficnar and Dan Vatterott for their useful suggestions.