Indeed was a proud sponsor of PyData Seattle 2017, an international conference for the Python community that promotes the use of open source data analysis tools such as Pandas, Matplotlib, IPython and Project Jupyter.
Indeed Data Scientists presented two tutorials at the conference:
- Using Pandas for analyzing structured time series data
- Using open source natural language processing (NLP) libraries for analyzing unstructured text
This post introduces their presentations and links to the videos and tutorials so you can try the exercises yourself.
Joe McCarthy illustrated how to use tools in the Pandas data analysis library to investigate unevenly spaced time series data in The Simpsons. This type of data analysis focuses on the intervals between events rather than on the frequency of events occurring within regularly spaced intervals. At Indeed, one example of such a task is estimating how long it takes a recruiter to review a resume (or profile), based on the gaps between timestamps of initial profile disposition events.
Video 1. D’oh! Unevenly spaced time series analysis of The Simpsons in Pandas
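As a rough illustration of the interval-based approach described above, here is a minimal Pandas sketch. It is not taken from Joe's tutorial: the `reviewer` and `timestamp` columns and all of the values are hypothetical, invented only to show how gaps between consecutive events can be computed and summarized.

```python
import pandas as pd

# Hypothetical event log (invented data): one row per disposition event.
events = pd.DataFrame({
    "reviewer": ["a", "a", "a", "b", "b"],
    "timestamp": pd.to_datetime([
        "2017-07-05 09:00:00",
        "2017-07-05 09:02:30",
        "2017-07-05 09:03:15",
        "2017-07-05 10:00:00",
        "2017-07-05 10:05:00",
    ]),
})

# Sort so that each reviewer's events are in chronological order.
events = events.sort_values(["reviewer", "timestamp"])

# Gap between each event and the same reviewer's previous event.
# The first event for each reviewer has no predecessor, so its gap is NaT.
events["gap"] = events.groupby("reviewer")["timestamp"].diff()

# Convert the gaps to seconds and take the median per reviewer.
gap_seconds = events["gap"].dt.total_seconds()
median_gap = gap_seconds.groupby(events["reviewer"]).median()

print(median_gap)
```

The median is used here instead of the mean because a single long break between events would otherwise dominate the estimate; that choice is an assumption of this sketch, not a claim about the method used in the tutorial.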
Alex Thomas demonstrated how to use open source NLP tools such as the Natural Language Toolkit (NLTK) and word_cloud for vocabulary analysis of job descriptions. His tutorial covered basic NLP techniques such as tokenization, stemming and lemmatization in the context of analyzing job descriptions posted on Indeed. Other techniques include the use of stop words, multi-word phrases (n-grams) and the TF-IDF statistic for estimating the relevance of documents.
Alex highlighted challenges in processing text and some interesting and often-unanticipated problems in interpreting the results of applying each of these techniques.
Video 2. Vocabulary Analysis of Job Descriptions
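To make a few of the techniques mentioned above concrete, here is a minimal sketch of tokenization, stop word removal, n-grams and TF-IDF. It deliberately uses only the Python standard library rather than NLTK, so it is not the code from Alex's tutorial. The job description snippets, the tiny stop word list and all function names are hypothetical, and the TF-IDF formula shown (term frequency times the log of inverse document frequency) is just one common variant.

```python
import math
import re
from collections import Counter

# Hypothetical job-description snippets, invented for illustration.
docs = [
    "Data scientist with Python and machine learning experience",
    "Software engineer with Python and Java experience",
    "Registered nurse with patient care experience",
]

# A tiny hand-picked stop word list; real lists are much longer.
STOP_WORDS = {"with", "and"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def ngrams(tokens, n=2):
    """Return the list of n-word phrases (bigrams by default)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(tokenized_docs):
    """Score each term in each document with tf * log(N / df)."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    scores = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


tokenized = [remove_stop_words(tokenize(d)) for d in docs]
bigrams = ngrams(tokenized[0])
scores = tf_idf(tokenized)

print(bigrams)
print(sorted(scores[0].items(), key=lambda kv: kv[1], reverse=True))
```

In this toy corpus, "experience" appears in every snippet and therefore scores zero, while terms unique to one snippet score highest. The same effect is why a word that is common across all job descriptions says little about any one of them.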