Methods

The Logic Behind Our Code

We decided to develop a mobile application in which a user can upload an image of a student ID and get back the directory information for that student. The application’s workflow has three main components. The first is image uploading in the mobile app: we used React Native to develop our application for iOS, along with a package called ‘react-native-camera’ that lets users take pictures from within the app. The image is then uploaded to our Flask server endpoint, which is the second main component of the workflow.

Flask is a microframework for Python that lets us set up a simple server to use as an endpoint for our image processing. As described above, the image taken by the user is sent to our Flask server. When this first endpoint is hit, we perform our image processing techniques on the image to extract the student’s name from the ID.
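
A minimal sketch of such an endpoint might look like this; the route name, the "photo" form field, and the extract_name() helper are illustrative stand-ins rather than our exact code:

```python
from flask import Flask, request, jsonify
from PIL import Image

app = Flask(__name__)

@app.route("/upload", methods=["POST"])  # illustrative route name
def upload():
    photo = request.files["photo"]        # image sent by the mobile app
    image = Image.open(photo.stream)      # load it for the OCR pipeline
    name = extract_name(image)            # hypothetical: the pipeline described below
    return jsonify({"name": name})        # JSON back to the app
```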

From this image provided by the user, our Python script uses a combination of Optical Character Recognition (OCR), thresholding, filtering, and text editing to return a clean string of the user’s name, which we use to scrape information from the go/directory page.

Though Pytesseract can fairly reliably return text from clean inputs, we wanted to use our image processing knowledge to also extract information from pictures of student IDs that are far too overexposed or underexposed. This makes our application more robust to the varying inputs users provide. The first step the Python script takes is to convert the imported image to grayscale and judge whether the picture as a whole is overexposed or underexposed. We do this by comparing the cumulative sum of the first (darker) half of the image’s histogram to that of the second (brighter) half: in theory, an underexposed image has more of its pixels in the first half, and an overexposed image has more in the second half.
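
A sketch of this exposure check, assuming the grayscale image is a NumPy array scaled to [0, 1]:

```python
import numpy as np

def is_underexposed(gray):
    # Intensity histogram of the grayscale image.
    hist, _ = np.histogram(gray, bins=256, range=(0.0, 1.0))
    dark = hist[:128].sum()     # cumulative count in the darker half
    bright = hist[128:].sum()   # cumulative count in the brighter half
    return dark > bright        # more dark pixels suggests underexposure
```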

From there, we run 18 different thresholds spanning the range from 0.0 to 1.0. However, the order in which we try the thresholds depends on whether the image is generally overexposed or underexposed, which lets our code reach a usable result more efficiently.
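
The ordering could be expressed like this; the exact 18 cutoff values are an assumption, chosen to step evenly by 0.05:

```python
import numpy as np

# 18 evenly spaced cutoffs from 0.05 up to 0.90 (assumed spacing).
THRESHOLDS = np.linspace(0.05, 0.90, 18)

def threshold_order(underexposed):
    # Underexposed images tend to succeed at low cutoffs and overexposed
    # ones at high cutoffs, so start the search at the more promising end.
    return THRESHOLDS if underexposed else THRESHOLDS[::-1]
```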

For example, in our thresholding process, if 0 represents black and 1 represents white, an image thresholded at 0.9 would be built by looking at each pixel: if the pixel is above 0.9, it becomes white (1), and if it is below 0.9, it becomes black (0). We complete this process for 18 different thresholds so we can observe trends across the images and choose the most legible picture from a fairly large sample size without taking up too much time.
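
A single thresholding pass is just this per-pixel comparison, sketched here on a NumPy grayscale array:

```python
def apply_threshold(gray, t):
    # Pixels brighter than the cutoff become white (1.0); the rest black (0.0).
    return (gray > t).astype(float)

# e.g. binary = apply_threshold(gray, 0.9)
```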

For every thresholded image, we then apply a median filter to make the image compatible with Pytesseract. Applying a median filter means that each pixel is replaced by the median value of the small block of pixels surrounding it. In our scenario, because every pixel of a thresholded image is either black (0) or white (1), this process removes stray outlier pixels that could skew our Pytesseract results. We then run each thresholded image’s OCR output through our validID function, which checks for a legitimate 8-digit ID number and possible names above the ID number. If an OCR output passes this validID test, we return the extracted name and the aforementioned 8-digit number.
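
One pass of this stage might be sketched as follows, using SciPy’s median filter; valid_id() here is a simplified stand-in for our validID function (it only checks the 8-digit number, not the name lines above it):

```python
import re
from scipy.ndimage import median_filter
import pytesseract
from PIL import Image

def ocr_pass(binary):
    cleaned = median_filter(binary, size=3)   # replace each pixel with its
                                              # neighborhood median
    img = Image.fromarray((cleaned * 255).astype("uint8"))
    return pytesseract.image_to_string(img)   # OCR the filtered image

def valid_id(text):
    # Accept the output only if it contains a legitimate 8-digit ID number.
    return re.search(r"\b\d{8}\b", text) is not None
```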

In our case, thresholding and median filtering could recover results from overexposed images at the higher thresholds (0.75 to 0.9), while underexposed images returned results at the lower thresholds (0.05 to 0.2). Adequately lit images returned legitimate results from thresholds in the middle range of about 0.4 to 0.7, but because Pytesseract works well with well-lit images, it could read those images regardless of whether we applied thresholding and filtering techniques.

In the second step of our Python framework, we simply clean the outputted string. We remove double line-breaks and extra spaces to eliminate unnecessary whitespace between words. We also decided to remove odd occurrences of punctuation and extraneous words such as “Middlebury”, “STUDENT”, and “College”, because oftentimes those words would show up as possible student names. We then split every individual word onto its own line and, from then on, only look at the portion of the OCR string before the ID number to find the student’s name.
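
A sketch of this cleanup step; the stop-word list comes from above, while the exact regexes are illustrative:

```python
import re

# Words that frequently appear on the card and get mistaken for names.
STOP_WORDS = {"Middlebury", "STUDENT", "College"}

def clean_output(text):
    text = re.sub(r"\n{2,}", "\n", text)              # collapse double line-breaks
    text = re.sub(r" {2,}", " ", text)                # collapse repeated spaces
    words = [w.strip(".,:;|") for w in text.split()]  # drop stray punctuation
    words = [w for w in words if w and w not in STOP_WORDS]
    return "\n".join(words)                           # one word per line
```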

Our third step loops over this new string of OCR output that precedes the ID number. It looks for two lines of legitimate text, typically capitalized words, to ascertain whether each line is in fact part of the student’s name. After we have found those two lines, we join the student’s first and last names into a single variable separated by a space, again removing odd punctuation and numbers.
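
A simplified sketch of that loop, assuming the cleaned output has one word per line:

```python
import re

def find_name(lines):
    # Keep lines that look like single capitalized words (possible name parts).
    parts = [ln for ln in lines if re.fullmatch(r"[A-Z][A-Za-z'\-]+", ln)]
    if len(parts) >= 2:
        # Join the first two matches into a single "First Last" string.
        return parts[0] + " " + parts[1]
    return None
```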

After the student’s name is extracted from the image, we pass the result to the other route endpoint in our Flask server. This GET route passes the student’s name into our Python script that handles scraping student data from the Middlebury directory: the name goes in, and the student’s email and campus address come back. The script utilizes the Selenium Python package to perform the HTML scraping. Selenium is an open-source package that simulates clicking and typing in a web browser, and its automated browsing lets us input the student’s name into the directory and scrape the results. Finally, after all the image processing and browser scraping has been executed on our server, the mobile application gets back a JSON response with the directory information, which is then displayed to the user.
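
A rough sketch of the Selenium lookup; the directory URL, field name, and CSS selector here are hypothetical placeholders, not the real go/directory markup:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

def lookup_student(name):
    driver = webdriver.Chrome()   # assumes a local ChromeDriver install
    try:
        driver.get("https://www.middlebury.edu/directory")  # i.e. go/directory
        search = driver.find_element(By.NAME, "q")   # hypothetical field name
        search.send_keys(name)                       # type the student's name
        search.submit()                              # run the directory search
        # Hypothetical selector for the first result's contact details.
        return driver.find_element(By.CSS_SELECTOR, ".result").text
    finally:
        driver.quit()
```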

Click the button below to see our results.
