ENGR 101 | University of Michigan

Project 3: Spaceport Reviews

A picture of a space shuttle with various conflicting reviews of flying with the company, such as FREE SNACKS and THE FOOD MADE ME VOMIT.

Project Checkpoint and Project Partnership Registration Deadline: Tuesday, November 5, 2024
Project Due: Tuesday, November 12, 2024
Late Submissions Due: Tuesday, November 19, 2024

Engineering intersections: Data Science/Industrial & Operations Engineering/Computer Science

Implementing this project provides an opportunity to use C++ file I/O, functions, strings, and vectors.

The autograded portion of the final submission is worth 100 points, and the style and comments grade is worth 10 points, for a total of 110 points.

You may work alone or with a partner. Please see the syllabus for partnership rules.

Educational Objectives and Resources

This project has two aspects. First, it is an educational tool within the context of ENGR 101, designed to help you learn certain programming skills. Second, it serves as an example of a project that you might be given at an internship or job in the future; therefore, we have structured the project description in a way that is similar to the kind of document you might be given in your future professional capacity.

Educational Objectives

The purpose of this project is to help you strengthen your skill at working with a large set of textual data, including writing and calling helper functions to help process the data. The C++ programming language is very good at processing text data quickly and efficiently, so it is a good choice for this type of work.

This project uses a simple type of Natural Language Processing (NLP). A short introduction to NLP, including the approach used in this project, is in the next section.

Engineering Concepts - Natural Language Processing (NLP)

Natural Language Processing (NLP) is a field at the intersection of computer science and linguistics that designs computer programs that can interpret, categorize, and act on human languages. Human languages are incredibly complex and flexible, and designing a computer program that can understand and respond to even a subset of a single language is notoriously difficult. However, researchers have continually pushed forward in this field, and now NLP has wide-ranging applications including translation between different human languages, virtual assistants, and automated feedback on writing assignments in large courses.

One of the most basic forms of NLP is to search through a body of text looking for keywords, and condition a program’s behavior on the presence or frequency of the keywords. Keywords may also have a numeric weight, which determines how significant they are. For example, say you are interested in determining the type of pet a person has, based on a written description. So, you ask a bunch of people to describe their pets. Some people tell you what kind of pet they have (e.g. duck, rabbit, horse, cat, dog, snake, etc.), but some don’t. Some of them give you sort of a “review” of their pet, and so you need to figure out what kind of pet they have based on a set of keywords.

Let’s say you want to determine whether someone has a duck for a pet. You could analyze keywords of pet descriptions that you have verified are about ducks and not-ducks, and then assign a weight to each keyword based on how strongly it is associated with ducks as pets (Table 1). Positive values are associated with ducks; negative values are associated with not-ducks. The larger the values (regardless of positive/negative), the stronger the association.

You could then read in another pet’s description and find occurrences of the keywords in the list. If the keyword occurs, add its weight to a total running “score” for the pet description. If the total score is positive, it is categorized as a duck. If the total score is negative, it is categorized as a not-duck. See Table 2 for a few examples of this.

Table 1. Example of keywords and weights for describing ducks as pets.

Keyword Weight Keyword Weight
beak 1.2 large -2.5
egg 1.5 legs 0.5
eyes 0.3 loud -0.1
feather 0.7 small 0.6
fins -2.5 tail 0.4
fluffy -0.3 teeth -0.5
fur -1.0 water 1.4
gills -3.2 wings 4.2

Table 2. Example analysis of pet descriptions using the set of keywords and weights for describing a duck. (In case you’re wondering, Pet #3 is a Great Malay Argus.)

Pet #1
Description
I have to make sure to clean her tank every day, so the water doesn't get dirty. I love to watch her gills move as she breathes and her tail and fins move as she swims. Sometimes, though, I wish she had legs so I could take her for a walk.
Keywords Found Weights
water 1.4
gills -3.2
tail 0.4
fins -2.5
legs 0.5
Total Score -3.4
Categorization Not Duck
Pet #2
Description
There are fluffy feathers everywhere all the time! He chipped his beak on a large rock the other day. I think I startled him, because his wings flew wide, and then he lost his balance and that's when he chipped his beak.
Keywords Found Weights
fluffy -0.3
feathers 0.7
beak 1.2
large -2.5
wings 4.2
beak 1.2
Total Score 4.5
Categorization Duck
Pet #3
Description
This bird is insane. Why did no one tell me it would be so LOUD?? Also, it grew enormous. There should be a warning when you buy it that this is a large bird. Not some cute small fluffy thing you can hold on your finger. :(
Keywords Found Weights
loud -0.1
large -2.5
small 0.6
fluffy -0.3
Total Score -2.3
Categorization Not Duck

As shown here, the choice of keywords and weights is crucial. Therefore, you must make sure that your keywords and weights come from a verified set of data and that they have been analyzed properly. (For more background on this, see: Rozado, David. “Wide Range Screening of Algorithmic Bias in Word Embedding Models Using Large Sentiment Lexicons Reveals Underreported Bias Types.” PLoS ONE, vol. 15, no. 4, Public Library of Science, 21 Apr. 2020, p. e0231189, doi:10.1371/journal.pone.0231189)

Note: An NLP program that categorizes things based on a keyword search is only as good as its ability to correctly predict things based on those keywords. The percentage of the predictions that are correct is one simple measure of how well an NLP program is working. A perfect NLP program would get 100% of its predictions correct. An NLP program that is no better than random chance would get 50% of its predictions correct.

Project Roadmap

This is a big picture view of how to complete this project. Most of the pieces listed here also have a corresponding section later in this document that provides more detail.

Suggested Project Timeline

Below is a suggested project timeline that you may use. You can adjust the dates around any commitments or classwork that you have going on in other courses.

Date: What to have done by this day

Friday, November 1
  • Have detailed notes on the project specs
  • Know how to use the data sources/files
  • Project folder is set up on your computer and has all data sources/files, starter code, and unit test scripts
  • Have a list of code examples ready for you to use as templates for tasks in this project. Look at the "Common Patterns" from homework assignments, homework exercises, lab exercises, and lecture examples.

Tuesday, November 5
  • Project Checkpoint Completed: readReview, readKeywordWeights, wordWeight, and reviewScore functions are written, tested, and debugged by today (includes submitting to the autograder)
  • Project Partnership Registered on Autograder by today

Friday, November 8
  • The evaluateReviews.cpp driver program can read in keywords and weights, read in a review, and evaluate the review's score.
  • Have a plan for how to expand this process to handle multiple reviews, keeping track of the review with the highest score and the review with the lowest score.

Monday, November 11
  • The evaluateReviews.cpp driver program is written, tested, and debugged (including submitting to the autograder)
  • At least one submission to the autograder includes all required files and passes all test cases
  • Code has been double-checked for quality: style and commenting

Tuesday, November 12
  • Project due!
  • Verify that everything has been submitted to the autograder correctly.

Things to Know Before You Get Started

Tips and Tricks

Here are some tips to (hopefully) reduce your frustration on this project:

Writing to a File in a Loop

This project requires you to write to a file from within a loop. Sometimes, during the course of development, an infinite loop may slip through your careful debugging and cause a file to grow significantly larger than you would want. Here are a few hints to detect if this is happening:

If your program takes longer than 2 seconds to run, type CTRL+C to cancel the running program.

If you are on your own computer, check how big the report.txt file is. If it is more than, say, 30 MB, then you have an infinite loop. Delete report.txt.

If you are on a CAEN machine, navigate into the directory where your Project 3 files are located, type ls -lh at the Linux terminal, and check the size of the files, like this:

bash-4.2$ ls -lh
total 1.1G
...
-rwxr-xr-x. 1 your_uniqname users 9.4K Nov 18 18:11 evaluateReviews.cpp
-rw-r--r--. 1 your_uniqname users  223 Nov 15 09:22 readKeywordWeights.cpp
-rw-r--r--. 1 your_uniqname users  356 Nov 16 12:56 readReview.cpp
-rw-r--r--. 1 your_uniqname users 1.0G Nov 18 18:20 report.txt
-rw-r--r--. 1 your_uniqname users  447 Nov 16 15:04 reviewScore.cpp
...

The report.txt file listed above has a file size of 1.0 GB (yikes!) and should be deleted. To delete the file, use the rm command:

bash-4.2$ rm report.txt

Passing Filestreams to Functions

One of the required functions needs a filestream passed to it, and you may potentially write a helper function or two that also requires a filestream to be passed in. Remember that filestreams are linked to a specific file; therefore you need to pass the filestreams by reference (not pass by value) so that the function has access to the file that was already opened by your program.

Undefined Values for Variables

Using a variable that has been declared, but does not yet have a value assigned to it, can cause unexpected behavior from your program. Similarly, if you write a function with a non-void return type (e.g. int, double, or bool) and you return a variable that does not have a value, or you forget the return statement entirely, then you can also see unexpected behavior. Some compilers will automatically assign a zero to some types of data, but others will not; therefore, you should always initialize your variables. If your program outputs big weird numbers on the autograder, the likely cause is an uninitialized variable.

Finding the Minimum/Maximum of a Set of Data

When searching for, or keeping track of, the maximum/minimum of a set of data (such as the scores of the hotel reviews), you are generally comparing the current value to whatever is the “current highest” or the “current lowest” value. However, for the first value, you don’t yet have anything to compare it to. A common best practice is to initialize the “current highest” and/or “current lowest” value to be the first value in your dataset. This way, no matter what the actual values in your dataset are, you will always be able to find the maximum or minimum value.

Additional Helper Functions

There are four required helper functions for this project that you have to write plus two helper functions that are provided to you; however, you can (and should!) look to abstract other chunks of code into helper functions. Helper functions make your code easier to read and understand and easier to debug. Some ideas for other helper functions are functions that would:

It is up to you how you want to abstract portions of your code; there is no “right” answer and no “wrong” answer. The only reason we’re requiring a few specific helper functions for this project is to enforce practice with abstraction and function writing. In general, you can design your functions however you like!

Pass by Value vs. Pass by Reference

When writing your own helper functions, always consider whether to pass parameters by value or by reference (including const reference). If you are unsure, refer back to the Homework assignment that had the decision tree about pass by value vs. pass by reference.

Submission and Grading

This project has three deliverables: reviews.h, reviews.cpp, and evaluateReviews.cpp. See the Deliverables section for more details.

After the due date, the project will be graded in two parts. First, the Autograder will evaluate your submission for a maximum of 100 points. Second, one of our graders will evaluate your submission for style and commenting, for a maximum of 10 points. Thus, the maximum total number of points on this project is 110 points.

You should still submit reviews.h, even though you aren’t supposed to modify it - the autograder will double check that the file is the same to verify you haven’t accidentally changed it, since this could mess up the rest of your code.

Submitting Prior to the Project Deadline

Submit the .cpp files and .h file to the Autograder for grading. You do not have to wait until you have all of the files ready before you submit to the Autograder for the first time. In fact, we recommend that you submit files to the autograder for feedback as you complete each task. You can submit a subset of files and get feedback on the test cases related to those files. However, to receive full credit, you need to submit all files and pass all test cases within a single submission.

The autograder will run a set of public tests - these are the same as the test scripts provided in this project specification. It will give you your score on these tests, as well as feedback on their output.

The autograder also runs a set of hidden tests. These are additional tests of your code (e.g. special cases). You still see your score on these tests, and you will receive some feedback on any cases that your code does not pass.

You are limited to 5 submissions on the Autograder per day. After the 5th submission, you are still able to submit, but all feedback will be hidden other than confirmation that your code was submitted. The autograder will only report a subset of the tests it runs. It is up to you to develop tests to find scenarios where your code might not produce the correct results.

You will receive a score for the autograded portion equal to the score of your best submission.

Your latest submission with the best score will be the code that is style graded. For example, let’s assume that you turned in the following:

Submission # 1 2 3 4
Score 50 100 100 75

Then your grade for the autograded portion would be 100 points (the best score of all submissions) and Submission #3 would be style graded since it is the latest submission with the best score.

Please refer to the syllabus for more information regarding partner groups and general information about the Autograder.

Here is the breakdown of points for style and commenting (max score of 10 points):

Submitting after the project deadline

If you need to submit your project work after the deadline, you can submit to the “Late Submission” project assignment on the Autograder. The late submission assignment includes all of the same test cases as the original assignment, but the points have been adjusted down a small amount per the syllabus’ flexible deadline policy.

Your project score at the end of the semester will be whichever is the higher score between the original project assignment and the late submission assignment. You will never be penalized for submitting to the late submission version of the project.

Autograder Details

Test Case Descriptions and Error Messages

Each of the test cases on the Autograder is testing for specific things about your code. Often, it’s checking to see if your programs can handle “special cases” of data, such as: a different number of reviews, a different set of keyword weights, the reviews are all truthful, the reviews are all deceptive, etc.

Each of the test cases on the Autograder has a description of what it’s checking for. If your program fails a test case, the Autograder will sometimes be able to give you some advice on how to go about debugging your code. It’s like you have a friendly GSI or IA giving you immediate help!

Checking If report.txt Is Correct

Within the Autograder tests, you can click to expand the “Check report.txt” test and see a window that compares the expected output for the file (on the left) to your output file (on the right). The Autograder is very particular about having the output match perfectly. This means every character, space, and newline must match, so beyond the score values and category, be sure each word in your report.txt is spelled correctly and that the spacing and line breaks match. Lines that don’t correspond with the expected output will be highlighted, and you can click “Show Whitespace” to see the right spacing and new line placement.

Project Overview

The purpose of this project is to give you a chance to practice file input and output (file I/O) with file streams and to practice writing your own functions. It also gives more practice using strings and vectors.

Background and Motivation

Proxima b’s new spaceport has been up and running for 6 months. The company that owns the spaceport wants to analyze the online reviews of the spaceport so that they can improve their customers’ experiences. However, they suspect that only some of the reviews are from actual customers, and that the others are fake! Unfortunately, it is difficult for humans to distinguish between truthful and deceptive reviews – studies show success rates of approximately 50-60%, which is not much better than guessing. It would also be very time consuming to examine each review by hand.

Instead, the company wants to use an automated text-classification algorithm based on machine learning and natural language processing techniques. A challenge with any machine learning approach is finding high-quality training data. Training data is data that has been independently analyzed and verified, so you can use the training data to check whether a new program is working correctly or not.

Fortunately, a dataset of carefully verified truthful and deceptive reviews is available, thanks to an ancient study from 2011 EY (Earth Years) that investigated reviews of Chicago hotels. Based on the data available in this study, data scientists working for the company have developed a set of keywords that indicate either truthfulness or deception.

The Earthlings who originally studied the hotel reviews also developed a website where people could put in a review and the review would be evaluated for truthfulness or deceptiveness. Alas, this site is no longer accessible, but old news articles about the discovery have been recovered, one of which is available here. This article gives a high-level idea of what the company wants to do.

Your Job

Your job is to write a “proof of concept” program that evaluates reviews for truthfulness or deception. This “proof of concept” program will use the hotel reviews (not the spaceport reviews) because you want to work with the training data first in order to verify that your program works correctly. Your program should evaluate each of the hotel reviews and categorize them as truthful or deceptive using an NLP Keyword Search. The program should then print out a summary report of the reviews and identify the review with the highest score (the “most truthful review of the dataset”) and identify the review with the lowest score (the “most deceptive review of the dataset”).

The next step in this process would be to apply the program you write to evaluate the spaceport reviews. But that is a hypothetical scenario which we will NOT be doing for this project. You are ONLY looking at hotel reviews.

Data Sources/Files

These sections describe the different sets of data you have available for implementing and testing your programs for this project. Make sure you understand the different formats and layouts of the data in the data files before you start to work with the data itself.

Keywords and Weights File

The keywordWeights.txt file (click filename to download) contains a list of keywords and weights that resulted from analyzing an initial set of verified hotel reviews as training data. Each keyword will be on its own line, and there will be no multi-word phrases (e.g. “good” and “view” instead of “good view”). Here is a schematic of what the keywordWeights.txt file looks like:

<keyword 1> <weight 1>
<keyword 2> <weight 2>
<keyword 3> <weight 3>
<keyword 4> <weight 4>
<keyword 5> <weight 5>
<...> 

Read in all the keywords and weights in the file, no matter how many keywords there are, and store them in parallel vectors of string variables and double variables.

Hotel Review Files

The reviewFiles.zip file (click filename to download) contains a set of 20 hotel reviews for you to use in testing your program. The hotel reviews all have a similar naming convention of review00.txt, review01.txt, etc. Reviews 0-9 are actually truthful; Reviews 10-19 are actually deceptive. (However, the algorithm you implement doesn’t categorize them all perfectly correctly, as you will see.)

Make sure you actually unzip the reviewFiles.zip file to get to the individual review files. If you are on a Mac, double-click the .zip file to automatically “unzip” the file and make a folder with a bunch of .txt files in it; move the files to your Project 3 folder. On Windows, double-clicking the .zip file will often let you see the files without actually unzipping them, which means your programs can’t access the files yet. Instead, right-click on the file and select “Extract All” to unzip the file and get access to the .txt files. If you have trouble with this, please come to office hours!

Each hotel review is stored in its own file, e.g. review18.txt. The text is in “paragraph” form; here is a schematic of what a hotel review file looks like:

<word 1> <word 2> <word 3> <word 4> <word 5> 
<word 6> <word 7> <word 8> <word 9> <word 10> <word 11> 
<word 12> <word 13> <word 14> <...> 

To work with a hotel review, open the file, read in each word, and store the words in a vector of string variables.

On the Autograder, you are guaranteed that there will always be at least one review and that there will be no gaps in the numbering of the hotel reviews. There will be no more than 100 hotel reviews.

Deliverables

This project has three deliverables:

File Description
reviews.h a C++ header file for the reviews.cpp library
(This file is already provided to you, but it is part of this program's set of files, so it is considered a deliverable.)
reviews.cpp a C++ file that contains all of the required helper functions used in the project
evaluateReviews.cpp a C++ file that contains your main() function and is the driver program file. Any helper functions that you write in addition to the required ones should be correctly declared and defined in the evaluateReviews.cpp file.

Starter Code and Test Programs

The starter code files contain some code that is already written for you, and you will write the rest of the code needed to implement the tasks described in the Project Task Description section. The starter code files have _starter appended to the file’s name so that if you want to download a fresh copy of the starter code, you won’t accidentally overwrite an existing version of the file that may have some code you have written in it. Remember to remove the _starter part of the filename so that your programs will run correctly!

Click the filenames to download the starter code:

The Reviews library has several functions that you will be using when you write your driver program in evaluateReviews.cpp. You should carefully test the functions you write in reviews.cpp before trying to use them in evaluateReviews.cpp. Here is a driver program that contains some basic tests for some of the functions in reviews.cpp (click the filename to download):

You can (and should!) add additional tests to the unit_tests.cpp program to thoroughly understand and test all of the functions in the Reviews library. See the Testing Your Functions Section for more information about how to use the unit_tests.cpp file.

Once you have verified that all of the functions in the Reviews library work, you can start working on the driver program, evaluateReviews.cpp. The driver program creates a summary report saved as report.txt. Here is a sample report that shows what your report.txt file should look like if your program works correctly (click the filename to download):

See the Test Case for Evaluating Reviews for more information about how to use the sample_report.txt file.

Compiling and Running the Program

To compile the program, use the following compile command:

g++ -std=c++11 -Wall -pedantic evaluateReviews.cpp reviews.cpp -o evaluateReviews

Notes about compiling this program:

Once compiled, the program can be run from the command line with:

./evaluateReviews

Project Task Description

There are two primary tasks for your program:

  1. Create a library of helper functions to assist with processing the hotel reviews
  2. Evaluate hotel reviews as truthful, deceptive, or uncategorized, and create a summary report of the analysis

These tasks are described in more detail in the next sections.

Task 1: Reviews Library

The first task in this project is to implement a library of several helper functions that support working with and processing the hotel reviews. A C++ library needs two files: a .h file that contains the interface for the library, and a .cpp file that contains the implementation of the library’s functions. Click the filenames to download the starter files for your Reviews library:

Don’t forget to remove _starter from the filename before you try to compile with this file!

Functions in the Reviews Library

There are two functions in the Reviews library that are written for you: makeReviewFilename and preprocessReview. A brief description is here, and more details can be found in the comments in the code:

You are responsible for writing the implementations of the remaining functions in reviews.cpp. There are four such functions, listed briefly here and described in more detail in the following sections:

These four functions are described below, and additional information is included as comments in the reviews.cpp starter file. Read these comments and consider them part of the project specification. You will also conduct unit tests on the individual helper functions so you can ensure they are working correctly.

The readKeywordWeights Function

This function reads in the keywords and their numerical weights from an input stream. The function stores each keyword in a vector of string variables and stores the corresponding weight in a vector of double variables, thereby creating two parallel vectors for the keywords and their weights.

Description

// Reads in keywords and corresponding weights from an input stream and stores them into
// the 'keywords' and 'weights' vectors in the same order as they appear in the file.
// NOTE: The keywords in the file have already been preprocessed (e.g. to remove punctuation),
//       so you do not have to do that here.
// PARAMETERS:
//   input - An input stream from which keywords and weights are read. For this project, we
//           assume the input stream is a file input stream, where the file format is that
//           provided in the project specification.
//   keywords - An "output parameter", passed by reference, into which the keywords are stored.
//   weights - An "output parameter", passed by reference, into which the weights are stored.
void readKeywordWeights(istream &input, vector<string> &keywords, vector<double> &weights) {
    // TODO: Write an implementation for this function!
}

Algorithm

This function is passed a filestream connected to a text file, an empty vector of string variables, and an empty vector of double variables. The vector of strings gets “filled up” by the words in the text file, and the vector of doubles gets “filled up” by the numbers in the text file.

The Homework assignments include two common patterns that are particularly applicable for this function:

Review the examples in the Homework for these patterns and consider how to adapt the examples for what you need to do here.

The readReview Function

This is a function that will read in words from a review text file and store each word in the text file to a vector of string variables.

Description

// Reads in a review from an input stream and stores each individual word from the review
// into the vector 'reviewWords', in the same order they appeared in the input.
// PARAMETERS:
//   input - An input stream from which the review is read. For this project, we assume the
//           input stream is a file input stream, with words separated by whitespace.
//   reviewWords - An "output parameter", passed by reference, into which the review words are stored.
void readReview(istream &input, vector<string> &reviewWords) {
    // TODO: Write an implementation for this function!
}

Algorithm

This function is passed a filestream connected to a text file and an empty vector of string variables. The vector of strings gets “filled up” by the words in the text file.

The Homework assignments include two common patterns that are particularly applicable for this function:

Review the examples in the Homework for these patterns and consider how to adapt the examples for what you need to do here.

The wordWeight Function

This function determines the weight of a word in one of the hotel reviews. If the word matches one of the keywords, then the function returns the value of the keyword’s weight; otherwise, the function returns 0.0.

Description

// Returns the weight of a given word by looking it up in the provided vectors.
// The keywords and their corresponding weights are provided as vector parameters.
// It is assumed that these are parallel vectors, so that weights[i] is the weight of keywords[i].
// If a word does not appear in the keywords vector, its weight is zero.
// PARAMETERS:
//   word - The word to be looked up
//   keywords - A vector containing all keywords.
//   weights - A vector containing weights corresponding to each keyword.
double wordWeight(const string &word, const vector<string> &keywords, const vector<double> &weights) {
    // TODO: Write an implementation for this function!
}

Algorithm

This is a function that is passed one word from the vector that represents a hotel review. The function is also passed the parallel vectors that contain the keywords and their corresponding weights.

The function checks the review’s word against each of the keywords to see if they match. If a match is found, the function returns the weight of the keyword that matches the review word. If no match is found, the function should return 0.0 for the word’s weight.

The Homework assignments include two common patterns that are particularly applicable for this function:

Review the examples in the Homework for these patterns and consider how to adapt the examples for what you need to do here.

The reviewScore Function

This function calculates the score of a review. The review’s score is the sum of the weights of all the words in the review, so the wordWeight function will be helpful here!

Description

// Computes and returns the overall score for a review. This is the sum of the weights of
// the individual words in the review. Note that a word may appear more than once in the review,
// and if this happens its weight is added in multiple times as well. The keywords and their
// corresponding weights are provided as vector parameters. It is assumed that these are parallel
// vectors, so that weights[i] is the weight of keywords[i]. If a word does not appear in the
// keywords vector, its weight is zero.
// HINT: Make a copy of the reviewWords vector using a separate variable. Then, call the
//       preprocessReview() function on the copy. Having a preprocessed copy of the words
//       will allow you to compare against the keywords.
// PARAMETERS:
//   reviewWords - A vector containing the individual words in the review.
//   keywords - A vector containing all keywords.
//   weights - A vector containing weights corresponding to each keyword.
double reviewScore(const vector<string> &reviewWords, const vector<string> &keywords, const vector<double> &weights) {
    // TODO: Write an implementation for this function!
}

Algorithm

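This function is passed three parameters: the vector of words in the review, the vector of keywords, and the vector of keyword weights. It computes the score as follows: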

  1. Make a copy of the review to work with in this function, so that you keep the original version to potentially work on later.
  2. Preprocess the copy of the review to standardize the text and make it easier to search for keywords. Important! Call the preprocessReview helper function already written for you to do the preprocessing. Look at the description of preprocessReview to understand how to call it and what it does.
  3. Iterate through each word of the preprocessed review, get its weight using the wordWeight function, and add the word’s weight to a running total score.
  4. After all words are processed, the function returns the total score.

For an example of the process of scoring the review, refer to the keyword search process described earlier. Note that in this case, words which are not identified as keywords will simply have a weight of 0.0 returned from wordWeight, so that it is safe to add them in without affecting the overall score.

The Homework assignments include a common pattern that is particularly applicable for this function:

Review the example in the Homework Assignment for this pattern and consider how to adapt the example for what you need to do here.
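
As a minimal sketch of a running-total loop (not the reviewScore implementation itself), the example below adds up a value for every word in a vector. Both wordValue and totalValue are hypothetical names made up for this illustration; wordValue simply stands in for any helper that returns a number for a single word.

```cpp
#include <string>
#include <vector>

using namespace std;

// Hypothetical helper: the "value" of a single word (here, simply its length).
double wordValue(const string &word) {
    return static_cast<double>(word.size());
}

// Adds up the value of every word. A word that appears more than once is
// counted each time it appears.
double totalValue(const vector<string> &words) {
    double total = 0.0;  // the running total starts at zero
    for (size_t i = 0; i < words.size(); ++i) {
        total += wordValue(words[i]);
    }
    return total;
}
```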

Testing Your Functions

When working with complex programs made up of several different functions, it’s important to be able to test each function individually to make sure it is working correctly on its own. This is called unit testing. A strategy for implementing a set of unit tests is to write a separate main function (in a different file) that intentionally calls each function one at a time on a variety of inputs and confirms that the outputs from the functions match the expected correct answer.

A few sample unit tests can be found in the unit_tests.cpp file provided with the project. The comments included in that file describe the way the unit testing process works. We highly encourage you to use these samples as a starting point and write additional unit tests of your own.
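
One possible shape for an additional test is sketched below. This is an illustration only: the provided unit_tests.cpp may be organized differently, and the names average and checkClose are hypothetical. The helper compares a function's actual result against the expected answer with a small tolerance, since double values should not usually be compared with ==.

```cpp
#include <cmath>
#include <iostream>
#include <string>

using namespace std;

// Hypothetical stand-in for a function under test.
double average(double a, double b) {
    return (a + b) / 2.0;
}

// Prints PASS or FAIL for one test and returns whether it passed.
// The result passes if it is within a small tolerance of the expected value.
bool checkClose(const string &testName, double actual, double expected) {
    bool passed = fabs(actual - expected) < 1e-9;
    cout << (passed ? "PASS: " : "FAIL: ") << testName << endl;
    return passed;
}
```

A test main function could then contain calls such as checkClose("average of 1 and 2", average(1.0, 2.0), 1.5); with one call per input you want to check.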

In order to compile and run the unit tests, use the following commands:

g++ -std=c++11 -Wall -pedantic unit_tests.cpp reviews.cpp -o unit_tests

./unit_tests

Note the difference from the compilation command for the regular program. We’ve basically kept all the review functions from reviews.cpp, but we’ve swapped in unit_tests.cpp for evaluateReviews.cpp, which means the main function containing the tests will be used instead.

You do not need to turn in the unit_tests.cpp file to the autograder.

Task 2: Driver Program

The second task in this project is to write a driver program that will evaluate the hotel reviews as truthful, deceptive, or uncategorized, and create a summary report of the analysis. The driver program will use the functions in the Reviews library to:

Description of evaluateReviews.cpp

The driver program is written in the evaluateReviews_starter.cpp file. Don’t forget to remove _starter from the filename before you try to compile with this file!

// Add any #includes for C++ libraries here.
// We have already included iostream as an example.
#include <iostream>

// The #include adds all the function declarations (a.k.a. prototypes) from the
// reviews.h file, which means the compiler knows about them when it is compiling
// the main function below (e.g. it can verify the parameter types and return types
// of the function declarations match the way those functions are used in main() ).
// However, the #include does not add the actual code for the functions, which is
// in reviews.cpp. This means you need to compile with a g++ command including both
// .cpp source files. For this project, we will be using some features from C++11,
// which requires an additional flag. Compile with this command:
//     g++ --std=c++11 evaluateReviews.cpp reviews.cpp -o evaluateReviews
#include "reviews.h"

using namespace std;

const double SCORE_LIMIT_TRUTHFUL = 3;
const double SCORE_LIMIT_DECEPTIVE = -3;


int main(){


    // TODO: implement the main program

}

Algorithm

This is the general algorithm for the driver program (the main function in evaluateReviews.cpp):

  1. Open a file input stream for the keywordWeights.txt file.
    1. If the file cannot be opened, print "Error: keywordWeights.txt could not be opened." to cout
    2. Use return 1; to exit the main function (recall that a nonzero return value from main reports an error).
  2. If the keyword weights file was opened, read the keywords and their weights into parallel vectors. (Which Reviews library function would be helpful here?)
  3. For each hotel review,
    1. Create the filename (e.g. review00.txt) (Which Reviews library function would be helpful here?)
    2. Open a filestream to the file
    3. Read each word of the review into a vector of string variables (Which Reviews library function would be helpful here?)
    4. Calculate the review’s score (Which Reviews library function would be helpful here?)
    5. Determine the review’s category:
      • truthful: score > 3.0
      • deceptive: score < -3.0
      • uncategorized: otherwise
    6. Track the review with the highest score and the review with the lowest score
  4. Write out a summary of the truthfulness and deceptiveness of the reviews to a file named report.txt
  5. Print "Program complete. Check report.txt file for summary." to cout to indicate that the program has finished. Do not print the contents of report.txt.
    • Note: There should be two spaces after Program complete. not just one. Your browser may “collapse” these two spaces into one space when you view these specs.
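
The category rule in step 3.5 can be sketched as a small helper. The function name categorize is hypothetical (you are not required to write such a function); the two constants are the ones from the starter file.

```cpp
#include <string>

using namespace std;

const double SCORE_LIMIT_TRUTHFUL = 3;
const double SCORE_LIMIT_DECEPTIVE = -3;

// Hypothetical helper: returns the category name for a review score.
// A score exactly equal to either limit is uncategorized.
string categorize(double score) {
    if (score > SCORE_LIMIT_TRUTHFUL) {
        return "truthful";
    }
    if (score < SCORE_LIMIT_DECEPTIVE) {
        return "deceptive";
    }
    return "uncategorized";
}
```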

You know that you have processed all the reviews once you try to open a file input stream for the next review file and it does not open successfully (because the file doesn’t exist!). You don’t need to print an error message in this case - simply have your program stop trying to read more reviews.

You are guaranteed that there will always be at least one review and that there will be no gaps in the numbering of the hotel reviews. There will be no more than 100 hotel reviews.
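
A minimal sketch of the "stop at the first file that does not open" loop is shown below. It assumes files are numbered with two digits and no gaps, as described above. The names demoFileName and countNumberedFiles are hypothetical; demoFileName is only a stand-in for the Reviews library function that builds filenames, and a real driver would read and score each review where the comment indicates.

```cpp
#include <fstream>
#include <string>

using namespace std;

// Hypothetical stand-in for the library's filename function:
// demoFileName("review", 7) gives "review07.txt".
string demoFileName(const string &prefix, int n) {
    return prefix + (n < 10 ? "0" : "") + to_string(n) + ".txt";
}

// Counts consecutively numbered files, stopping at the first one that
// cannot be opened. At most 100 files are tried.
int countNumberedFiles(const string &prefix) {
    const int MAX_FILES = 100;
    int count = 0;
    for (int n = 0; n < MAX_FILES; ++n) {
        ifstream fin(demoFileName(prefix, n));
        if (!fin.is_open()) {
            break;  // no more files, so leave the loop without an error message
        }
        // A real driver would read and score the review here.
        ++count;
    }
    return count;
}
```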

Make sure your summary of the reviews is being written to the report.txt file, not cout.
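
As a small illustration of writing to a file instead of the terminal, the sketch below sends one line of text to an output file stream. The function name writeCountToFile is hypothetical, and this is not the full report; it only shows that text sent to an ofstream variable ends up in the file, while text sent to cout does not.

```cpp
#include <fstream>
#include <string>

using namespace std;

// Hypothetical example: writes a single summary line to the named file.
// Returns false if the file could not be opened for writing.
bool writeCountToFile(const string &fileName, int numReviews) {
    ofstream fout(fileName);
    if (!fout.is_open()) {
        return false;
    }
    // Using fout (not cout) sends the text to the file, not the terminal.
    fout << "Number of reviews: " << numReviews << endl;
    return true;
}
```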

You may write additional helper functions in evaluateReviews.cpp! (See the Additional Helper Functions section for some suggestions.) However, do not add additional functions to reviews.cpp that you intend to use in evaluateReviews.cpp, since this would require modifying the reviews.h file to contain their prototypes as well, which is prohibited for this project. Instead, put your own additional helper functions into the evaluateReviews.cpp file.

Details of the Summary Report

There is one output file for this program: report.txt. This file contains a summary report of your hotel review analysis. The summary report should contain:

Test Case for Evaluating Reviews

Test cases are very important for this project. You should first use unit tests for your helper functions so you know they are working correctly. Once you have verified that your helper functions work correctly, continue developing your code to eventually create the report.txt file that summarizes the analysis.

You are provided with 20 hotel reviews as test data. Reviews 0-9 are known to be truthful and reviews 10-19 are known to be deceptive, although your algorithm won’t be able to correctly categorize ALL of the reviews. The sample_report.txt file contains the correct output for your report.txt file. Make sure you can recreate what is in sample_report.txt exactly!

review score category
0 3.49 truthful
1 3.33 truthful
2 15.68 truthful
3 4.43 truthful
4 4.14 truthful
5 11.29 truthful
6 20.61 truthful
7 -2.89 uncategorized
8 2.71 uncategorized
9 11.93 truthful
10 0.03 uncategorized
11 -13.06 deceptive
12 -3.66 deceptive
13 -8.46 deceptive
14 5.18 truthful
15 -18.68 deceptive
16 -17.88 deceptive
17 -21.61 deceptive
18 -11.25 deceptive
19 -15.08 deceptive

Number of reviews: 20
Number of truthful reviews: 9
Number of deceptive reviews: 8
Number of uncategorized reviews: 3

Review with highest score: 6
Review with lowest score: 17

FAQs

Here are frequently asked questions about this project. If you are stuck on something, start here!

General FAQs

How do I read multiple different data types from a file?
C++ input operators are actually very smart. The extraction operator ( >> ) reads information up until it encounters whitespace (spaces, tabs, new lines) and then stores that information into the variable you specify. So if there is a space in between different types of data, we can use >> as many times as we need to read in each piece of data and store it in the appropriate variable as long as the variables are declared as the correct types beforehand.

For example, if we have a file named apples.dat with the data:
4 apples 425 grams 
We can read in each piece of information by declaring a variable to hold each piece of information and then using >> a bunch of times to read each piece of information into the appropriate variable:
    // declare variables to hold data
    int num;
    string fruit_type;
    double mass;
    string units;

    // open an input filestream 
    ifstream apple_data("apples.dat");

    // read each piece of data, in the order it appears in the file
    apple_data >> num >> fruit_type >> mass >> units;
    
You can also refer to Homework 16.11 for another example of this approach.
What does everyone mean by "inserting a cout statement" to debug something?
Putting a cout statement (along with any useful context) can help you figure out exactly where / why things are going wrong in your code. For example, if your program is stopping unexpectedly and you're not sure why, you can put "checkpoint" cout statements in your code like this:
// some code here

cout << "Checkpoint 1" << endl; 

// some more code

cout << "Checkpoint 2" << endl; 

// even more code 
This way, if you run your code and you only see Checkpoint 1 in your terminal, you know the issue is somewhere in the lines of code between those checkpoints. This helps you narrow down where any possible issues could be. This is very helpful with making sure if statements or loops run correctly!

See the lectures on debugging for more examples of using cout statements to debug errors.
My test cases aren't working when there are fewer than 100 reviews but the "all review files" test cases are correct. What am I doing wrong?
Remember to add code to end the loop early if a review file cannot be found. If there are no more files to open, the program should break out of the loop instead of trying to open a non-existent file. See the FAQs about how to end a loop early / how to know how many times to loop through the files for more information.
My max and min are off for some test cases on the autograder.
Revisit the Finding the "Best" Element and the Finding the Index of the "Best" Element common patterns in Homework 17. These common patterns will help you handle cases such as "all negative reviews", "all positive reviews", or even when you have reviews with scores that are all 0.
My output is the same, but the autograder gives me a 0.
Here are some things to check:

  • Check for common misspellings (e.g. "reviews" misspelled as "reveiws" and other misspellings)
  • Check for correct whitespace around words (spaces, newlines, etc.). You can toggle the Autograder output to show the whitespace characters to help with this.
  • Check for the correct return value from your main() function. Remember that if a required file cannot be opened, you should be returning a value of 1 (to indicate an error so the user knows something is wrong).
  • Double check that you are submitting the correct file! Sometimes, students create a duplicate by accident and submit the duplicate.
My code doesn't compile because of an 'undefined reference' or 'linker error'
You may have undefined references to the helper functions. Make sure you are including the proper header files and compiling with all the .cpp files.

Both the evaluateReviews.cpp and reviews.cpp files need to include the reviews.h header file so that the two files can be used together. So, both files need to have this line of code near the top of the file where the other #include statements are:
#include "reviews.h"
The compiler also needs to know about both the evaluateReviews.cpp file and the reviews.cpp file, since the reviews.cpp file is where your helper functions are. So you should be compiling with this line at the terminal:

g++ --std=c++11 evaluateReviews.cpp reviews.cpp -o evaluateReviews

Refer to the project specifications for more details on why you need to do this (and for the command to compile your unit tests).
I'm getting an out_of_range error or a segmentation fault (aka seg fault)!
Somewhere, you are indexing into a vector or a string and the index is "out of range". If you are using .at() for your indexing, then .at() is telling you that you are indexing out of range. If you are using [] for your indexing, it can be a little harder to track down where the indexing out of range is happening, but you can still do it. In general, you likely have one of these situations going on somewhere in your code:

  1. You could have an index variable that is negative. For example, if i is -1 and then you have a statement with a vector that includes .at(i), that would cause this error.
  2. You could have an index variable that is 0, but the vector is empty. For example, if i is 0 and then you have a statement with an empty vector that includes .at(i), that would cause this error.
  3. You could have an index value that is too high. For example, if i is 23 but your vector only has 19 elements, and you have a statement with this vector that includes .at(i), that would cause this error.
Track down where you are indexing out of range by putting in some cout statements with messages like, "About to call helper function XXX...." or "This is line XX". After you save, compile, and run your program, you will see your messages print to the terminal. Eventually, you will see that "out of range" error. You can look to see what was the last message to print, and then you know that your out of range indexing is somewhere between the last message that printed and the next message (that did NOT print).
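
One way to combine the three situations above with a debugging message is sketched below. The helper indexIsValid is hypothetical and is meant only as a temporary debugging aid: it reports whether an index is safe to use with a given vector and prints a message when it is not.

```cpp
#include <iostream>
#include <string>
#include <vector>

using namespace std;

// Hypothetical debugging helper: returns true if index i can safely be used
// with the vector v. Prints a message and returns false if i is negative,
// if v is empty, or if i is too high.
bool indexIsValid(const vector<string> &v, int i) {
    if (i < 0 || i >= static_cast<int>(v.size())) {
        cout << "Index " << i << " is out of range for a vector of size "
             << v.size() << endl;
        return false;
    }
    return true;
}
```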
My code compiles with no errors and the executable is built, but when I run the executable the report.txt file is not generated.
Check that you don't have a random return (or return 1;) statement somewhere it is not supposed to be. You should only have return 1; if the keywordWeights.txt file cannot be opened.

You can also add some cout statements to find how far into your program you get before the program ends; that will help you narrow down where things are going wrong. 

Task 1 FAQs

What does it mean by "Make a copy of the review to work within this function?" Am I supposed to make a copy of the review within the function?  Or are we supposed to make a physical copy of the review file on our computer to test the function?
Make a copy of the vector that has the review words in it. In other words: make a new vector of strings, and then assign to it the values of the vector that contains the words in the review file. This way, you can call the preprocess function on the "copy of the review" (the new vector you made) without losing access to the original vector.
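
As a minimal sketch of copying a vector before changing it: in the example below, addExclamation is a hypothetical stand-in for any function that modifies the vector it is given, and excitedCopy is a hypothetical function that works on a copy so the original stays untouched.

```cpp
#include <string>
#include <vector>

using namespace std;

// Hypothetical stand-in for a function that changes its vector parameter.
void addExclamation(vector<string> &words) {
    for (size_t i = 0; i < words.size(); ++i) {
        words[i] += "!";
    }
}

// Returns a modified copy of the words. The original vector is not changed.
vector<string> excitedCopy(const vector<string> &original) {
    vector<string> copy = original;  // copies every element into a new vector
    addExclamation(copy);            // only the copy is modified
    return copy;
}
```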
Why do I need to make a copy of the review in reviewScore?
Take a look at the function header of reviewScore, paying attention to the parameter types. All of the parameters are vectors and are being passed by const reference, so the function works with the same vectors your driver program holds and is not allowed to change them. If we could preprocess the original review directly, instead of making a copy, it would permanently change the review in your driver program as well. What problems could this cause later on?
My wordWeight function always returns 0
Functions in C++ return a value once, so trace through your code to make sure you are not returning too early before you've iterated through the entire vector.
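
The "returning too early" mistake is easiest to see side by side with a corrected version. The two functions below are hypothetical examples on a different problem (checking whether a vector contains a word); they differ only in where the "not found" return is placed.

```cpp
#include <string>
#include <vector>

using namespace std;

// BUGGY: the else branch returns during the first iteration, so only
// words[0] is ever compared against the target.
bool containsBuggy(const vector<string> &words, const string &target) {
    for (size_t i = 0; i < words.size(); ++i) {
        if (words[i] == target) {
            return true;
        } else {
            return false;  // too early: the rest of the vector is skipped
        }
    }
    return false;
}

// FIXED: "not found" is only returned after the whole vector is checked.
bool containsFixed(const vector<string> &words, const string &target) {
    for (size_t i = 0; i < words.size(); ++i) {
        if (words[i] == target) {
            return true;
        }
    }
    return false;
}
```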

Task 2 FAQs

How do you know how many times to loop if we don’t know how many files we will need to read?
We know that we will have at most 100 files, so we can make a for loop that loops 100 times. We will need to break out of the loop if a file does not exist. For example, if there are 20 reviews, then the 21st review (which would be file review20.txt) does not exist. So we can have an if statement that checks whether the file did not open correctly and, if so, use break to exit out of the loop early.

See Homework 14.6 for an example of using break to exit a loop early. See the Redact Info program in the Project 3 Overview lecture for an example of checking to see if a file could be opened.
I'm getting all zeros for the review scores.
There could be many reasons for this, but make sure that your reviews files are in the same folder as your .cpp files so that your program can find those files and open them. This is a common reason for your program running but behaving incorrectly.

Other reasons include: out of scope variables, accidentally resetting your variables to 0, and not including the correct header files.
My scores are always increasing for some reason.
Be sure to start your analysis of each review with an empty review words vector. You can either keep one vector and empty it each time your program reads in a new file, or you can use a vector that is only within scope of the loop so that it gets "erased" and then created anew each time the program goes through the loop.
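
The second option (a vector that only exists inside the loop) can be sketched as follows. This is a hypothetical example on a simpler problem: wordsPerLine counts the words in each line of text separately, reading the words with an istringstream in place of a file stream. Because the words vector is declared inside the loop, it starts out empty for every line.

```cpp
#include <sstream>
#include <string>
#include <vector>

using namespace std;

// Hypothetical example: returns the number of words in each line.
vector<int> wordsPerLine(const vector<string> &lines) {
    vector<int> counts;
    for (size_t i = 0; i < lines.size(); ++i) {
        // Declared inside the loop, so it is empty at the start of every
        // iteration. Declaring it before the loop (and never emptying it)
        // would make the counts keep growing from one line to the next.
        vector<string> words;
        istringstream in(lines[i]);
        string word;
        while (in >> word) {
            words.push_back(word);
        }
        counts.push_back(static_cast<int>(words.size()));
    }
    return counts;
}
```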
I never enter the loop to check all the reviews.
There could be many reasons for this, but a common reason is that you don't have the review files in the same folder as your .cpp files, so your program can't find the review files in order to open them. First, make sure that you have unzipped/extracted the reviews from the .zip folder. Then, make sure that all 20 review files (review00.txt -- review19.txt) are in the same folder as your .cpp files. If you have done this correctly, you will see the review files in the same folder as your .cpp files. You should not have them in a separate folder. See the comparison below:

GOOD: The review files are in the same folder as the .cpp files.
BAD: The review files are in a subfolder called reviewFiles that is within the folder where the .cpp files are. If your workspace looks like this, move the review files out of the subfolder and into the main folder with your Project 3 code.
My program runs but I just don't get correct behavior / values.
There could be many reasons for this, but make sure that your reviews files are in the same folder as your .cpp files so that your program can find those files and open them. This is a common reason for your program running but behaving incorrectly. Also, read the other FAQs about getting incorrect values.
How do I know when I've read all the reviews / know when to stop?
See Homework 16.7, the video File Input/Output with Streams. The video recap also has a note about how you can check whether a file failed to open. You can also see Homework 14.6 for how to exit a loop early using a break statement if a condition is met.
I sometimes get really small values with my min and max review numbers.
Check that all of your variables are initialized when you declare them, or at least have been assigned a value before you first use the variable in an expression. If you have done this, think about what the min and max should be initialized to. See Homework 17.13 -- Finding the Index of the "Best" Element for a helpful example.
My value for review with highest/lowest score in the report is way off.
When you are determining the highest score in the set of reviews, make sure you are also keeping track of the index/number of the review with the highest/lowest score, not just the value of the max/min score itself. See Homework 17.13 -- Finding the Index of the "Best" Element for a helpful example.
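
A minimal sketch of tracking the index of the largest value is shown below. The function name indexOfMax is hypothetical, and the sketch assumes the vector has at least one element. It starts from the first element instead of a made-up starting value such as 0.0, which is what keeps it correct when every value is negative or every value is 0.

```cpp
#include <vector>

using namespace std;

// Hypothetical example: returns the index of the largest value.
// Assumes values contains at least one element. If there is a tie,
// the index of the first largest value is returned.
int indexOfMax(const vector<double> &values) {
    int best = 0;  // start by treating the first element as the best so far
    for (size_t i = 1; i < values.size(); ++i) {
        if (values[i] > values[best]) {
            best = static_cast<int>(i);  // remember the index, not the value
        }
    }
    return best;
}
```

The index of the smallest value can be found the same way by comparing with < instead of >.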