# Project 3: Spaceport Reviews

Project Checkpoint and Project Partnership Registration Deadline: Tuesday, November 8, 2022
Project Due: Tuesday, November 15, 2022

Engineering intersections: Data Science/Industrial & Operations Engineering/Computer Science

Implementing this project provides an opportunity to use C++ file I/O, functions, strings, and vectors.

The autograded portion of the final submission is worth 100 points, and the style and comments grade is worth 10 points, for a total of 110 points.

You may work alone or with a partner. Please see the syllabus for partnership rules.

# Educational Objectives and Resources

This project has two aspects. First, this project is an educational tool within the context of ENGR 101, designed to help you learn certain programming skills. This project also serves as an example of a project that you might be given at an internship or job in the future; therefore, we have structured the project description in a way that is similar to the kind of document you might be given in your future professional capacity.

## Educational Objectives

The purpose of this project is to help you strengthen your skill at working with a large set of textual data, including writing and calling helper functions to help process the data. The C++ programming language is very good at processing text data quickly and efficiently, so it is a good choice for this type of work.

This project uses a simple type of Natural Language Processing (NLP). A short introduction to NLP, including the approach used in this project, is in the next section.

## Engineering Concepts - Natural Language Processing (NLP)

Natural Language Processing (NLP) is a field at the intersection of computer science and linguistics that designs computer programs that can interpret, categorize, and act on human languages. Human languages are incredibly complex and flexible, and designing a computer program that can understand and respond to even a subset of a single language is notoriously difficult. However, researchers have continually pushed forward in this field, and now NLP has wide-ranging applications including translation between different human languages, virtual assistants, and automated feedback on writing assignments in large courses.

One of the most basic forms of NLP is to search through a body of text looking for keywords, and condition a program’s behavior on the presence or frequency of the keywords. Keywords may also have a numeric weight, which determines how significant they are. For example, say you are interested in determining the type of pet a person has, based on a written description. So, you ask a bunch of people to describe their pets. Some people tell you what kind of pet they have (e.g. duck, rabbit, horse, cat, dog, snake, etc.), but some don’t. Some of them give you sort of a “review” of their pet, and so you need to figure out what kind of pet they have based on a set of keywords.

Let’s say you want to determine whether someone has a duck for a pet. You could analyze keywords of pet descriptions that you have verified are about ducks and not-ducks, and then assign a weight to each keyword based on how strongly it is associated with ducks as pets (Table 1). Positive values are associated with ducks; negative values are associated with not-ducks. The larger the values (regardless of positive/negative), the stronger the association.

You could then read in another pet’s description and find occurrences of the keywords in the list. If the keyword occurs, add its weight to a total running “score” for the pet description. If the total score is positive, it is categorized as a duck. If the total score is negative, it is categorized as a not-duck. See Table 2 for a few examples of this.

Table 1. Example of keywords and weights for describing ducks as pets.

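| Keyword | Weight | Keyword | Weight |
|---|---|---|---|
| beak | 1.2 | large | -2.5 |
| egg | 1.5 | legs | 0.5 |
| eyes | 0.3 | loud | -0.1 |
| feather | 0.7 | small | 0.6 |
| fins | -2.5 | tail | 0.4 |
| fluffy | -0.3 | teeth | -0.5 |
| fur | -1.0 | water | 1.4 |
| gills | -3.2 | wings | 4.2 |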

Table 2. Example analysis of pet descriptions using the set of keywords and weights for describing a duck. (In case you’re wondering, Pet #3 is a Great Malay Argus.)

| Pet | Description | Keywords Found (Weights) | Total Score | Categorization |
|---|---|---|---|---|
| Pet #1 | I have to make sure to clean her tank every day, so the water doesn't get dirty. I love to watch her gills move as she breathes and her tail and fins move as she swims. Sometimes, though, I wish she had legs so I could take her for a walk. | water (1.4), gills (-3.2), tail (0.4), fins (-2.5), legs (0.5) | -3.4 | Not Duck |
| Pet #2 | There are fluffy feathers everywhere all the time! He chipped his beak on a large rock the other day. I think I startled him, because his wings flew wide, and then he lost his balance and that's when he chipped his beak. | fluffy (-0.3), feathers (0.7), beak (1.2), large (-2.5), wings (4.2), beak (1.2) | 4.5 | Duck |
| Pet #3 | This bird is insane. Why did no one tell me it would be so LOUD?? Also, it grew enormous. There should be a warning when you buy it that this is a large bird. Not some cute small fluffy thing you can hold on your finger. :( | loud (-0.1), large (-2.5), small (0.6), fluffy (-0.3) | -2.3 | Not Duck |

As shown here, the choice of keywords and weights is crucial. Therefore, you must make sure that your keywords and weights come from a verified set of data and that they have been analyzed properly. (For more background on this, see: Rozado, David. “Wide Range Screening of Algorithmic Bias in Word Embedding Models Using Large Sentiment Lexicons Reveals Underreported Bias Types.” PloS One, vol. 15, no. 4, PUBLIC LIBRARY SCIENCE, 21/4/2020, p. e0231189, doi:10.1371/journal.pone.0231189)
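The keyword scoring procedure described above can be sketched in a few lines of C++. This is only an illustration under stated assumptions: the names keywordWeightOf and descriptionScore are hypothetical, the description is assumed to be already split into lowercase words, and a word that is not a keyword is assumed to contribute 0 to the score.

```cpp
#include <string>
#include <vector>
using namespace std;

// Hypothetical helper: returns the weight of 'word' if it appears in 'keywords',
// or 0 if it is not a keyword. 'keywords' and 'weights' are parallel vectors.
double keywordWeightOf(const string &word, const vector<string> &keywords,
                       const vector<double> &weights) {
    for (size_t i = 0; i < keywords.size(); ++i) {
        if (keywords[i] == word) {
            return weights[i];
        }
    }
    return 0.0;
}

// Hypothetical helper: adds up the weights of every word in a description.
double descriptionScore(const vector<string> &words, const vector<string> &keywords,
                        const vector<double> &weights) {
    double score = 0.0;
    for (size_t i = 0; i < words.size(); ++i) {
        score += keywordWeightOf(words[i], keywords, weights);
    }
    return score;
}
```

With the Table 1 weights and the words of Pet #1's description, descriptionScore should come out to about -3.4, which is negative and therefore categorized as a not-duck.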

Note: An NLP program that categorizes things based on a keyword search is only as good as its ability to correctly predict things based on those keywords. The percentage of the predictions that are correct is one simple measure of how well an NLP program is working. A perfect NLP program would get 100% of its predictions correct. An NLP program that is no better than random chance would get 50% of its predictions correct.
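As a rough illustration of this accuracy measure, the sketch below computes the percentage of predictions that match a set of verified labels. The function name percentCorrect and the use of bool values for the two categories are assumptions made for this example only.

```cpp
#include <vector>
using namespace std;

// Hypothetical helper: percentage of predictions that match the verified labels.
// Assumes both vectors have the same length; returns 0 if there are no predictions.
double percentCorrect(const vector<bool> &predicted, const vector<bool> &actual) {
    if (predicted.empty()) {
        return 0.0;
    }
    int numCorrect = 0;
    for (size_t i = 0; i < predicted.size(); ++i) {
        if (predicted[i] == actual[i]) {
            ++numCorrect;
        }
    }
    return 100.0 * numCorrect / predicted.size();
}
```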

This is a big picture view of how to complete this project. Most of the pieces listed here also have a corresponding section later in this document that provides more detail.

• Read through the project specification (DO THIS FIRST)
• Read the contents on the left so you understand the organization of these specs.
• Understand the data sources/files and how to use them; go to office hours if you do not.
• Understand the project tasks at a high level: what functions do what parts of the tasks? What tasks does the driver program (your int main() function) do? Go to office hours if you are not sure about anything.
• Understand the algorithms provided for the different project tasks; go to office hours if you have any questions.
• Sketch out a plan for how you want to write your program to implement the project tasks. Show your plan to course staff during office hours so we can help you more efficiently!
• Put all of this stuff in the same folder/directory on your computer (otherwise your program won’t be able to run)
• Register your partnership on the Autograder (if you are going to work with a partner)
• PROJECT CHECKPOINT: Implement and test the reviews.cpp library, including:
    • Correctly implementing the readKeywordWeights function
    • Correctly implementing the readReview function
    • Correctly implementing the wordWeight function
    • Correctly implementing the reviewScore function
    • Using the unit_tests.cpp program to test these functions for correct behavior
• Implement and test the driver program, including:
    • Correctly implementing the evaluateReviews.cpp program
    • Verifying that your report.txt file exactly matches the sample_report.txt file
• Double check style and commenting. See the Submission and Grading section for information on style grading.
• Submit all files to the autograder.

## Suggested Project Timeline

Below is a suggested project timeline that you may use. You can adjust the dates around any commitments or classwork that you have going on in other courses.

Date: What to have done by this day

Friday, November 4
• Have detailed notes on the project specs
• Know how to use the data sources/files
• Project folder is set up on your computer and has all data sources/files, starter code, and unit test scripts
• Have a list of code examples ready for you to use as templates for tasks in this project. Look at the Runestone "Common Patterns", Runestone exercises, Lab exercises, and Lecture examples.

Tuesday, November 8
• Project Checkpoint Completed: readReview, readKeywordWeights, wordWeight, and reviewScore functions are written, tested, and debugged by today (includes submitting to the autograder)
• Project Partnership Registered on Autograder by today

Friday, November 11
• The evaluateReviews.cpp driver program can read in keywords and weights, read in a review, and evaluate the review's score.
• Have a plan for how to expand this process to handle multiple reviews, keeping track of the review with the highest score and the review with the lowest score.

Monday, November 14
• The evaluateReviews.cpp driver program is written, tested, and debugged (including submitting to the autograder)
• At least one submission to the autograder includes all required files submitted and all test cases passed
• Code has been double-checked for quality: style and commenting

Tuesday, November 15
• Project due!
• Verify that everything has been submitted to the autograder correctly.

## Things to Know Before You Get Started

### Tips and Tricks

Here are some tips to (hopefully) reduce your frustration on this project:

• Make sure to place all of your data sources (the .txt files) in the same directory as the .cpp files and the reviews.h file.

• Read the Project 3 FAQ Post on Piazza. This post has many common questions from students, so read these questions and answers before you start programming.

• Test each step of your program before you move on to the next step. For example, make sure you get the correct score of one review before you try to write a loop that goes through all of the reviews. Writing a program is a process of continuous revision. It’s better (and easier in the long run) to start small, verify the program is working correctly, and then continue to add small steps as you go.

• Use cout statements to check the values of a review’s score, a review’s category, and variables that you are using to track things in your program. Use cout statements to check the value that a function returns to make sure it’s working correctly. Use cout statements everywhere! Just remember to delete them once you no longer need them so that your program doesn’t print out unnecessary information.

• The reviews.cpp file includes starter code for the four functions you need to write, but there are also some additional helper functions that have already been written for you. Don’t forget to use these helper functions when implementing your functions!

• There are four helper functions that you are required to write (these are the functions in the reviews.cpp file). You can absolutely write more helper functions of your own, though! If you do, place these helper functions in the evaluateReviews.cpp file. (Note: This is just because of how the autograder is set up. Normally, you would have complete control over your library of custom functions, but we have to make some concessions when we have to somehow grade hundreds of students’ code!)

### Writing to a File in a Loop

This project requires you to write to a file from within a loop. Sometimes, during the course of development, an infinite loop may slip through your careful debugging and cause a file to grow significantly larger than you would want. Here are a few hints to detect if this is happening:

• Your program takes longer than a second to run. This project shouldn’t take more than a second (two at the absolute most).

• Your report.txt file takes up a lot of disk space.

If your program takes longer than 2 seconds to run, type CTRL+C to cancel the running program.

If you are on your own computer, check how big the report.txt file is. If it is more than, say, 30 MB, then you have an infinite loop. Delete report.txt.

If you are on a CAEN machine, navigate into the directory where your Project 3 files are located, type ls -lh at the linux terminal, and check the size of the files, like this:

bash-4.2$ ls -lh
total 1.1G
...
-rwxr-xr-x. 1 your_uniqname users 9.4K Nov 18 18:11 evaluateReviews.cpp
-rw-r--r--. 1 your_uniqname users 223 Nov 15 09:22 readKeywordWeights.cpp
-rw-r--r--. 1 your_uniqname users 356 Nov 16 12:56 readReview.cpp
-rw-r--r--. 1 your_uniqname users 1.0G Nov 18 18:20 report.txt
-rw-r--r--. 1 your_uniqname users 447 Nov 16 15:04 reviewScore.cpp
...

The report.txt file listed above has a file size of 1.0 GB (yikes!) and should be deleted. To delete the file, use the rm command:

bash-4.2$ rm report.txt


### Passing Filestreams to Functions

One of the required functions needs a filestream passed to it, and you may potentially write a helper function or two that also requires a filestream to be passed in. Remember that filestreams are linked to a specific file; therefore you need to pass the filestreams by reference (not pass by value) so that the function has access to the file that was already opened by your program.

### Undefined Values for Variables

Using a variable that has been declared, but does not yet have a value assigned to it, can cause unexpected behavior from your program. Similarly, if you write a function that has a return variable (e.g. an int, double, bool, etc. function), and you try to return a variable that does not have a value – or you forget a return statement entirely – then you can also see unexpected behavior. Some compilers will automatically assign a zero to some types of data, but others will not; therefore, you should always properly assign/initialize values to your variables. If you see that your program outputs big weird numbers on the autograder, it’s likely due to an uninitialized variable.
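The short sketch below shows the habit described above: give every variable a starting value and make sure the function returns a value on every path. The function name countNegative is hypothetical and is used only for illustration.

```cpp
#include <vector>
using namespace std;

// Hypothetical helper: counts how many scores are negative.
int countNegative(const vector<double> &scores) {
    // Initialized to 0. Writing only "int count;" would leave the value
    // indeterminate, and the final answer could be a big weird number.
    int count = 0;
    for (size_t i = 0; i < scores.size(); ++i) {
        if (scores[i] < 0) {
            ++count;
        }
    }
    return count;  // the function always reaches this return statement
}
```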

### Finding the Minimum/Maximum of a Set of Data

When searching, or keeping track of, the maximum/minimum of a set of data (such as the score of the hotel reviews), you generally are comparing the current value to whatever is the “current highest” or the “current lowest” value. However, for the first value, you don’t yet have anything to compare it to. A common best practice is to initialize the “current highest” and/or “current lowest” value to be the first value in your dataset. This way, no matter what the actual values in your dataset are, you will always be able to find the maximum or minimum value.
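One way to apply this practice is sketched below, assuming the scores are already stored in a non-empty vector. The function name indexOfHighest is hypothetical. Starting from the first value means the search still works when every score is negative, which would not be the case if the "current highest" started at 0.

```cpp
#include <cstddef>
#include <vector>
using namespace std;

// Hypothetical helper: returns the index of the highest score.
// Assumes 'scores' contains at least one value.
size_t indexOfHighest(const vector<double> &scores) {
    size_t best = 0;  // the first value is the "current highest" to start with
    for (size_t i = 1; i < scores.size(); ++i) {
        if (scores[i] > scores[best]) {
            best = i;
        }
    }
    return best;
}
```

Finding the lowest score follows the same pattern with the comparison flipped to less-than.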

There are four required helper functions for this project that you have to write plus two helper functions that are provided to you; however, you can (and should!) look to abstract other chunks of code into helper functions. Helper functions make your code easier to read and understand and easier to debug. Some ideas for other helper functions are functions that would:

• Categorize a review
• Find the review with the highest score
• Find the review with the lowest score
• etc.

It is up to you how you want to abstract portions of your code; there is no “right” answer and no “wrong” answer. The only reason we’re requiring a few specific helper functions for this project is to enforce practice with abstraction and function writing. In general, you can design your functions however you like!

### Pass by Value vs. Pass by Reference

When writing your own helper functions, always consider whether to pass parameters by value or by reference (including const reference). If you are unsure, refer back to the Runestone chapter that had the decision tree about pass by value vs. pass by reference.

This project has three deliverables: reviews.h, reviews.cpp, and evaluateReviews.cpp. See the Deliverables section for more details.

After the due date, the project will be graded in two parts. First, the Autograder will evaluate your submission for a maximum score of 100 points. Second, one of our graders will evaluate your submission for style and commenting for a maximum score of 10 points. Thus, the maximum total number of points on this project is 110 points.

You should still submit reviews.h, even though you aren’t supposed to modify it - the autograder will double check that the file is the same to verify you haven’t accidentally changed it, since this could mess up the rest of your code.

### Submitting Prior to the Project Deadline

Submit the .cpp files and .h file to the Autograder for grading. You do not have to wait until you have all of the files ready before you submit to the Autograder for the first time. In fact, we recommend that as you complete tasks for this project, you continually submit those files to the autograder for feedback. You can submit a subset of files and get feedback on the test cases related to those files. However, to receive full credit, you need to submit all files and pass all test cases within a single submission.

The autograder will run a set of public tests - these are the same as the test scripts provided in this project specification. It will give you your score on these tests, as well as feedback on their output.

The autograder also runs a set of hidden tests. These are additional tests of your code (e.g. special cases). You will still see your score on these tests, and you will receive some feedback on any cases that your code does not pass.

You are limited to 5 submissions on the Autograder per day. After the 5th submission, you are still able to submit, but all feedback will be hidden other than confirmation that your code was submitted. The autograder will only report a subset of the tests it runs. It is up to you to develop tests to find scenarios where your code might not produce the correct results.

You will receive a score for the autograded portion equal to the score of your best submission.

Your latest submission with the best score will be the code that is style graded. For example, let’s assume that you turned in the following:

| Submission # | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| Score | 50 | 100 | 100 | 75 |

Then your grade for the autograded portion would be 100 points (the best score of all submissions) and Submission #3 would be style graded since it is the latest submission with the best score.

Here is the breakdown of points for style and commenting (max score of 10 points):

• 2 pts - Each submitted file has Name, Partner Uniqname (or “none”), Lab Section Number, and Date Submitted included in a comment at the top

• 2 pts - Comments are used appropriately to describe your code (e.g. major steps are explained)

• 2 pts - Indenting and white space are appropriate (including properly formatted functions)

• 2 pts - Variables are named descriptively

• 2 pts - Other factors (Variable names aren’t all caps, etc…)

### Submitting after the project deadline

If you need to submit your project work after the deadline, you can submit to the “Late Submission” project assignment on the Autograder. The late submission assignment includes all of the same test cases as the original assignment, but the points have been adjusted down a small amount per the syllabus’ flexible deadline policy.

Your project score at the end of the semester will be whichever is the higher score between the original project assignment and the late submission assignment. You will never be penalized for submitting to the late submission version of the project.

### Test Case Descriptions and Error Messages

Each of the test cases on the Autograder is testing for specific things about your code. Often, it’s checking to see if your programs can handle “special cases” of data, such as: a different number of reviews, a different set of keyword weights, the reviews are all truthful, the reviews are all deceptive, etc.

Each of the test cases on the Autograder has a description of what it’s checking for. If your program fails a test case, the Autograder will sometimes be able to give you some advice on how to go about debugging your code. It’s like you have a friendly GSI or IA giving you immediate help!

### Checking If report.txt Is Correct

Within the Autograder tests, you can click to expand the “Check report.txt” test and see a window that compares the expected output for the file (on the left) to your output file (on the right). The Autograder is very particular about having the output match perfectly. This means every character and every whitespace must match – so beyond the score values and category, be sure each word in your report.txt is spelled correctly and has matching spaces and matching new lines. Lines that don’t correspond with the expected output will be highlighted, and you can click “Show Whitespace” to see the right spacing and new line placement.

# Project Overview

The purpose of this project is to give you a chance to practice file input and output (file I/O) with file streams and to practice writing your own functions. It also gives more practice using strings and vectors.

## Background and Motivation

Proxima b’s new spaceport has been up and running for 6 months. The company that owns the spaceport wants to analyze the online reviews of the spaceport so that they can improve their customers’ experiences. However, they suspect that only some of the reviews are from actual customers, and that the others are fake! Unfortunately, it is difficult for humans to distinguish between truthful and deceptive reviews – studies show success rates of approximately 50-60%, which is not much better than guessing. It would also be very time consuming to examine each review by hand.

Instead, the company wants to use an automated text-classification algorithm based on machine learning and natural language processing techniques. A challenge with any machine learning approach is finding high-quality training data. Training data is data that has been independently analyzed and verified, so you can use the training data to check whether a new program is working correctly or not.

Fortunately, a dataset of carefully verified truthful and deceptive reviews is available, thanks to an ancient study from 2011 EY (Earth Years) that investigated reviews of Chicago hotels. Based on the data available in this study, data scientists working for the company have developed a set of keywords that indicate either truthfulness or deception.

The Earthlings who originally studied the hotel reviews also developed a website where people could put in a review and the review would be evaluated for truthfulness or deceptiveness. Alas, this site is no longer accessible, but old news articles of the discovery have been recovered, one of which is available here. This article shows a high level idea of what the company wants to do.

Your job is to write a “proof of concept” program that evaluates reviews for truthfulness or deception. This “proof of concept” program will use the hotel reviews (not the spaceport reviews) because you want to work with the training data first in order to verify that your program works correctly. Your program should evaluate each of the hotel reviews and categorize them as truthful or deceptive using an NLP Keyword Search. The program should then print out a summary report of the reviews and identify the review with the highest score (the “most truthful review of the dataset”) and identify the review with the lowest score (the “most deceptive review of the dataset”).

The next step in this process would be to apply the program you write to evaluate the spaceport reviews. But that is a hypothetical scenario which we will NOT be doing for this project. You are ONLY looking at hotel reviews.

## Data Sources/Files

These sections describe the different sets of data you have available for implementing and testing your programs for this project. Make sure you understand the different formats and layouts of the data in the data files before you start to work with the data itself.

### Keywords and Weights File

The keywordWeights.txt file (click filename to download) contains a list of keywords and weights that resulted from analyzing an initial set of verified hotel reviews as training data. Each keyword will be on its own line, and there will be no multi-word phrases (e.g. “good” and “view” instead of “good view”). Here is a schematic of what the keywordWeights.txt file looks like:

<keyword 1> <score 1>
<keyword 2> <score 2>
<keyword 3> <score 3>
<keyword 4> <score 4>
<keyword 5> <score 5>
<...>


Read in all the keywords and weights in the file, no matter how many keywords there are, and store them in parallel vectors of string variables and double variables.

### Hotel Review Files

The reviewFiles.zip file (click filename to download) contains a set of 20 hotel reviews for you to use in testing your program. The hotel reviews all have a similar naming convention of review00.txt, review01.txt, etc. Reviews 0-9 are actually truthful; Reviews 10-19 are actually deceptive. (However, the algorithm you implement doesn’t categorize them all perfectly correctly, as you will see.)

Make sure you actually unzip the reviewFiles.zip file to get to the individual review files. If you are on a Mac, double-click the .zip file to automatically “unzip” the file and make a folder with a bunch of .txt files in it; move the files to your Project 3 folder. But Windows will often let you double-click on the .zip file and see the files but not actually unzip the files… which means your computer can’t actually access the files yet. Instead, right-click on the file and select “Extract All” to unzip/uncompress the file to actually get access to the .txt files. If you have trouble with this, please come to office hours!

Each hotel review is stored in its own file, e.g. review18.txt. The text is in “paragraph” form; here is a schematic of what a hotel review file looks like:

<word 1> <word 2> <word 3> <word 4> <word 5>
<word 6> <word 7> <word 8> <word 9> <word 10> <word 11>
<word 12> <word 13> <word 14> <...>


To work with a hotel review, open the file, read in each word, and store the words in a vector of string variables.

On the Autograder, you are guaranteed that there will always be at least one review and that there will be no gaps in the numbering of the hotel reviews. There will be no more than 100 hotel reviews.

## Deliverables

This project has three deliverables:

File Description
reviews.h a C++ header file for the reviews.cpp library
(This file is already provided to you, but it is part of this program's set of files, so it is considered a deliverable.)
reviews.cpp a C++ file that contains all of the required helper functions used in the project
evaluateReviews.cpp a C++ file that contains your main() function and is the driver program file. Any helper functions that you write in addition to the required ones in reviews.cpp should be correctly declared and defined in the evaluateReviews.cpp file.

## Starter Code and Test Programs

The starter code files contain some code that is already written for you, and you will write the rest of the code needed to implement the tasks described in the Project Task Description section. The starter code files have _starter appended to the file’s name so that if you want to download a fresh copy of the starter code, you won’t accidentally overwrite an existing version of the file that may have some code you have written in it. Remember to remove the _starter part of the filename so that your programs will run correctly!

The Reviews library has several functions that you will be using when you write your driver program in evaluateReviews.cpp. You should carefully test the functions you write in reviews.cpp before trying to use them in evaluateReviews.cpp. Here is a driver program that contains some basic tests for some of the functions in reviews.cpp (click the filename to download):

You can (and should!) add additional tests to the unit_tests.cpp program to thoroughly understand and test all of the functions in the Reviews library. See the Testing Your Functions Section for more information about how to use the unit_tests.cpp file.

Once you have verified that all of the functions in the Reviews library work, you can start working on the driver program that is evaluateReviews.cpp. The driver program creates a summary report saved as report.txt. Here is a sample report that shows what your report.txt file should look like if your program works correctly (click the filename to download):

See the Test Case for Evaluating Reviews for more information about how to use the sample_report.txt file.

## Compiling and Running the Program

To compile the program, use the following compile command:

g++ -std=c++11 -Wall -pedantic evaluateReviews.cpp reviews.cpp -o evaluateReviews

• The starter code for the project uses some features only available in C++11 (a more modern version of the language), so we need the -std=c++11 flag.
• The -Wall flag turns on a broad set of compiler warnings; this will help you catch bugs. Warnings are diagnostic messages that report things in your code that are not inherently erroneous but that are risky or suggest there may have been an error (or may cause an error later on when you run the program). For example, these warnings can flag some variables that are used before being initialized, which may cause you to fail a test case on the Autograder.
• The -pedantic flag tells the compiler to warn about code that relies on non-standard compiler extensions, since such code might not work on other people’s computers (including the autograder).
• Both evaluateReviews.cpp and reviews.cpp are included in the compile command, but not reviews.h. Header files are incorporated using #include at the top of .cpp files, but are never provided to the compile command directly.

Once compiled, the program can be run from the command line with:

./evaluateReviews

1. Create a library of helper functions to assist with processing the hotel reviews
2. Evaluate hotel reviews as truthful, deceptive, or uncategorized, and create a summary report of the analysis

These tasks are described in more detail in the next sections.

The first task in this project is to implement a library of several helper functions that support working with and processing the hotel reviews. A C++ library needs two files: a .h file that contains the interface for the library, and a .cpp file that contains the implementation of the library’s functions. Click the filenames to download the starter files for your Reviews library:

• reviews.h - Contains function declarations/signatures for the functions in the Reviews library. You can refer to this file for an overview of the functions in the library, but DO NOT change anything in reviews.h.

• reviews_starter.cpp - The actual implementations of the functions in the Reviews library. Some of the functions are written for you, and you will write the rest of the functions.

Don’t forget to remove _starter from the filename before you try to compile with this file!

### Functions in the Reviews Library

There are two functions in the Reviews library that are written for you: makeReviewFilename and preprocessReview. A brief description is here, and more details can be found in the comments in the code:

• makeReviewFilename - Returns the appropriate file name for a particular review number. For example, makeReviewFilename(0) returns "review00.txt" and makeReviewFilename(5) returns "review05.txt".
• preprocessReview - Modifies a review, represented as a vector of individual words, by changing each word to lowercase letters, removing punctuation, and replacing any strings representing numbers (e.g. “1”, “7”, “100”) with the string "<number>".
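To make the file naming convention concrete, here is a sketch of how a two-digit review filename could be built. The function exampleReviewFilename is a hypothetical stand-in written for this illustration; the provided makeReviewFilename may be implemented differently, and you should use the provided one in your project.

```cpp
#include <string>
using namespace std;

// Hypothetical stand-in for illustration only.
// Assumes 0 <= reviewNum <= 99, so the number fits in two digits.
string exampleReviewFilename(int reviewNum) {
    string digits = to_string(reviewNum);
    if (reviewNum < 10) {
        digits = "0" + digits;  // pad single-digit numbers with a leading zero
    }
    return "review" + digits + ".txt";
}
```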

You are responsible for writing the implementations of the remaining functions in reviews.cpp. There are four such functions, listed briefly here and described in more detail in the following sections:

• readKeywordWeights - Reads keywords and their weights from an input stream.
• readReview - Reads a review from an input stream into a vector of words.
• wordWeight - Finds the weight of a given word based on the keywords and their weights.
• reviewScore - Computes the score for a review by adding up the weights of its words.

These four functions are described below, and additional information is included as comments in the reviews.cpp starter file. Read these comments and consider them part of the project specification. You will also conduct unit tests on the individual helper functions so you can ensure they are working correctly.

### The readKeywordWeights Function

This is a function that will read in the keywords and their numerical weights from an input stream. The function stores each word in the text file in a vector of string variables and stores the corresponding weights in a vector of double variables, thereby creating two parallel vectors for the keywords and their weights.

#### Description

// Reads in keywords and corresponding weights from an input stream and stores them into
// the 'keywords' and 'weights' vectors in the same order as they appear in the file.
// NOTE: The keywords in the file have already been preprocessed (e.g. to remove punctuation),
//       so you do not have to do that here.
// PARAMETERS:
//   input - An input stream from which keywords and weights are read. For this project, we
//           assume the input stream is a file input stream, where the file format is that
//           provided in the project specification.
//   keywords - An "output parameter", passed by reference, into which the keywords are stored.
//   weights - An "output parameter", passed by reference, into which the weights are stored.
void readKeywordWeights(istream &input, vector<string> &keywords, vector<double> &weights) {
// TODO: Write an implementation for this function!
}


#### Algorithm

This function is passed a filestream connected to a text file, an empty vector of string variables, and an empty vector of double variables. The vector of strings gets “filled up” by the words in the text file, and the vector of doubles gets “filled up” by the numbers in the text file.

Runestone includes two common patterns that are particularly applicable for this function:

• Strings, Streams, and I/O - Reading In Multiple Pieces of Data
• Vectors - “Fill As You Go”

Review the examples in Runestone for these patterns and consider how to adapt the examples for what you need to do here.
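As a rough illustration of how the two patterns combine, the sketch below reads word/number pairs from any input stream into two parallel vectors. The names readPairsSketch and checkReadPairsSketch are hypothetical and are not part of reviews.h, and the sketch assumes each entry is a word followed by a number, separated by whitespace. Check the keyword file format given in the project specification before relying on that assumption.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

using namespace std;

// Hypothetical helper (not part of reviews.h): reads "word number" pairs
// from an input stream and appends them to two parallel vectors.
// ASSUMPTION: each entry is a word followed by a number, separated by whitespace.
void readPairsSketch(istream &input, vector<string> &words, vector<double> &numbers) {
    string word;
    double number;
    // The chained extraction succeeds only when BOTH pieces are read,
    // so the loop stops once the input runs out.
    while (input >> word >> number) {
        words.push_back(word);      // "fill as you go"
        numbers.push_back(number);  // same index as the word just stored
    }
}

// Small self-check that feeds the helper from a string instead of a file.
void checkReadPairsSketch() {
    istringstream input("great 1.5\nterrible -2.25\n");
    vector<string> words;
    vector<double> numbers;
    readPairsSketch(input, words, numbers);
    assert(words.size() == 2 && numbers.size() == 2);
    assert(words[1] == "terrible" && numbers[1] == -2.25);
}
```

Because the parameter is an istream reference, the same function accepts a file input stream or a string stream, which makes it easy to try on small inputs.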

### The readReview Function

This is a function that will read in words from a review text file and store each word in a vector of string variables.

#### Description

// Reads in a review from an input stream and stores each individual word from the review
// into the vector 'reviewWords', in the same order they appeared in the input.
// PARAMETERS:
//   input - An input stream from which the review is read. For this project, we assume the
//           input stream is a file input stream, with words separated by whitespace.
//   reviewWords - An "output parameter", passed by reference, into which the review words are stored.
void readReview(istream &input, vector<string> &reviewWords) {
// TODO: Write an implementation for this function!
}


#### Algorithm

This function is passed a filestream connected to a text file and an empty vector of string variables. The vector of strings gets “filled up” by the words in the text file.

Runestone includes two common patterns that are particularly applicable for this function:

• Strings, Streams, and I/O - Reading Until the End
• Vectors - “Fill As You Go”

Review the examples in Runestone for these patterns and consider how to adapt the examples for what you need to do here.
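One possible shape for "reading until the end" is sketched below. The names readWordsSketch and checkReadWordsSketch are hypothetical, and the sketch relies only on the fact that the >> operator skips whitespace and fails once no more words remain.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

using namespace std;

// Hypothetical helper (not part of reviews.h): appends every
// whitespace-separated word from the stream to 'words', in order.
void readWordsSketch(istream &input, vector<string> &words) {
    string word;
    // The extraction is the loop condition, so the loop ends when the read fails.
    while (input >> word) {
        words.push_back(word);
    }
}

// Small self-check using a string stream in place of a review file.
void checkReadWordsSketch() {
    istringstream input("The room was GREAT, really!\n");
    vector<string> words;
    readWordsSketch(input, words);
    assert(words.size() == 5);
    assert(words[3] == "GREAT,");  // punctuation and case are left untouched here
}
```

Note that the words are stored exactly as they appear, so any cleanup (lowercasing, punctuation removal) is left to a later step.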

### The wordWeight Function

This function determines the weight of a word in one of the hotel reviews. If the word matches one of the keywords, then the function returns the value of the keyword’s weight; otherwise, the function returns 0.0.

#### Description

// Returns the weight of a given word by looking it up in the provided vectors.
// The keywords and their corresponding weights are provided as vector parameters.
// It is assumed that these are parallel vectors, so that weights[i] is the weight of keywords[i].
// If a word does not appear in the keywords vector, its weight is zero.
// PARAMETERS:
//   word - The word to be looked up
//   keywords - A vector containing all keywords.
//   weights - A vector containing weights corresponding to each keyword.
double wordWeight(const string &word, const vector<string> &keywords, const vector<double> &weights) {
// TODO: Write an implementation for this function!
}


#### Algorithm

This is a function that is passed one word from the vector that represents a hotel review. The function is also passed the parallel vectors that contain the keywords and their corresponding weights.

The function checks the review’s word against each of the keywords to see if they match. If a match is found, the function returns the weight of the keyword that matches the review word. If no match is found, the function should return 0.0 for the word’s weight.

Runestone includes two common patterns that are particularly applicable for this function:

• Vectors - Searching for a Value
• Vectors - Accessing Parallel Vectors

Review the examples in Runestone for these patterns and consider how to adapt the examples for what you need to do here.
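The search described above might look something like the following sketch. The name lookupWeightSketch is hypothetical, and the size check at the top is an extra safeguard added for illustration rather than something the specification asks for.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

using namespace std;

// Hypothetical helper (not part of reviews.h): linear search through
// parallel vectors. Returns the weight paired with 'word', or 0.0 if absent.
double lookupWeightSketch(const string &word, const vector<string> &keywords,
                          const vector<double> &weights) {
    // Parallel vectors only make sense when they have the same length.
    assert(keywords.size() == weights.size());
    for (size_t i = 0; i < keywords.size(); ++i) {
        if (keywords[i] == word) {
            return weights[i];  // same index in the parallel vector
        }
    }
    return 0.0;  // no keyword matched
}
```

Returning from inside the loop ends the search at the first match, and the final return handles the case where the loop finishes without finding one.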

### The reviewScore Function

This function calculates the score of a review. The review’s score is the sum of the weights of all the words in the review, so the wordWeight function will be helpful here!

#### Description

// Computes and returns the overall score for a review. This is the sum of the weights of
// the individual words in the review. Note that a word may appear more than once in the review,
// and if this happens its weight is added in multiple times as well. The keywords and their
// corresponding weights are provided as vector parameters. It is assumed that these are parallel
// vectors, so that weights[i] is the weight of keywords[i]. If a word does not appear in the
// keywords vector, its weight is zero.
// HINT: Make a copy of the reviewWords vector using a separate variable. Then, call the
//       preprocessReview() function on the copy. Having a preprocessed copy of the words
//       will allow you to compare against the keywords.
// PARAMETERS:
//   reviewWords - A vector containing the individual words in the review.
//   keywords - A vector containing all keywords.
//   weights - A vector containing weights corresponding to each keyword.
double reviewScore(const vector<string> &reviewWords, const vector<string> &keywords, const vector<double> &weights) {
// TODO: Write an implementation for this function!
}


#### Algorithm

This function is passed three parameters:

• a vector of strings representing a review,
• a vector of strings representing the keywords, and
• a vector of doubles representing the keyword weights.

With these parameters, the function should:

1. Make a copy of the review to work with in this function, so that you keep the original version to potentially work on later.
2. Preprocess the copy of the review to standardize the text and make it easier to search for keywords. Important! Call the preprocessReview helper function already written for you to do the preprocessing. Look at the description of preprocessReview to understand how to call it and what it does.
3. Iterate through each word of the preprocessed review, get its weight using the wordWeight function, and add the word’s weight to a running total score.
4. After all words are processed, the function returns the total score.

For an example of the process of scoring the review, refer to the keyword search process described earlier. Note that in this case, words which are not identified as keywords will simply have a weight of 0.0 returned from wordWeight, so that it is safe to add them in without affecting the overall score.

Runestone includes a common pattern that is particularly applicable for this function:

• Vectors - Using an Accumulator

Review the example in Runestone for this pattern and consider how to adapt the example for what you need to do here.
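The copy, preprocess, and accumulate steps can be sketched as follows. All three function names are hypothetical. In particular, lowercaseAllSketch is only a simplified stand-in for preprocessReview (it lowercases but does not remove punctuation or replace numbers), and weightOfSketch is a compact stand-in for wordWeight, included so the block compiles on its own.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

using namespace std;

// Simplified stand-in for preprocessReview: lowercases every word in place.
// (The real function also strips punctuation and replaces numbers.)
void lowercaseAllSketch(vector<string> &words) {
    for (size_t i = 0; i < words.size(); ++i) {
        for (size_t j = 0; j < words[i].size(); ++j) {
            unsigned char c = static_cast<unsigned char>(words[i][j]);
            words[i][j] = static_cast<char>(tolower(c));
        }
    }
}

// Compact stand-in for wordWeight: weight of 'word', or 0.0 if not a keyword.
double weightOfSketch(const string &word, const vector<string> &keywords,
                      const vector<double> &weights) {
    assert(keywords.size() == weights.size());
    for (size_t i = 0; i < keywords.size(); ++i) {
        if (keywords[i] == word) return weights[i];
    }
    return 0.0;
}

// Accumulator pattern: sums the weights of all words in a cleaned copy.
double scoreSketch(const vector<string> &reviewWords, const vector<string> &keywords,
                   const vector<double> &weights) {
    vector<string> cleaned = reviewWords;  // copy, so the original is untouched
    lowercaseAllSketch(cleaned);
    double total = 0.0;                    // running total starts at zero
    for (size_t i = 0; i < cleaned.size(); ++i) {
        total += weightOfSketch(cleaned[i], keywords, weights);
    }
    return total;
}
```

A word that appears twice contributes its weight twice, because the loop visits every word in the review rather than every keyword.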

When working with complex programs made up of several different functions, it’s important to be able to test each function individually to make sure it is working correctly on its own. This is called unit testing. A strategy for implementing a set of unit tests is to write a separate main function (in a different file) that intentionally calls each function one at a time on a variety of inputs and confirms that the outputs from the functions match the expected correct answer.

A few sample unit tests can be found in the unit_tests.cpp file provided with the project. The comments included in that file describe the way the unit testing process works. We highly encourage you to use these samples as a starting point and write additional unit tests of your own.

In order to compile and run the unit tests, use the following commands:

g++ -std=c++11 -Wall -pedantic unit_tests.cpp reviews.cpp -o unit_tests

./unit_tests

Note the difference from the compilation command for the regular program. We’ve basically kept all the review functions from reviews.cpp, but we’ve swapped in unit_tests.cpp for evaluateReviews.cpp, which means the main function containing the tests will be used instead.

You do not need to turn in the unit_tests.cpp file to the autograder.
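To give a feel for what an individual unit test can look like, here is a small sketch. It uses assert from <cassert>, which is only one possible mechanism, and the provided unit_tests.cpp may work differently. The function under test (countWordsSketch) and the other names are hypothetical examples, not functions from the Reviews library.

```cpp
#include <cassert>
#include <cmath>
#include <istream>
#include <sstream>
#include <string>

using namespace std;

// Hypothetical function under test: counts whitespace-separated words.
int countWordsSketch(istream &input) {
    int count = 0;
    string word;
    while (input >> word) {
        ++count;
    }
    return count;
}

// Hypothetical test helper: compares doubles with a small tolerance, since
// sums of doubles may differ from the expected value by rounding error.
bool almostEqualSketch(double actual, double expected, double tolerance = 1e-9) {
    return fabs(actual - expected) <= tolerance;
}

// One unit test: several inputs, each with a known expected answer.
void testCountWordsSketch() {
    istringstream empty("");
    assert(countWordsSketch(empty) == 0);      // edge case: no words at all

    istringstream several("clean  room\nnice view");
    assert(countWordsSketch(several) == 4);    // extra spaces and newlines
}
```

Functions that take an istream reference can be tested with an istringstream, so a test can supply its input directly in the code instead of depending on a file.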

The second task in this project is to write a driver program that will evaluate the hotel reviews as truthful, deceptive, or uncategorized, and create a summary report of the analysis. The driver program will use the functions in the Reviews library to:

• Read in the keywords and weights from a file
• Read in and evaluate hotel reviews from several different files
• Write a summary report to report.txt

### Description of evaluateReviews.cpp

The driver program is written in the evaluateReviews_starter.cpp file. Don’t forget to remove _starter from the filename before you try to compile with this file!

// Add any #includes for C++ libraries here.
// We have already included iostream as an example.
#include <iostream>

// The #include adds all the function declarations (a.k.a. prototypes) from the
// reviews.h file, which means the compiler knows about them when it is compiling
// the main function below (e.g. it can verify the parameter types and return types
// of the function declarations match the way those functions are used in main() ).
// However, the #include does not add the actual code for the functions, which is
// in reviews.cpp. This means you need to compile with a g++ command including both
// .cpp source files. For this project, we will be using some features from C++11,
// which requires an additional flag. Compile with this command:
//     g++ -std=c++11 evaluateReviews.cpp reviews.cpp -o evaluateReviews
#include "reviews.h"

using namespace std;

const double SCORE_LIMIT_TRUTHFUL = 3;
const double SCORE_LIMIT_DECEPTIVE = -3;

int main(){

// TODO: implement the main program

}


### Algorithm

This is the general algorithm for the driver program (the main function in evaluateReviews.cpp):

1. Open a file input stream for the keywordWeights.txt file.
    1. If the file cannot be opened, output "Error: keywordWeights.txt could not be opened." to cout.
    2. Then use return 1; to exit the main function (recall that a nonzero return value from main reports an error).
2. If the keyword weights file was opened, read the keywords and their weights into parallel vectors. (Which Reviews library function would be helpful here?)
3. For each hotel review:
    1. Create the filename (e.g. review00.txt). (Which Reviews library function would be helpful here?)
    2. Open a filestream to the file.
    3. Read each word of the review into a vector of string variables. (Which Reviews library function would be helpful here?)
    4. Calculate the review’s score. (Which Reviews library function would be helpful here?)
    5. Determine the review’s category:
        • truthful: score > 3.0
        • deceptive: score < -3.0
        • uncategorized: otherwise
    6. Track the review with the highest score and the review with the lowest score.
4. Write out a summary of the truthfulness and deceptiveness of the reviews to a file named report.txt.

You know that you have processed all the reviews once you try to open a file input stream for the next review file and it does not open successfully (because the file doesn’t exist!). You don’t need to print an error message in this case - simply have your program stop trying to read more reviews.
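The "keep going until a file fails to open" idea can be sketched like this. Both function names are hypothetical. numberedFilenameSketch is a stand-in for makeReviewFilename that takes a prefix, so the sketch can be tried without touching the real review files, and it assumes two-digit numbering from 00 to 99.

```cpp
#include <cassert>
#include <fstream>
#include <string>

using namespace std;

// Hypothetical stand-in for makeReviewFilename, with a configurable prefix.
// ASSUMPTION: two-digit numbering, so 0 <= number <= 99.
string numberedFilenameSketch(const string &prefix, int number) {
    assert(number >= 0 && number <= 99);
    string digits = to_string(number);
    if (digits.size() < 2) {
        digits = "0" + digits;  // pad single digits with a leading zero
    }
    return prefix + digits + ".txt";
}

// Counts consecutively numbered files, stopping at the first one that
// cannot be opened (which signals that there are no more files).
int countSequentialFilesSketch(const string &prefix) {
    int count = 0;
    while (count < 100) {
        ifstream input(numberedFilenameSketch(prefix, count));
        if (!input.is_open()) {
            break;  // no error message: this is the normal way to finish
        }
        // A real driver would read and score the review here.
        ++count;
    }
    return count;
}
```

Declaring the ifstream inside the loop gives each iteration a fresh stream, so a failed state from one file never carries over to the next.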

You are guaranteed that there will always be at least one review and that there will be no gaps in the numbering of the hotel reviews. There will be no more than 100 hotel reviews.
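The categorization and highest/lowest tracking steps could be written as small helper functions along these lines. The function names are hypothetical, and storing all scores in a vector is just one option; keeping a running highest and lowest while reading the reviews would work as well.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

using namespace std;

const double SCORE_LIMIT_TRUTHFUL = 3;
const double SCORE_LIMIT_DECEPTIVE = -3;

// Hypothetical helper: maps a score to its category name.
string categorizeSketch(double score) {
    if (score > SCORE_LIMIT_TRUTHFUL) {
        return "truthful";
    }
    if (score < SCORE_LIMIT_DECEPTIVE) {
        return "deceptive";
    }
    return "uncategorized";  // includes scores exactly at either limit
}

// Hypothetical helper: index of the largest score (vector must be non-empty).
size_t indexOfHighestSketch(const vector<double> &scores) {
    assert(!scores.empty());
    size_t best = 0;
    for (size_t i = 1; i < scores.size(); ++i) {
        if (scores[i] > scores[best]) best = i;
    }
    return best;
}

// Hypothetical helper: index of the smallest score (vector must be non-empty).
size_t indexOfLowestSketch(const vector<double> &scores) {
    assert(!scores.empty());
    size_t best = 0;
    for (size_t i = 1; i < scores.size(); ++i) {
        if (scores[i] < scores[best]) best = i;
    }
    return best;
}
```

Starting the search from index 0 is safe here because at least one review is guaranteed to exist.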

Make sure your summary of the reviews is being written to the report.txt file, not cout. No output should be written to cout except for the error message printed if the keywords file cannot be opened.

You may write additional helper functions in evaluateReviews.cpp! (See the Additional Helper Functions section for some suggestions.) However, do not add additional functions to reviews.cpp that you intend to use in evaluateReviews.cpp, since this would require modifying the reviews.h file to contain their prototypes as well, which is prohibited for this project. Instead, put your own additional helper functions into the evaluateReviews.cpp file.

### Details of the Summary Report

There is one output file for this program: report.txt. This file contains a summary report of your hotel review analysis. The summary report should contain:

• A header line, reading exactly review score category
• A line of information for each review, including the following (separated by spaces):
    • Review number (0, 1, 2, etc.)
    • The overall score of the review
    • The categorization (truthful, deceptive, or uncategorized)
• [an extra blank line]
• The total number of reviews analyzed
• The total number of truthful reviews
• The total number of deceptive reviews
• The total number of uncategorized reviews
• [an extra blank line]
• The number of the review with the highest score (You may assume there are no “ties”.)
• The number of the review with the lowest score (You may assume there are no “ties”.)

### Test Case for Evaluating Reviews

Test cases are very important for this project. You should first use unit tests for your helper functions so you know they are working correctly. Once you have verified that your helper functions work correctly, continue developing your code to eventually create the report.txt file that summarizes the analysis.

You are provided with 20 hotel reviews as test data. Reviews 0-9 are known to be truthful and reviews 10-19 are known to be deceptive, although your algorithm won’t be able to correctly categorize ALL of the reviews. The sample_report.txt file contains the correct output for your report.txt file. Make sure you can recreate what is in sample_report.txt exactly!

review score category
0 14.88 truthful
1 3.33 truthful
2 15.68 truthful
3 4.43 truthful
4 4.14 truthful
5 11.29 truthful
6 20.61 truthful
7 -2.89 uncategorized
8 2.71 uncategorized
9 11.93 truthful
10 0.03 uncategorized
11 -13.06 deceptive
12 -3.66 deceptive
13 -8.46 deceptive
14 5.18 truthful
15 -18.68 deceptive
16 -17.88 deceptive
17 -21.61 deceptive
18 -11.25 deceptive
19 -15.08 deceptive

Number of reviews: 20
Number of truthful reviews: 9
Number of deceptive reviews: 8
Number of uncategorized reviews: 3

Review with highest score: 6
Review with lowest score: 17
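A report with the layout shown above could be produced by a helper along the lines of the sketch below. The function names are hypothetical, and the sketch assumes that scores are written with the stream's default number formatting; treat sample_report.txt as the authority on the exact output. Writing to an ostream reference means the same helper works with an ofstream for report.txt or with a string stream for quick checks.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

using namespace std;

// Hypothetical helper: writes the summary report to any output stream.
// ASSUMPTIONS: scores[i] and categories[i] describe review i, there is at
// least one review, and default number formatting is acceptable.
void writeReportSketch(ostream &out, const vector<double> &scores,
                       const vector<string> &categories) {
    assert(scores.size() == categories.size() && !scores.empty());
    int truthful = 0, deceptive = 0, uncategorized = 0;
    size_t highest = 0, lowest = 0;

    out << "review score category" << endl;
    for (size_t i = 0; i < scores.size(); ++i) {
        out << i << " " << scores[i] << " " << categories[i] << endl;
        if (categories[i] == "truthful") ++truthful;
        else if (categories[i] == "deceptive") ++deceptive;
        else ++uncategorized;
        if (scores[i] > scores[highest]) highest = i;
        if (scores[i] < scores[lowest]) lowest = i;
    }
    out << endl;
    out << "Number of reviews: " << scores.size() << endl;
    out << "Number of truthful reviews: " << truthful << endl;
    out << "Number of deceptive reviews: " << deceptive << endl;
    out << "Number of uncategorized reviews: " << uncategorized << endl;
    out << endl;
    out << "Review with highest score: " << highest << endl;
    out << "Review with lowest score: " << lowest << endl;
}

// Convenience wrapper for checking the report text without creating a file.
string reportAsStringSketch(const vector<double> &scores,
                            const vector<string> &categories) {
    ostringstream out;
    writeReportSketch(out, scores, categories);
    return out.str();
}
```

In a driver program, passing an ofstream opened on report.txt as the first argument would send the same text to the file instead.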