ENGR 101 | University of Michigan
Project 3: Spaceport Reviews
Project Checkpoint and Project Partnership Registration Deadline: Tuesday, November 5, 2024
Project Due: Tuesday, November 12, 2024
Late Submissions Due: Tuesday, November 19, 2024
Engineering intersections: Data Science/Industrial & Operations Engineering/Computer Science
Implementing this project provides an opportunity to use C++ file I/O, functions, strings, and vectors.
The autograded portion of the final submission is worth 100 points, and the style and comments grade is worth 10 points, for a total of 110 points.
You may work alone or with a partner. Please see the syllabus for partnership rules.
Educational Objectives and Resources
This project has two aspects. First, this project is an educational tool within the context of ENGR 101, designed to help you learn certain programming skills. This project also serves as an example of a project that you might be given at an internship or job in the future; therefore, we have structured the project description in way that is similar to the kind of document you might be given in your future professional capacity.
Educational Objectives
The purpose of this project is to help you strengthen your skill at working with a large set of textual data, including writing and calling helper functions to help process the data. The C++ programming language is very good at processing quickly and efficiently processing text data, so it is a good choice for this type of work.
This project uses a simple type of Natural Language Processing (NLP). A short introduction to NLP, including the approach used in this project, is in the next section.
Engineering Concepts - Natural Language Processing (NLP)
Natural Language Processing (NLP) is a field at the intersection of computer science and linguistics that designs computer programs that can interpret, categorize, and act on human languages. Human languages are incredibly complex and flexible, and designing a computer program that can understand and respond to even a subset of a single language is notoriously difficult. However, researchers have continually pushed forward in this field, and now NLP has wide-ranging applications including translation between different human languages, virtual assistants, and automated feedback on writing assignments in large courses.
Keyword Search
One of the most basic forms of NLP is to search through a body of text looking for keywords, and condition a program’s behavior on the presence or frequency of the keywords. Keywords may also have a numeric weight, which determines how significant they are. For example, say you are interested in determining the type of pet a person has, based on a written description. So, you ask a bunch of people to describe their pets. Some people tell you what kind of pet they have (e.g. duck, rabbit, horse, cat, dog, snake, etc.), but some don’t. Some of them give you sort of a “review” of their pet, and so you need to figure out what kind of pet they have based on a set of keywords.
Let’s say you want to determine whether someone has a duck for a pet. You could analyze keywords of pet descriptions that you have verified are about ducks and not-ducks, and then assign a weight to each keyword based on how strongly it is associated with ducks as pets (Table 1). Positive values are associated with ducks; negative values are associated with not-ducks. The larger the values (regardless of positive/negative), the stronger the association.
You could then read in another pet’s description and find occurrences of the keywords in the list. If the keyword occurs, add its weight to a total running “score” for the pet description. If the total score is positive, it is categorized as a duck. If the total score is negative, it is categorized as a not-duck. See Table 2 for a few examples of this.
Table 1. Example of keywords and weights for describing ducks as pets.
Keyword | Weight | Keyword | Weight |
beak | 1.2 | large | -2.5 |
egg | 1.5 | legs | 0.5 |
eyes | 0.3 | loud | -0.1 |
feather | 0.7 | small | 0.6 |
fins | -2.5 | tail | 0.4 |
fluffy | -0.3 | teeth | -0.5 |
fur | -1.0 | water | 1.4 |
gills | -3.2 | wings | 4.2 |
Table 2. Example analysis of pet descriptions using the set of keywords and weights for describing a duck. (In case you’re wondering, Pet #3 is a Great Malay Argus.)
Pet #1 | |
Description | |
I have to make sure to clean her tank every day, so the water doesn't get dirty. I love to watch her gills move as she breathes and her tail and fins move as she swims. Sometimes, though, I wish she had legs so I could take her for a walk. | |
Keywords Found | Weights |
water | 1.4 |
gills | -3.2 |
tail | 0.4 |
fins | -2.5 |
legs | 0.5 |
Total Score | -3.4 |
Categorization | Not Duck |
Pet #2 | |
Description | |
There are fluffy feathers everywhere all the time! He chipped his beak on a large rock the other day. I think I startled him, because his wings flew wide, and then he lost his balance and that's when he chipped his beak. | |
Keywords Found | Weights |
fluffy | -0.3 |
feathers | 0.7 |
beak | 1.2 |
large | -2.5 |
wings | 4.2 |
beak | 1.2 |
Total Score | 4.5 |
Categorization | Duck |
Pet #3 | |
Description | |
This bird is insane. Why did no one tell me it would be so LOUD?? Also, it grew enormous. There should be a warning when you buy it that this is a large bird. Not some cute small fluffy thing you can hold on your finger. :( | |
Keywords Found | Weights |
loud | -0.1 |
large | -2.5 |
small | 0.6 |
fluffy | -0.3 |
Total Score | -2.3 |
Categorization | Not Duck |
As shown here, the choice of keywords and weights is crucial. Therefore, you must make sure that your keywords and weights come from a verified set of data and that they have been analyzed properly. (For more background on this, see: Rozado, David. “Wide Range Screening of Algorithmic Bias in Word Embedding Models Using Large Sentiment Lexicons Reveals Underreported Bias Types.” PloS One, vol. 15, no. 4, PUBLIC LIBRARY SCIENCE, 21/4/2020, p. e0231189, doi:10.1371/journal.pone.0231189)
Note: An NLP program that categorizes things based on a keyword search is only as good as its ability to correctly predict things based on those keywords. The percentage of the predictions that are correct is one simple measure of how well an NLP program is working. A perfect NLP program would get 100% of its predictions correct. An NLP program that is no better than random chance would get 50% of its predictions correct.
Project Roadmap
This is a big picture view of how to complete this project. Most of the pieces listed here also have a corresponding section later in this document that provides more detail.
- Read through the project specification (DO THIS FIRST)
- Read the contents on the left so you understand the organization of these specs.
- Understand the data sources/files and how to use them; go to office hours if you do not.
- Understand the project tasks at a high level: what functions do what parts of the tasks? What tasks does the driver program (your
int main()
function) do? Go to office hours if you are not sure about anything. - Understand the algorithms provided for the different project tasks; go to office hours if you have any questions.
- Sketch out a plan for how you want to write your program to implement the project tasks. Show your plan to course staff during office hours so we can help you more efficiently!
- Prepare your workspace
- Download all data sources files/functions
- Download all provided starter code
- Download all provided unit tests and unit test programs
- Put all of this stuff in the same folder/directory on your computer (otherwise your program won’t be able to run)
- Register your partnership on the Autograder (if you are going to work with a partner)
- PROJECT CHECKPOINT: Implement and test the
reviews.cpp
library, including:- Correctly implementing the
readKeywordWeights
function - Correctly implementing the
readReview
function - Correctly implementing the
wordWeight
function - Correctly implementing the
reviewScore
function - Using the
unit_tests.cpp
program to test these functions for correct behavior
- Correctly implementing the
- Implement and test the driver program, including:
- Correctly implementing the
evaluateReviews.cpp
program - Verifying that the your
report.txt
file exactly matches thesample_report.txt
file
- Correctly implementing the
- Double check style and commenting. See the Submission and Grading section for information on style grading.
- Submit all files to the autograder.
Suggested Project Timeline
Below is a suggest project timeline that you may use. You can adjust the dates around any commitments or classwork that you have going on in other courses.
Date | What to have done by this day |
Friday, November 1 |
|
Tuesday, November 5th |
|
Friday, November 8th |
|
Monday, November 11th |
|
Tuesday, November 12th |
|
Things to Know Before You Get Started
Tips and Tricks
Here are some tips to (hopefully) reduce your frustration on this project:
-
Make sure to place all of your data sources (the
.txt
files) in the same directory as the.cpp
files and thereviews.h
file. -
Read the Project 3 FAQ Post on Piazza. This post has many common questions from students, so read these questions and answers before you start programming.
-
Test each step of your program before you move on to the next step. For example, make sure you get the correct score of one review before you try to write a loop that goes through all of the reviews. Writing a program is process of continuous revision. It’s better (and easier in the long run) to start small, verify the program is working correctly, and then continue to add small steps as you go.
-
Use
cout
statements to check the values of a review’s score, a review’s category, and variables that you are using to track things in your program. Usecout
statements to check the value that a function returns to make sure it’s working correctly. Usecout
statements everywhere! Just remember to delete them once you no longer need them so that your program doesn’t print out unnecessary information. -
The
reviews.cpp
file includes starter code for the four functions you need to write, but there are also some additional helper functions that have already been written for you. Don’t forget to use these helper functions when implementing your functions! -
There are four helper functions that you are required to write (these are the functions in the
reviews.cpp
file). You can absolutely write more helper functions of your own, though! If you do, place these helper functions in theevaluateReviews.cpp
file. (Note: This is just because of how the autograder is set up. Normally, you would have complete control over your library of custom functions, but we have to make some concessions when have to somehow grade hundreds of students’ code!)
Writing to a File in a Loop
This project requires you to write to a file from within a loop. Sometimes, during the course of development, an infinite loop may slip through your careful debugging and cause a file to grow significantly larger than you would want. Here are a few hints to detect if this is happening:
-
Your program takes longer than a second to run. This project shouldn’t take more than a second (two at the absolute most).
-
Your
report.txt
file takes up a lot of memory.
If your program takes longer than 2 seconds to run, type CTRL+C to cancel the running program.
If you are on your own computer, check how big the report.txt
file is. If it is more than, say, 30 Mb, then you have an infinite loop. Delete report.txt
.
If you are on a CAEN machine, navigate into the directory where your Project 3 files are located, type ls -lh
at the linux terminal, and check the size of the files, like this:
bash-4.2$ ls –lh
total 1.1G
...
-rwxr-xr-x. 1 your_uniqname users 9.4K Nov 18 18:11 evaluateReviews.cpp
-rw-r--r--. 1 your_uniqname users 223 Nov 15 09:22 readKeywordWeights.cpp
-rw-r--r--. 1 your_uniqname users 356 Nov 16 12:56 readReview.cpp
-rw-r--r--. 1 your_uniqname users 1.0G Nov 18 18:20 report.txt
-rw-r--r--. 1 your_uniqname users 447 Nov 16 15:04 reviewScore.cpp
...
The report.txt
file listed above has a file size of 1.0Gb (yikes!) and should be deleted. To delete the file, use the rm
command:
bash-4.2$ rm report.txt
Passing Filestreams to Functions
One of the required functions needs a filestream passed to it, and you may potentially write a helper function or two that also requires a filestream to be passed in. Remember that filestreams are linked to a specific file; therefore you need to pass the filestreams by reference (not pass by value) so that the function has access to the file that was already opened by your program.
Undefined Values for Variables
Using a variable that has been declared, but does not yet have a value assigned to it, can cause unexpected behavior from your program. Similarly, if you write a function that has a return variable (e.g. an int
, double
, bool
, etc. function), and you try to return a variable that does not have a value – or you forget a return statement entirely – then you can also see unexpected behavior. Some compilers will automatically assign a zero to some types of data, but others will not; therefore, you should always properly assign/initialize values to your variables. If you see that your program outputs big weird numbers on the autograder, it’s likely due to an uninitialized variable.
Finding the Minimum/Maximum of a Set of Data
When searching, or keeping track of, the maximum/minimum of a set of data (such as the score of the hotel reviews), you generally are comparing the current value to whatever is the “current highest” or the “current lowest” value. However, for the first value, you don’t yet have anything to compare it to. A common best practice is to initialize the “current highest” and/or “current lowest” value to be the first value in your dataset. This way, no matter what the actual values in your dataset are, you will always be able to find the maximum or minimum value.
Additional Helper Functions
There are four required helper functions for this project that you have to write plus two helper functions that are provided to you; however, you can (and should!) look to abstract other chunks of code into helper functions. Helper functions make your code easier to read and understand and easier to debug. Some ideas for other helper functions are functions that would:
- Categorize a review
- Find the review with the highest score
- Find the review with the lowest score
- etc.
It is up to you how you want to abstract portions of your code; there is no “right” answer and no “wrong” answer. The only reason we’re requiring a few specific helper functions for this project is to enforce practice with abstraction and function writing. In general, you can design your functions however you like!
Pass by Value vs. Pass by Reference
When writing your own helper functions, always consider whether to pass parameters by value or by reference (including const
reference). If you are unsure, refer back to the Homework assignment that had the decision tree about pass by value vs. pass by reference.
Submission and Grading
This project has three deliverables: reviews.h
, reviews.cpp
, and evaluateReviews.cpp
. See the Deliverables section for more details.
After the due date, the Autograder portion of the project will be graded in two parts. First, the Autograder will be responsible for evaluating your submission. Second, one of our graders will evaluate your submission for style and commenting and will provide a maximum score of 10 points. Thus, the maximum total number of points on this project is 110 points.
You should still submit reviews.h
, even though you aren’t supposed to modify it - the autograder will double check that the file is the same to verify you haven’t accidentally changed it, since this could mess up the rest of your code.
Submitting Prior to the Project Deadline
Submit the .cpp
files and .h
file to the Autograder for grading. You do not have wait until you have all of the files ready to submit before you submit to the Autograder for the first time. In fact, we recommend that as you complete tasks for this project, you should continually submit those files to the autograder for feedback as you work on the project. You can submit a subset of files and get feedback on the test cases related to those files. However, to receive full credit, you need to submit all files and pass all test cases within a single submission.
The autograder will run a set of public tests - these are the same as the test scripts provided in this project specification. It will give you your score on these tests, as well as feedback on their output.
The autograder also runs a set of hidden tests. These are additional tests of your code (e.g. special cases). You still see your score on these tests, and you will receive some feedback on any cases that your code does not pass.
You are limited to 5 submissions on the Autograder per day. After the 5th submission, you are still able to submit, but all feedback will be hidden other than confirmation that your code was submitted. The autograder will only report a subset of the tests it runs. It is up to you to develop tests to find scenarios where your code might not produce the correct results.
You will receive a score for the autograded portion equal to the score of your best submission.
Your latest submission with the best score will be the code that is style graded. For example, let’s assume that you turned in the following:
Submission # | 1 | 2 | 3 | 4 |
Score | 50 | 100 | 100 | 75 |
Then your grade for the autograded portion would be 100 points (the best score of all submissions) and Submission #3 would be style graded since it is the latest submission with the best score.
Please refer to the syllabus for more information regarding partner groups and general information about the Autograder.
Here is the breakdown of points for style and commenting (max score of 10 points):
-
2 pts - Each submitted file has Name, Partner Uniqname (or “none”), Lab Section Number, and Date Submitted included in a comment at the top
-
2 pts - Comments are used appropriately to describe your code (e.g. major steps are explained)
-
2 pts - Indenting and white space are appropriate (including functions are properly formatted)
-
2 pts - Variables are named descriptively
-
2 pts - Other factors (Variable names aren’t all caps, etc…)
Submitting after the project deadline
If you need to submit your project work after the deadline, you can submit to the “Late Submission” project assignment on the Autograder. The late submission assignment includes all of the same test cases as the original assignment, but the points have been adjusted down a small amount per the syllabus’ flexible deadline policy.
Your project score at the end of the semester will be whichever is the higher score between the original project assignment and the late submission assignment. You will never be penalized for submitting to the late submission version of the project.
Autograder Details
Test Case Descriptions and Error Messages
Each of the test cases on the Autograder is testing for specific things about your code. Often, it’s checking to see if your programs can handle “special cases” of data, such as: a different number of reviews, a different set of keyword weights, the reviews are all truthful, the reviews are all deceptive, etc.
Each of the test cases on the Autograder has a description of what it’s checking for. If your program fails a test case, the Autograder will sometimes be able to give you some advice on how to go about debugging your code. It’s like you have a friendly GSI or IA giving you immediate help!
Checking If report.txt
Is Correct
Within the Autograder tests, you can click to expand the “Check report.txt” test and see a window that compares the expected output for the file (on the left) to your output file (on the right). The Autograder is very particular about having the output match perfectly. This means every character and every whitespace must match – so beyond the score values and category, be sure each word in your report.txt
is spelled correctly, has matching spaces and matching new lines. Lines that don’t correspond with the excpected output will be highlighted, and you can click “Show Whitespace” to see the right spacing and new line placement.
Project Overview
The purpose of this project is to give you a chance to practice file input and output (file I/O) with file streams and to practice writing your own functions. It also gives more practice using strings and vectors.
Background and Motivation
Proxima b’s new spaceport has been up and running for 6 months. The company that owns the spaceport wants to analyze the online reviews of the spaceport so that they can improve their customers’ experiences. However, they suspect that only some of the reviews are from actual customers, and that the others are fake! Unfortunately, it is difficult for humans to distinguish between truthful and deceptive reviews – studies show success rates of approximately 50-60%, which is not much better than guessing. It would also be very time consuming to examine each review by hand.
Instead, the company wants to use an automated text-classification algorithm based on machine learning and natural language processing techniques. A challenge with any machine learning approach is finding high-quality training data. Training data is data that has been independently analyzed and verified, so you can use the training data to check whether a new program is working correctly or not.
Fortunately, a dataset of carefully verified truthful and deceptive reviews is available, thanks to an ancient study from 2011 EY (Earth Years) that investigated reviews of Chicago hotels. Based on the data available in this study, data scientists working for the company have developed a set of keywords that indicate either truthfulness or deception.
The Earthlings who originally studied the hotel reviews also developed a website where people could put in a review and the review would be evaluated for truthfulness or deceptiveness. Alas, this site is no longer accessible, but old news articles of the discovery have been recovered, one of which is avaliable here. This article shows a high level idea of what the company wants to do.
Your Job
Your job is to write a “proof of concept” program that evaluates reviews for truthfulness or deception. This “proof of concept” program will use the hotel reviews (not the spaceport reviews) because you want to work with the training data first in order to verify that your program works correctly. Your program should evaluate each of the hotel reviews and categorize them as truthful or deceptive using an NLP Keyword Search. The program should then print out a summary report of the reviews and identify the review with the highest score (the “most truthful review of the dataset”) and identify the review with the lowest score (the “most deceptive review of the dataset”).
The next step in this process would be to apply the program you write to evaluate the spaceport reviews. But that is a hypothetical scenario which we will NOT be doing for this project. You are ONLY looking at hotel reviews.
Data Sources/Files
These sections describe the different sets of data you have available for implementing and testing your programs for this project. Make sure you understand the different formats and layouts of the data in the data files before you start to work with the data itself.
Keywords and Weights File
The keywordWeights.txt
file (click filename to download) contains a list of keywords and weights that resulted from analyzing an initial set of verified hotel reviews as training data. Each keyword will be on its own line, and there will be no multi-word phrases (e.g. “good” and “view” instead of “good view”). Here is a schematic of what the keywordWeights.txt
file looks like:
<keyword 1> <score 1>
<keyword 2> <score 2>
<keyword 3> <score 3>
<keyword 4> <score 4>
<keyword 5> <score 5>
<...>
Read in all the keywords and weights in the file, no matter how many keywords there are, and store them in parallel vectors of string
variables and double
variables.
Hotel Review Files
The reviewFiles.zip
file (click filename to download) contains a set of 20 hotel reviews for you to use in testing your program. The hotel reviews all have a similar naming convention of review00.txt
, review01.txt
, etc. Reviews 0-9 are actually truthful; Reviews 10-19 are actually deceptive. (However, the algorithm you implement doesn’t categorize them all perfectly correctly, as you will see.)
Make sure you actually unzip the reviewFiles.zip
file to get to the individual review files. If you are on a Mac, double-click the .zip
file to automatically “unzip” the file and make a folder with a bunch of .txt
files in it; move the files to your Project 3 folder. But Windows will often let you double-click on the .zip
file and see the files but not actually unzip the files… which means your computer can’t actually access the files yet. Instead, right-click on the file and select “Extract All” to unzip/uncompress the file to actually get access to the .txt
files. If you have trouble with this, please come to office hours!
Each hotel review is stored in its own file, e.g. review18.txt
. The text is in “paragraph” form; here is a schematic of what a hotel review file looks like:
<word 1> <word 2> <word 3> <word 4> <word 5>
<word 6> <word 7> <word 8> <word 9> <word 10> <word 11>
<word 12> <word 13> <word 14> <...>
To work with a hotel review, open the file, read in each word, and store the words in a vector of string
variables.
On the Autograder, you are guaranteed that there will always be at least one review and that there will be no gaps in the numbering of the hotel reviews. There will be no more than 100 hotel reviews.
Deliverables
This project has three deliverables:
File | Description |
reviews.h |
a C++ header file for the reviews.cpp library (This file is already provided to you, but it is part of this program's set of files, so it is considered a deliverable.) |
reviews.cpp |
a C++ file that contains all of the required helper functions used in the project |
evaluateReviews.cpp |
a C++ file that contains your main() function and is the driver program file. Any helper functions that you write in addition to those should be correctly declared and defined in the evaluateReviews.cpp file. |
Starter Code and Test Programs
The starter code files contain some code that is already written for you, and you will write the rest of the code needed to implement the tasks described in the Project Task Description section. The starter code files have _starter
appended to the file’s name so that if you want to download a fresh copy of the starter code, you won’t accidentally overwrite an existing version of the file that may have some code you have written in it. Remember to remove the _starter
part of the filename so that your programs will run correctly!
Click the filenames to download the starter code:
The Reviews library has several functions that you will be using when you write your driver program in evaluateReviews.cpp
. You should carefully test the functions you write in reviews.cpp
before trying to use them in evaluateReviews.cpp
. Here is a driver program that contains some basics tests for some of the functions in reviews.cpp
(click the filename to download):
You can (and should!) add additional tests to the unit_tests.cpp
program to thoroughly understand and test all of the functions in the Reviews library. See the Testing Your Functions Section for more information about how to use the unit_tests.cpp
file.
Once you have verified that all of the functions in the Reviews library work, you can start working on the driver program that is evaluateReviews.cpp
. The driver program creats a summary report saved as report.txt
. Here is a sample report that shows what your report.txt
file should look like if your program works correctly (click the filename to download):
See the Test Case for Evaluating Reviews for more information about how to use the sample_report.txt
file.
Compiling and Running the Program
To compile the program, use the following compile command:
g++ -std=c++11 -Wall -pedantic evaluateReviews.cpp reviews.cpp -o evaluateReviews
Notes about compiling this program:
- The starter code for the project uses some features only available in C++11 (a more modern version of the language), so we need the
-std=c++11
flag. - The
-Wall
flag includes all warnings from the compiler; this will help you catch bugs. Warnings are diagnostic messages that report things in your code that are not inherently erroneous but that are risky or suggest there may have been an error (or may cause an error later on when you run the program). - The
-pedantic
flag tells the compiler to look for anything that you did that might not work on other people’s computer (including the autograder). Mostly, this will check to see if you forgot to initialize a variable because this may cause you to fail a test case on the Autograder. - Both
evaluateReviews.cpp
andreviews.cpp
are included in the compile command, but notreviews.h
. Header files are incorporated using#include
at the top of.cpp
files, but are never provided to the compile command directly.
Once compiled, the program can be run from the command line with:
./evaluateReviews
Project Task Description
There are two primary tasks for your program:
- Create a library of helper functions to assist with processing the hotel reviews
- Evaluate hotel reviews as truthful, deceptive, or uncategorized, and create a summary report of the analysis
These tasks are described in more detail in the next secions.
Task 1: Reviews Library
The first task in this project is to implement a library of several helper functions that support working with and processing the hotel reviews. A C++ library needs two files: a .h
file that contains the interface for the library, and a .cpp
file that contains the implementation of the library’s functions. Click the filenames to download the starter files for your Reviews library:
-
reviews.h
- Contains function declarations/signatures for the functions in the Reviews library. You can refer to this file for an overview of the functions in the library, but DO NOT change anything inreviews.h
. -
reviews_starter.cpp
- The actual implementations of the functions in the Reviews library. Some of the functions are written for you, and you will write the rest of the functions.
Don’t forget to remove _starter
from the filename before you try to compile with this file!
Functions in the Reviews Library
There are two functions in the Reviews library that are written for you: makeReviewFilename
and preprocessReview
. A brief description is here, and more details can be found in the comments in the code:
makeReviewFilename
- Returns the appropriate file name for a particular review number. For example,makeReviewFilename(0)
returns"review00.txt"
andmakeReviewFilename(5)
returns"review05.txt"
.preprocessReview
- Modifies a review, represented as a vector of individual words, by changing each word to lowercase letters, removing punctuation, and replacing any strings representing numbers (e.g. “1”, “7”, “100”) with the string"<number>"
.
You are responsible for writing the implementations of the remaining functions in reviews.cpp
. There are four such functions, listed briefly here and described in more detail in the following sections:
readKeywordWeights
- Reads keywords and their weights from an input stream.readReview
- Reads a review from an input stream into a vector of words.wordWeight
- Finds the weight of a given word based on the keywords and their weights.reviewScore
- Computes the score for a review by adding up the weights of its words.
These four functions are described below, and additional information is included as comments in the reviews.cpp
starter file. Read these comments and consider them part of the project specification. You will also conduct unit tests on the individual helper functions so you can ensure they are working correctly.
The readKeywordWeights
Function
This is a function that will read in the keywords and their numerical weights from an input stream. The functions stores each word in the text file to a vector of string
variables and stores the corresponding weights into a vector of double
variables – thereby creating two parallel vectors for the keywords and their weights.
Description
// Reads in keywords and corresponding weights from an input stream and stores them into
// the 'keywords' and 'weights' vectors in the same order as they appear in the file.
// NOTE: They keywords in the file have already been preprocessed (e.g. to remove punctuation),
// so you do not have to do that here.
// PARAMETERS:
// input - An input stream from which keywords and weights are read. For this project, we
// assume the input stream is a file input stream, where the file format is that
// provided in the project specification.
// keywords - An "output parameter", passed by reference, into which the keywords are stored.
// weights - An "output parameter", passed by reference, into which the weights are stored.
void readKeywordWeights(istream &input, vector<string> &keywords, vector<double> &weights) {
// TODO: Write an implementation for this function!
}
Algorithm
This function is passed a filestream connected to a text file, an empty vector of string
variables, and an empty vector of double
variables. The vector of strings
gets “filled up” by the words in the text file, and the vector of doubles
gets “filled up” by the numbers in the text file.
The Homework assignments include two common patterns that are particularly applicable for this function:
- Strings, Streams, and I/O - Reading In Multiple Pieces of Data
- Vectors - “Fill As You Go”
Review the examples in the Homework for these patterns and consider how to adapt the examples for what you need to do here.
The readReview
Function
This is a function that will read in words from a review text file and store each word in the text file to a vector of string
variables.
Description
// Reads in a review from an input stream and stores each individual word from the review
// into the vector 'reviewWords', in the same order they appeared in the input.
// PARAMETERS:
// input - An input stream from which the review is read. For this project, we assume the
// input stream is a file input stream, with words separated by whitespace.and weights are read.
// reviewWords - An "output parameter", passed by reference, into which the review words are stored.
void readReview(istream &input, vector<string> &reviewWords) {
// TODO: Write an implementation for this function!
}
Algorithm
This function passes in a filestream connected to a text file, and it passes in an empty vector of string
variables. The vector of strings
gets “filled up” by the words in the text file.
The Homework assignments include two common patterns that are particularly applicable for this function:
- Strings, Streams, and I/O - Reading Until the End
- Vectors - “Fill As You Go”
Review the examples in the Homework for these patterns and consider how to adapt the examples for what you need to do here.
The wordWeight
Function
This function determines the weight of a word in one of the hotel reviews. If the word matches one of the keywords, then the function returns the value of the keyword’s weight; otherwise, the function returns 0.0
.
Description
// Returns the weight of a given word by looking it up in the provided vectors.
// The keywords and their corresponding weights are provided as vector parameters.
// It is assumed that these are parallel vectors, so that weights[i] is the weight of keywords[i].
// If a word does not appear in the keywords vector, its weight is zero.
// PARAMETERS:
// word - The word to be looked up
// keywords - A vector containing all keywords.
// weights - A vector containing weights corresponding to each keyword.
double wordWeight(const string &word, const vector<string> &keywords, const vector<double> &weights) {
// TODO: Write an implementation for this function!
}
Algorithm
This is a function that is passed one word from the vector that represents a hotel review. The function is also passed the parallel vectors that contain the keywords and their corresponding weights.
The function checks the review’s word against each of the keywords to see if they match. If a match is found, the function returns the weight of the keyword that matches the review word. If no match is found, the function should return 0.0
for the word’s weight.
The Homework assignments include two common patterns that are particularly applicable for this function:
- Vectors - Searching for a Value
- Vectors - Accessing Parallel Vectors
Review the examples in the Homework for these patterns and consider how to adapt the examples for what you need to do here.
The reviewScore
Function
This function calculates the score of a review. The review’s score is the sum of the weights of all the words in the review, so the wordWeight
function will be helpful here!
Description
// Computes and returns the overall score for a review. This is the sum of the weights of
// the individual words in the review. Note that a word may appear more than once in the review,
// and if this happens it's weight is added in multiple times as well. The keywords and their
// corresponding weights are provided as vector parameters. It is assumed that these are parallel
// vectors, so that weights[i] is the weight of keywords[i]. If a word does not appear in the
// keywords vector, its weight is zero.
// HINT: Make a copy of the reviewWords vector using a separate variable. Then, call the
// preprocessReview() function on the copy. Having a preprocessed copy of the words
// will allow you to compare against the keywords.
// PARAMETERS:
// reviewWords - A vector containing the individual words in the review.
// keywords - A vector containing all keywords.
// weights - A vector containing weights corresponding to each keyword.
double reviewScore(const vector<string> &reviewWords, const vector<string> &keywords, const vector<double> &weights) {
// TODO: Write an implementation for this function!
}
Algorithm
This function is passed three parameters:
- a vector of
strings
representing a review, - a vector of
strings
representing the keywords, and - a vector of
doubles
representing the keyword weights.
- Make a copy of the review to work with in this function, so that you keep the original version to potentially work on later.
- Preprocess the copy of the review to standardize the text and make it easier to search for keywords. Important! Call the
preprocessReview
helper function already written for you to do the preprocessing. Look at the description ofpreprocessReview
to understand how to call it and what it does. - Iterate through each word of the preprocessed review, get its weight using the
wordWeight
function, and add the word’s weight to a running total score. - After all words are processed, the function returns the total score.
For an example of the process of scoring the review, refer to the keyword search process described earlier. Note that in this case, words which are not identified as keywords will simply have a weight of 0.0
returned from wordWeight
, so that it is safe to add them in without affecting the overall score.
The Homework assignments include a common pattern that is particularly applicable for this function:
- Vectors - Using an Accumulator
Review the example in the Homework Assignment for this pattern and consider how to adapt the example for what you need to do here.
Testing Your Functions
When working with complex programs made up of several different functions, it’s important to be able to test each function individually to make sure it is working correctly on its own. This is called unit testing. A strategy for implementing a set of unit tests is to write a separate main
function (in a different file) that intentionally calls each function one at a time on a variety of inputs and confirms that the outputs from the functions match the expected correct answer.
A few sample unit tests can be found in unit_tests.cpp
file provided with the project. The comments included in that file describe the way the unit testing process works. We highly encourage you to use these samples as a starting point and write additional unit tests of your own.
In order to compile and run the unit tests, use the following commands:
g++ -std=c++11 -Wall -pedantic unit_tests.cpp reviews.cpp -o unit_tests
./unit_tests
Note the difference from the compilation command for the regular program. We’ve basically kept all the review functions from reviews.cpp
, but we’ve swapped in unit_tests.cpp
for evaluateReviews.cpp
, which means the main
function containing the tests will be used instead.
You do not need to turn in the unit_tests.cpp
file to the autograder.
Task 2: Driver Program
The second task in this project is to write a driver program that will evaluate the hotel reviews as truthful, deceptive, or uncategorized, and create a summary report of the analysis. The driver program will use the functions in the Reviews library to:
- Read in the keywords and weights from a file
- Read in and evaluate hotel reviews from several different files
- Write a summary report to
report.txt
Description of evaluateReviews.cpp
The driver program is written in the evaluateReviews_starter.cpp
file. Don’t forget to remove _starter
from the filename before you try to compile with this file!
// Add any #includes for C++ libraries here.
// We have already included iostream as an example.
#include <iostream>
// The #include adds all the function declarations (a.k.a. prototypes) from the
// reviews.h file, which means the compiler knows about them when it is compiling
// the main function below (e.g. it can verify the parameter types and return types
// of the function declarations match the way those functions are used in main() ).
// However, the #include does not add the actual code for the functions, which is
// in reviews.cpp. This means you need to compile with a g++ command including both
// .cpp source files. For this project, we will being using some features from C++11,
// which requires an additional flag. Compile with this command:
// g++ --std=c++11 evaluateReviews.cpp reviews.cpp -o evaluateReviews
#include "reviews.h"
using namespace std;
const double SCORE_LIMIT_TRUTHFUL = 3;
const double SCORE_LIMIT_DECEPTIVE = -3;
int main(){
// TODO: implement the main program
}
Algorithm
This is the general algorithm for the driver program (the main
function in evaluateReviews.cpp
):
- Open a file input stream for the
keywordWeights.txt
file.- If the file cannot be opened, print
"Error: keywordWeights.txt could not be opened."
tocout
- Use
return 1;
to exit themain
function (recall that a nonzero return value from main reports an error).
- If the file cannot be opened, print
- If the keyword weights file was opened, read the keywords and their weights into parallel vectors. (Which Reviews library function would be helpful here?)
- For each hotel review,
- Create the filename (e.g.
review00.txt
) (Which Reviews library function would be helpful here?) - Open a filestream to the file
- Read each word of the review into a vector of
string
variables (Which Reviews library function would be helpful here?) - Calculate the review’s score (Which Reviews library function would be helpful here?)
- Determine the review’s category:
- truthful: score > 3.0
- deceptive: score < -3.0
- uncategorized: otherwise
- Track the review with the highest score and the review with the lowest score
- Create the filename (e.g.
- Write out a summary of the truthfulness and deceptiveness of the reviews to a file named
report.txt
- Print
"Program complete. Check report.txt file for summary."
tocout
to indicate that the program has finished. Do not print the contents ofreport.txt
.- Note: There should be two spaces after
Program complete.
not just one. Your browser may “collapse” these two spaces into one space when you view these specs.
- Note: There should be two spaces after
You know that you have processed all the reviews once you try to open a file input stream for the next review file and it does not open successfully (because the file doesn’t exist!). You don’t need to print an error message in this case - simply have your program stop trying to read more reviews.
You are guaranteed that there will always be at least one review and that there will be no gaps in the numbering of the hotel reviews. There will be no more than 100 hotel reviews.
Make sure your summary of the reviews is being written to the report.txt
file, not cout
.
You may write additional helper functions in evaluateReviews.cpp
! (See the Additional Helper Functions section for some suggestions.) However, do not add additional functions to reviews.cpp
that you intend to use in evaluateReviews.cpp
, since this would require modifying the review.h
file to contain their prototypes as well, which is prohibited for this project. Instead, put your own additional helper functions into the evaluateReviews.cpp
file.
Details of the Summary Report
There is one output file for this program: report.txt
. This file contains a summary report of your hotel review analysis. The summary report should contain:
- A header line, reading exactly
review score category
- A line of information for each report including the following (separated by spaces)
- Review number (0, 1, 2, etc.)
- The overall score of the review
- The categorization (truthful, deceptive, or uncategorized)
- [an extra blank line]
- The total number of reviews analyzed
- The total number of truthful reviews
- The total number of deceptive reviews
- The total number of uncategorized reviews
- [an extra blank line]
- The number of the review with the highest score (You may assume there are no “ties”.)
- The number of the review with the lowest score (You may assume there are no “ties”.)
Test Case for Evaluating Reviews
Test cases are very important for this project. You should first use unit tests for your helper functions so you know they are working correctly. Once you have verified that your helper functions work correctly, continue developing your code to eventually create the report.txt
file that summarizes the analysis.
You are provided with 20 hotel reviews as test data. Reviews 0-9 are known to be truthful and reviews 10-19 are known to be deceptive; although, your algorithm won’t be able to correctly categorize ALL of the reviews. The sample_report.txt
file contains the correct output for your report.txt
file. Make sure you can recreate what is in sample_report.txt
exactly!
review score category
0 3.49 truthful
1 3.33 truthful
2 15.68 truthful
3 4.43 truthful
4 4.14 truthful
5 11.29 truthful
6 20.61 truthful
7 -2.89 uncategorized
8 2.71 uncategorized
9 11.93 truthful
10 0.03 uncategorized
11 -13.06 deceptive
12 -3.66 deceptive
13 -8.46 deceptive
14 5.18 truthful
15 -18.68 deceptive
16 -17.88 deceptive
17 -21.61 deceptive
18 -11.25 deceptive
19 -15.08 deceptive
Number of reviews: 20
Number of truthful reviews: 9
Number of deceptive reviews: 8
Number of uncategorized reviews: 3
Review with highest score: 6
Review with lowest score: 17
FAQs
Here are frequently asked questions about this project. If you are stuck on something, start here!
General FAQs
How do I read multiple different data types from a file?
>>
) reads information up until it encounters whitespace (spaces, tabs, new lines) and then stores that information into the variable you specify. So if there is a space in between different types of data, we can use >>
as many times as we need to read in each piece of data and store it in the appropriate variable as long as the variables are declared as the correct types beforehand.
For example, if we have a file named
apples.dat
with the data:
4 apples 425 gramsWe can read in each piece of information by declaring a variable to hold each piece of information and then using
>>
a bunch of times to read each piece of information into the appropriate variable:
// declare variables to hold data int num; string fruit_type; double mass; string units; // open an input filestream ifstream apple_data("apples.dat"); // read each piece of data, in the order it appears in the file apple_data >> num >> fruit_type >> mass >> units;You can also refer to Homework 16.11 for another example of this approach.
What does everyone mean by "inserting a cout
statement" to debug something?
cout
statement (along with any useful context) can help you figure out exactly where / why things are going wrong in your code. For example, if your program is stopping unexpectedly and you're not sure why, you can put "checkpoint" cout
statements in your code like this:
// some code here cout << "Checkpoint 1" << endl; // some more code cout << "Checkpoint 2" << endl; // even more codeThis way, if you run your code, and you only see
Checkpoint 1
in your terminal, you know the issue is somewhere in the lines of code between between those checkpoints. This helps you narrow down where any possible issues could be. This is very helpful with making sure if
statements or loops run correctly!
See the lectures on debugging for more examples of using
cout
statements to debug errors.
My test cases aren't working when there are less than 100 reviews but the "all review files" test cases are correct. What am I doing wrong?
My max and min are off for some test cases on the autograder.
My output is the same, but autograder gives me a 0.
- Check for common misspellings (e.g. "reviews" misspelled as "reveiws" and other misspellings)
- Check for correct whitespace around words (spaces, newlines, etc.). You can toggle the Autograder output to show the whitespace characters to help with this.
- Check for the correct return value from your
main()
function. Remember that if a required file cannot be opened, you should be returning a value of1
(to indicate an error so the user knows something is wrong). - Double check that you are submitting the correct file! Sometimes, students create a duplicate by accident and submit the duplicate.
My code doesn't compile because of an 'undefined reference'
or 'linker error'
.cpp
files.
Both the
evaluateReviews.cpp
and reviews.cpp
files need to include the reviews.h
header file so that the two files can be used together. So, both files need to have this line of code near the top of the file where the other #include
statements are:
#include reviews.hThe compiler also needs to know about both the
evaluateReviews.cpp
file and the reviews.cpp
file, since the the reviews.cpp
file is where your helper functions are. So you should be compiling with this line at the terminal:
g++ --std=c++11 evaluateReviews.cpp reviews.cpp -o evaluateReviews
Refer to the project specifications for more details on why you need to do this (and for the command to compile your unit tests).
I'm getting an out_of_range
error or a segmentation fault (aka seg fault)!
.at()
for your indexing, then .at()
is telling you that you are indexing out of range. If you are using []
for your indexing, it can be a little harder to find track down where the indexing out of range is happening, but you can still do it. In general, you likely have one of these situations going on somewhere in your code:
- You could have an index variable that is negative. For example, if
i
is-1
and then you have a statement with a vector that includes.at(i)
, that would cause this error. - You could have an index variable that is
0
, but the vector is empty. For example, ifi
is0
and then you have a statement with an empty vector that includes.at(i)
, that would cause this error. - You could have an index value that is too high. For example, if
i
is23
but your vector only has 19 elements, and you have a statement with this vector that includes.at(i)
, that would cause this error.
cout
statements with messages like, "About to call helper function XXX...."
or "This is line XX"
. After you save, compile, and run your program, you will see your messages print to the terminal. Eventually, you will see that "out of range" error. You can look to see what was the last message to print, and then you know that your out of range indexing is somewhere between the last message that printed and the next message (that did NOT print).
My code compiles with no errors and the executable is built, but when I run the executable the report.txt
file is not generated.
return
(or return 1;
) statement somewhere it is not supposed to be. You should only have return 1;
if the keywords file or the weights file cannot be opened.
You can also add some
cout
statements to find how far into your program you get before the program ends; that will help you narrow down where things are going wrong.
Task 1 FAQs
What does it mean by "Make a copy of the review to work within this function?" Am I supposed to make a copy of the review within the function? Or are we supposed to make a physical copy of the review file on our computer to test the function?
preprocess
function on the "copy of the review" (the new vector you made) without losing access to the original vector.
Why do I need to make a copy of the review in reviewScore
?
reviewScore
, paying attention to the parameter types. All of the parameters are vectors and are being passed by reference. If we preprocess the original review directly, instead of making a copy, it will permanently change the review in your driver program as well. What problems could this cause later on?
My wordWeight
function always returns 0
Task 2 FAQs
How do you know how many times to loop if we don’t know how many files we will need to read?
for
loop that loops 100 times. We will need to break out of the loop if a file does not exist. For example, if there are 20 reviews, then the 21st review (which would be file review20.txt
) does not exist. So we can have an if
statement that checks whether the file did not open correctly and, if so, call break
to exit out of the loop early.
See Homework 14.6 for an example of using
break
to exit a loop early. See the Redact Info program in the Project 3 Overview lecture for an example of checking to see if a file could be opened.
I'm getting all zeros for the review scores.
.cpp
files so that your program can find those files and open them. This is a common reason for your program running but behaving incorrectly.
Other reasons include: out of scope variables, accidentally resetting your variables to 0, and not including the correct header files.
My scores are always increasing for some reason.
I never enter the loop to check all the reviews.
.cpp
files, so your progam can't find the review files in order to open them. First, make sure that that you have unzipped/extracted the reviews from the .zip
folder. Then, make sure that all 20 review files (review00.txt
-- review19.txt
) in the same folder as your .cpp
files. If you have done this correctly, you will see the review files in the same folder as your .cpp
files. You should not have them in a separate folder. See the comparison below:
GOOD: The review files are in the same folder as the
.cpp
files.
BAD: The review files are in a subfolder called
reviewFiles
that is within the folder where the .cpp
files. If your workspace looks like this, move the review files out of the subfolder and into the main folder with your Project 3 code.
My program runs but I just don't get correct behavior / values.
.cpp
files so that your program can find those files and open them. This is a common reason for your program running but behaving incorrectly. Also, read the other FAQs about getting incorrect values.
How do I know when I've read all the reviews / know when to stop?
break
function if a condition is met.