quantitative data analysis Archives - GeoPoll
https://www.geopoll.com/blog/tag/quantitative-data-analysis/

Quantitative Data Analysis
https://www.geopoll.com/blog/quantitative-data-analysis/ | Thu, 21 Jan 2021

The post Quantitative Data Analysis appeared first on GeoPoll.

Once quantitative data has been gathered and cleaned, the next step in the research process is to analyze the data in order to glean insights from it. This step is crucial: data must be analyzed properly before a researcher can determine which findings are significant, report on them, or make a judgment on their hypothesis. If data is not analyzed with care, findings may be misrepresented, which can lead to decisions being based on statistics that do not accurately represent the entire dataset.

For example, one might use an average to represent a fact such as the amount customers are willing to pay for ice cream. However, if 95% of respondents stated that they would spend $5 or less on a pint of ice cream, and 1% of respondents stated that they would spend $100, the average would be skewed by the 1% who would spend much more. In this case, a researcher may decide that a different statistic, such as the median, would more accurately represent the findings. Making these judgments is an important step in the quantitative data analysis process, as is ensuring that data is properly cleaned and coded prior to analysis.
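The skew described above is easy to demonstrate with Python's standard statistics module. This is a sketch: the price data below is invented to mirror the ice cream example, not drawn from an actual survey.

```python
from statistics import mean, median

# Hypothetical willingness-to-pay data: most respondents would pay $5
# or less for a pint, but a few claim they would pay $100.
prices = [4.0] * 50 + [5.0] * 45 + [100.0] * 5

print(mean(prices))    # 9.25 -- pulled upward by the $100 answers
print(median(prices))  # 4.5  -- unaffected by the extreme values
```

The mean suggests a typical customer would pay over $9, even though 95% of these respondents said $5 or less; the median stays close to what most people actually answered.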

Quantitative Analysis Methods

Quantitative data is analyzed using statistical methods, since it consists of numerical values from which statistics can be calculated. Data from a quantitative dataset, such as survey results, is usually loaded into a program such as Excel or the statistical software SPSS, which enables researchers to quickly create tables and charts to examine findings. Often the first step in analyzing a dataset is to view top-level findings using descriptive statistics such as mean, median, and mode.

Descriptive Statistics

In the below definitions, we will use the example of a survey with 400 respondents who were asked to rate their opinion of chocolate ice cream on a scale of 1 ‘strongly dislike’ to 5 ‘strongly like’. The data indicated that 100 rated an ice cream flavor a ‘5’, 200 rated it a ‘4’, and 100 rated it a ‘3’.

  • Mean or average: The numerical average of a set of numbers.
    • In the above example, the average rating would be ((5×100)+(4×200)+(3×100))/400= 4
  • Median: The median is the midpoint in a set of numbers.
    • In the above example, with 400 respondents the median is the average of the 200th and 201st values in the sorted data. Here both values are 4, so the median is 4, but depending on the dataset, the median can differ from the average.
  • Mode: The number that occurs the most often in a dataset.
    • In the above example, this would also be 4 as it occurred 200 times, while 5 and 3 only occur 100 times each.
  • Range: A statement that represents the lowest and highest numbers in a dataset.
    • In the above example, the range would be from 3 to 5.
  • Distribution or Percentage: The percent represented by each category within a group, out of the total (100%).
    • In the above example, instead of looking at the dataset as a whole, this would report that 25% rated the ice cream a ‘5’, 50% rated it a ‘4’, and 25% rated it a ‘3’.
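All of the statistics defined above can be checked against the 400-respondent example using Python's standard statistics module; the list below simply reconstructs the dataset from the counts given (100 fives, 200 fours, 100 threes).

```python
from statistics import mean, median, mode

# The example dataset: 100 ratings of 5, 200 of 4, 100 of 3.
ratings = [5] * 100 + [4] * 200 + [3] * 100

print(mean(ratings))               # 4
print(median(ratings))             # 4
print(mode(ratings))               # 4
print(min(ratings), max(ratings))  # range: 3 5

# Distribution: percentage of the total per rating category.
for value in (5, 4, 3):
    print(value, 100 * ratings.count(value) / len(ratings))
```

Here mean, median, and mode all agree; the skewed-price example earlier shows why they will not always.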

Cross Tabulations

After examining descriptive statistics, researchers may use cross-tabulations to dig deeper into a dataset. A cross-tabulation, or crosstab, is a way to show the relationship between two variables and is often used to compare results by demographic group. For the above example, we could create crosstabs to show results by age group.

Crosstabs can also be created to examine one datapoint by another, such as if those who rate chocolate ice cream highly also rate vanilla ice cream highly, or if there is a different relationship between the two variables. Crosstabs are useful to better understand the nuances of a dataset and the factors that may influence a datapoint.
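At its core, a crosstab is just a count of every combination of two variables. A minimal sketch using collections.Counter, with made-up age-group data:

```python
from collections import Counter

# Hypothetical paired responses: (age group, chocolate rating).
responses = [
    ("18-24", 5), ("18-24", 4), ("18-24", 4), ("18-24", 3),
    ("25-34", 4), ("25-34", 4), ("25-34", 3), ("25-34", 3),
]

crosstab = Counter(responses)  # one count per (group, rating) cell
for (group, rating), count in sorted(crosstab.items()):
    print(group, rating, count)
```

In practice a tool like Excel, SPSS, or R would lay these counts out as a table with one variable in rows and the other in columns, but the underlying computation is the same.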

Calculating Statistical Significance

When researchers are looking to prove or disprove hypotheses, they will often also use measures to calculate the statistical significance of their findings. Measures of statistical significance demonstrate if a finding is merely due to chance or if it is a significant finding that should be reported on. In the above example, without calculating statistical significance we cannot be sure if the difference in results between those aged 18-24 and 25-34 is due to the difference in age groups, or if the findings are a coincidence based on the sample that was selected and not related to age.

Statistical significance is usually represented by a statistic called a p-value. A p-value is a calculated number between 0 and 1, and the lower the p-value, the less likely it is that the results were due only to chance. Typically, a p-value of less than 0.05 is regarded as statistically significant, as it means there is a less than 5% likelihood that the results were due to chance. While a p-value under 0.05 doesn’t necessarily mean that the stated hypothesis is true, it decreases the chance that any differences in the dataset are occurring by chance. Researchers who are running tests to make decisions, for example to determine whether populations prefer vanilla or chocolate ice cream in order to make purchasing decisions, should use a test of significance in order to have more confidence in their decision making.
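One common significance test for comparing two groups on a yes/no measure is the two-proportion z-test. The sketch below implements it with only the standard library; the counts are hypothetical, and the pooled normal approximation assumes reasonably large samples.

```python
import math

def two_proportion_p_value(x1, n1, x2, n2):
    """Two-sided p-value for the difference between two sample
    proportions, using the pooled normal approximation."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability

# Hypothetical: 120 of 200 respondents aged 18-24 prefer chocolate,
# versus 90 of 200 aged 25-34. Is that difference significant?
p = two_proportion_p_value(120, 200, 90, 200)
print(p < 0.05)  # True -- unlikely to be due to chance alone
```

With identical proportions in the two groups, the same function returns a p-value of 1.0, reflecting no evidence of a real difference.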

Programs including Excel, R, and SPSS can calculate the significance of findings through a series of steps. If you work with a full-service research agency such as GeoPoll, we can run statistical significance tests for you and include the resulting data in our data analysis.

Conduct Quantitative Data Analysis with GeoPoll

GeoPoll is a research company that gathers data for international organizations, governments, consumer brands, and media houses which enables better decision making. Our services range from study and questionnaire design to data analysis, including the creation of data tables, crosstabs, and full research reports. To learn more about our capabilities or get a quote for your next project, please contact us.

Data Cleaning: Steps to Clean Data
https://www.geopoll.com/blog/data-cleaning-steps-to-clean-data/ | Tue, 05 Jan 2021

The post Data Cleaning: Steps to Clean Data appeared first on GeoPoll.

Following the collection of data through a survey or other research method, data must be cleaned. The data cleaning process, also known as data scrubbing or data cleansing, can have a huge impact on the reliability and validity of your final data, as it ensures that you are only using the highest-quality data to perform your analysis. By rushing or eliminating the data cleaning step, you run the risk of including false, misleading, or duplicated records in your final dataset. Following a thorough data cleaning process will minimize errors made due to data that is formatted incorrectly.

Steps to Clean Data

The steps to clean a dataset may vary slightly depending on the research methodology and on whether the resulting data is largely quantitative or qualitative. However, the below represent some of the most commonly used steps in the data cleaning process.

1. Remove Duplicate and Incomplete Cases:

Datasets may sometimes include duplicate cases if a respondent accidentally took a survey twice, data was combined from multiple sources, or there was an error when retrieving the dataset. Depending on the data collection tool you use, it is also possible that an initial dataset includes incomplete cases, for example if a survey respondent answered only half of the questions. The first step in data cleaning is to remove any duplicate or incomplete cases so that you are examining a set of unique and complete cases.
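A minimal sketch of this step in Python; the respondent IDs and question fields are invented for illustration.

```python
# Hypothetical raw cases keyed by respondent id; None marks a skipped question.
raw = [
    {"id": "r1", "q1": 4, "q2": 5},
    {"id": "r1", "q1": 4, "q2": 5},     # duplicate submission
    {"id": "r2", "q1": 3, "q2": None},  # incomplete case
    {"id": "r3", "q1": 5, "q2": 4},
]

seen, clean = set(), []
for case in raw:
    if case["id"] in seen:
        continue  # drop duplicate cases
    if any(value is None for value in case.values()):
        continue  # drop incomplete cases
    seen.add(case["id"])
    clean.append(case)

print([case["id"] for case in clean])  # ['r1', 'r3']
```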

2. Remove Oversample:

In many cases, particularly when conducting survey research, a researcher may collect more responses than they need. For example, you may be aiming to gather 500 completed survey responses, of which 250 identify as female and 250 identify as male, but end up gathering 700 completed responses, 300 who are female and 400 who are male. As you have 50 extra female completes and 150 additional male completes, you will need to cut back the data so that the sample is equally representative of male and female respondents. Researchers should use a randomized method to remove any oversample in order to meet the sample requirements so that each respondent has an equal chance of being included or excluded in the final dataset.
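Randomized removal of the oversample can be done with random.sample, which gives every respondent in a group an equal chance of staying in the final dataset. A sketch with invented respondent IDs:

```python
import random

# Hypothetical completes: 300 female, 400 male; the target is 250 each.
female_ids = [f"f{i}" for i in range(300)]
male_ids = [f"m{i}" for i in range(400)]

random.seed(7)  # seeded only so the random cut is reproducible for auditing
final_sample = random.sample(female_ids, 250) + random.sample(male_ids, 250)
print(len(final_sample))  # 500
```

Sampling the respondents to keep, rather than hand-picking rows to delete, is what makes the cut unbiased.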

3. Ensure Answers are Formatted Correctly:

Raw data may come in several different formats when it is first accessed – for example, a multiple-choice survey question about ice cream flavors may record answers as the numbers 1, 2, and 3, when in the question they represented the text choices Vanilla, Chocolate, and Strawberry. Depending on how the data will be analyzed, researchers may want to replace the numerical data with the textual data. If data is being combined from multiple surveys or data sources, two different words may be used to represent the same thing – for example ‘Not sure’ vs ‘Unsure’. To avoid these cases being treated as unique answers, they should be combined.
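Both recoding steps amount to simple lookup tables; the mappings below are illustrative rather than from a real codebook.

```python
# Hypothetical lookup tables: numeric codes to text, plus synonym merging.
flavor_labels = {1: "Vanilla", 2: "Chocolate", 3: "Strawberry"}
synonyms = {"Not sure": "Unsure"}

raw_answers = [1, 2, 2, 3, "Not sure", "Unsure"]

recoded = []
for answer in raw_answers:
    label = flavor_labels.get(answer, answer)   # numeric code -> text label
    recoded.append(synonyms.get(label, label))  # fold synonyms into one label

print(recoded)
# ['Vanilla', 'Chocolate', 'Chocolate', 'Strawberry', 'Unsure', 'Unsure']
```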

4. Remove Nonsense Answers and Unreadable Data:

Datasets may include nonsense answers, such as those including symbols or other words or numbers which do not make sense in the context of the question or field. It is also possible when importing data from multiple files that some data may be unreadable. These cases should be removed from the final dataset.
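For a numeric field, a simple validity check can screen out answers that cannot be parsed at all; the answers below are invented examples.

```python
# Hypothetical free-text answers to "How many siblings do you have?"
answers = ["2", "3", "@@##", "ten???", "1", "\ufffd\ufffd"]

# Keep only answers that parse as a whole number; the rest are nonsense
# or unreadable (the last entry is mojibake from a bad file import).
clean = [int(a) for a in answers if a.strip().isdigit()]
print(clean)  # [2, 3, 1]
```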

5. Identify and Review Outliers:

Outliers are data points that lie outside the majority of responses and should be carefully reviewed and validated before being included in the final dataset. While some outliers may be valid, such as one respondent stating they have 10 siblings when the majority of respondents have 2 or 3 siblings, others may demonstrate that the respondent did not understand a question or is falsifying responses, such as an answer which indicates a respondent would pay $1000 for a pint of ice cream. Outliers that seem out of the reasonable realm of possibility should be removed from a dataset as they will skew key calculations.
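A common way to flag outliers for review is the interquartile-range rule: anything more than 1.5 IQRs outside the middle 50% of the data gets a second look. A sketch on invented sibling counts:

```python
from statistics import quantiles

# Hypothetical sibling counts; 10 may be genuine, 200 almost certainly is not.
siblings = [2, 3, 2, 3, 2, 3, 2, 10, 2, 3, 200]

q1, _, q3 = quantiles(siblings, n=4)        # lower and upper quartiles
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # 1.5-IQR fences
flagged = sorted(x for x in siblings if x < low or x > high)
print(flagged)  # [10, 200] -- review these by hand before deciding
```

Note that the rule only flags values; as the text says, a flagged answer like 10 siblings may still be valid and should be reviewed rather than automatically removed.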


6. Code Open-Ended Data:

The coding of open-ended response data is an entire process unto itself; however, it is also an important part of data cleaning. Datasets that include open-ended data can be particularly time-consuming to clean, as responses can be lengthy, unrelated to the question at hand, or hard to decipher. In order to glean statistical insights from qualitative responses, open-ended responses may be coded into categories, a process that involves first reviewing all responses manually to create categories, and then going through the open-ended data and placing each response into a category. If the dataset is in multiple languages, this step may also include translating responses into the language the analysis will be conducted in.
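Once the categories exist from the manual review pass, a keyword lookup can do a first automated coding pass, leaving only unmatched responses for hand coding. The categories and responses here are made up for illustration.

```python
# Hypothetical categories built from a manual review of all responses.
categories = {
    "price": ("expensive", "cheap", "cost", "price"),
    "taste": ("sweet", "flavor", "taste", "delicious"),
}

def code_response(text):
    """Assign an open-ended response to the first matching category."""
    text = text.lower()
    for category, keywords in categories.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "other"  # falls through to manual review

responses = ["Too expensive for a pint", "Love the flavor", "n/a"]
print([code_response(r) for r in responses])  # ['price', 'taste', 'other']
```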

7. Check Data Consistency:

Check the logical relations between correlated data points and ensure there are no inconsistencies, such as contradictions or gaps in the data. Any inconsistencies between the data and the questionnaire should be flagged so that you can decide on a way forward, or whether the data should be excluded from the final dataset.
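Consistency checks usually reduce to logical rules between related fields. A sketch with one invented contradiction rule:

```python
# Hypothetical cases: a respondent who reports having no children should
# not also report a positive number of children.
cases = [
    {"id": "r1", "has_children": "Yes", "number_of_children": 2},
    {"id": "r2", "has_children": "No", "number_of_children": 3},  # contradiction
    {"id": "r3", "has_children": "No", "number_of_children": 0},
]

flagged = [c["id"] for c in cases
           if c["has_children"] == "No" and c["number_of_children"] > 0]
print(flagged)  # ['r2'] -- flag for a decision before the final dataset
```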

8. Perform Final Quality Assurance Checks:

After going through the above steps, researchers will still want to perform manual quality assurance checks of the data before starting data analysis. This final step should examine the dataset in its entirety, looking for anomalies in the data for any individual question or data point and double-checking that data is formatted correctly.

Cleaning Survey Data

Some of the above steps can be performed by data cleaning tools, including SAS and R software; however, manual oversight is required to ensure no errors or inconsistencies in a dataset are missed. GeoPoll prides itself on delivering high-quality and accurate data from our mobile-based surveys and other research methods. Our research team performs data cleaning and coding on each of our datasets before they are delivered to clients. To learn more or to contact a member of our team, please contact us.
