USC Health Data Lab

Data Terms Glossary


A

Aggregation is the process of combining and summarizing large sets of data into a more compact and manageable form. This involves grouping individual data points or values together based on certain attributes or criteria and then calculating summary statistics for these groups, such as sums, averages, counts, or percentages. Aggregating data helps reveal patterns, trends, and insights that might not be immediately evident in individual data points. It is often used for creating reports, visualizations, and dashboards that provide an overview of trends and performance.
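
As an illustration, here is a minimal Python sketch using pandas; the clinic and charge columns are hypothetical examples, not a prescribed structure:

    import pandas as pd

    # Hypothetical visit-level data: one row per patient visit
    visits = pd.DataFrame({
        "clinic": ["East", "East", "West", "West", "West"],
        "charge": [120.0, 80.0, 200.0, 150.0, 90.0],
    })

    # Aggregate individual rows into per-clinic summary statistics
    summary = visits.groupby("clinic")["charge"].agg(["count", "sum", "mean"])
    print(summary)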

B

Binary refers to a numbering system based on two digits, 0 and 1. It is a fundamental system used in digital computing, where data and instructions are represented using sequences of binary digits, also known as bits. Some columns in data sets use 0 or 1 to indicate a response to the column title. For example, a column titled “Is current patient?” would have rows with a value of 0 to signify that the person or value in the row is not a current patient and a 1 to indicate a current patient.

In data analysis and statistics, bins refer to the intervals or ranges into which data is grouped or divided in order to organize, summarize, and visualize it. Binning involves categorizing numerical data into specific groups or intervals to simplify data representation and analysis. Bins are used in various data visualization techniques, such as histograms or bar charts, to display the distribution of data. Each bin represents a specific range of values, and the frequency or count of data points falling within that range is represented by the height or length of the corresponding bar in the chart. Too few bins may result in oversimplification, while too many bins may lead to excessive detail in the visualization. Bins allow for more straightforward interpretation and comparison across different categories or intervals.
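
A brief Python sketch using pandas illustrates binning; the age values and bin edges are arbitrary examples:

    import pandas as pd

    ages = pd.Series([23, 35, 47, 52, 61, 68, 74, 80])

    # Group ages into four bins; each value falls into exactly one interval
    bins = pd.cut(ages, bins=[0, 40, 60, 80, 100])

    # Count how many observations fall in each bin (the histogram heights)
    print(bins.value_counts().sort_index())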

A Boolean variable is a type of variable in programming that can hold only two possible values: true or false. It is used to represent logical or binary states, such as on/off, yes/no, or true/false conditions. Boolean variables play a crucial role in decision-making within programs. They are commonly used in “IF” statements, where a software program evaluates a condition and, based on whether it’s true or false, takes different actions or follows different paths. Similarly, Boolean variables are essential in “WHILE” loops, allowing programs to repeat certain steps or actions until a specific condition becomes false.
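
A short Python sketch illustrates how a Boolean variable drives an IF statement and a WHILE loop; the variable names are illustrative only:

    is_current_patient = True   # a Boolean variable: only True or False

    if is_current_patient:
        print("Send appointment reminder")
    else:
        print("Send re-enrollment letter")

    # A WHILE loop repeats until its condition becomes false
    reminders_sent = 0
    while reminders_sent < 3:
        print("Reminder", reminders_sent + 1)
        reminders_sent += 1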

Business intelligence (BI) is a set of technologies and strategies that involve collecting, analyzing, presenting, and visualizing data to support and inform decision-making and business processes within an organization. BI aims to transform raw data into meaningful insights, enabling stakeholders to better understand their business performance, opportunities, and challenges and make data-driven decisions.

C

Charts and graphs are visual representations that display data using symbols, shapes, and colors to communicate information and patterns. The terms “graph” and “chart” are often used interchangeably. They are used to simplify complex datasets, making it easier to understand and interpret data by presenting it in a visually appealing and informative manner. They come in different forms, such as bar charts, pie charts, line graphs, scatter plots, and more, each tailored to represent specific aspects of the data. The use of charts and graphs in data analytics aims to make data more accessible and understandable for a wide range of audiences, from experts to non-experts. By visualizing data, charts facilitate quick comprehension of key insights, trends, and distributions, aiding in effective decision-making and analysis.

A CSV (Comma-Separated Values) file is a plain text file format, or a file that only contains text, used to store and exchange tabular data. It is a simple and widely supported file format that organizes data into rows and columns, with each line representing a row and each value separated by a comma or other delimiter. In a CSV file, the first row often contains the column names or headers, while subsequent rows have the corresponding data values. Each value is typically enclosed in quotation marks if it contains special characters, such as commas or line breaks. CSV files are human-readable and can be easily opened and edited with spreadsheet software or text editors. CSV files are often used for tasks such as data import/export, data migration, and data integration, making them a common choice for exchanging and transferring data in a simple and standardized format.
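
For illustration, a minimal Python sketch that reads a CSV file with the standard csv module; the file name and column headers (patients.csv, name, age) are hypothetical:

    import csv

    # Read a small CSV file; the first row is treated as the header
    with open("patients.csv", newline="") as f:
        reader = csv.DictReader(f)
        for row in reader:
            print(row["name"], row["age"])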

D

A dashboard is a visual representation of data that provides a consolidated and customizable view of key information, metrics, and performance indicators in a user-friendly and easily digestible format. It presents relevant data points, charts, graphs, tables, and other visual elements on a single screen or interface, allowing users to monitor and analyze data at a glance. Dashboards can be interactive, allowing users to drill down into details, apply filters, or manipulate the displayed data to gain deeper insights. The purpose of a dashboard is to facilitate data-driven decision-making by presenting complex and comprehensive information in a visually appealing and intuitive manner. It enables users to monitor performance, identify trends, spot anomalies, and make informed decisions based on the displayed data.

Data analytics is the process of examining and interpreting large sets of data to uncover meaningful patterns, insights, and trends. Information is extracted from raw data and can be processed through data cleaning, transforming, and modeling to ensure the accuracy and reliability of the analysis. In healthcare, data analytics can be utilized to track patient care and outcomes, the operational efficiency of hospitals and health systems, and various health research efforts, which can then inform business and care decisions.

Data architecture refers to the design, structure, and organization of data assets within an organization or system. It encompasses the models, policies, standards, and technologies that govern how data is collected, stored, processed, and utilized to meet the needs of an organization. Data architecture involves strategic decisions about data representation, integration, security, and accessibility. It establishes the framework for managing and governing data throughout its lifecycle, from creation or acquisition to archival or disposal. It provides a blueprint for how data flows and is transformed, enabling efficient decision-making, analytics, and insights generation. A well-designed data architecture promotes data consistency, accuracy, and reliability while supporting the evolving needs of an organization.

A database management system (DBMS) is software that enables the creation, organization, retrieval, modification, and administration of databases. It provides a structured and efficient approach to storing, managing, and accessing large volumes of data in a secure and scalable manner. It facilitates the definition of the database structure, including tables, relationships, and constraints, and allows users to interact with the data through a query language such as SQL. Crucial functions of a DBMS include data storage, retrieval, and manipulation. It provides data integrity and security mechanisms, ensuring only authorized users can access and modify the data.

Data capture refers to the process of collecting, recording, and acquiring data from various sources and inputting it into a computer system or database for further processing, storage, and/or analysis. It can involve manual data entry by individuals such as patient surveys, automated processes that extract data from electronic documents or systems such as electronic health records, or real-time data capture from sensors like remote health monitoring devices. Data capture is a crucial step in the data lifecycle as it serves as the foundation for subsequent data processing and analysis. Accurate and timely data capture ensures that relevant information is complete, reliable, and available for subsequent processing and analysis.

A data catalog is a centralized repository or database that provides an organized and searchable inventory of data assets within an organization. It describes and documents the available data sources, datasets, tables, files, and associated metadata. Data catalogs are valuable tools for data management, data governance, and self-service analytics within organizations.

Data cleaning, also known as data cleansing or data scrubbing, refers to the process of identifying, correcting, or removing errors, inconsistencies, and inaccuracies in a dataset. It involves transforming raw data into a clean, reliable, and consistent form for further analysis or use. Data cleaning aims to improve the quality and integrity of the data by addressing issues such as missing values, duplicate records, outliers, formatting errors, and inconsistent or incorrect values. The process typically involves several steps, including validating data, handling missing data, removing duplicate records, detecting outliers, standardizing and formatting, and correcting inconsistencies or errors. Clean data provides a solid foundation for accurate analysis, modeling, and decision-making in various fields, including business, research, and data science.
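
A small Python sketch using pandas shows a few of these steps; the column names and cleaning rules are hypothetical choices, not a required procedure:

    import pandas as pd

    df = pd.DataFrame({
        "patient_id": [1, 2, 2, 3, 4],
        "age": [34, None, None, 51, 230],   # a missing value and an implausible outlier
    })

    df = df.drop_duplicates(subset="patient_id")       # remove duplicate records
    df["age"] = df["age"].fillna(df["age"].median())    # handle missing values
    df = df[df["age"].between(0, 120)]                  # drop out-of-range values
    print(df)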

A data construct refers to a concept or idea that is not directly measurable but is inferred or derived from measurable variables. Data constructs are theoretical or abstract concepts that researchers aim to understand or explain through the analysis of observable data. For example, if a study aims to assess the abstract concept of patient satisfaction, patient responses to survey questions asking them to rank satisfaction on a scale from 1 to 10 can be used. Constructs provide a way to simplify complex ideas and make them quantifiable, allowing researchers to apply statistical techniques and analyses to draw conclusions and insights. Construct validity refers to the extent to which a construct accurately measures the intended theoretical concept, ensuring the credibility of research findings.

Data manipulation refers to the process of modifying, transforming, or reorganizing data to meet specific requirements or objectives. It involves performing operations on data to extract, filter, aggregate, calculate, merge, or reshape it in order to derive new insights, generate meaningful information, or prepare it for further analysis or presentation. Data manipulation can involve various tasks and techniques, including filtering and selection, sorting and ordering, aggregating and summarizing, joining and merging, transforming, and restructuring the data as necessary.

Data mapping refers to the process of establishing a relationship or connection between data elements in different data sources or systems. It involves identifying corresponding data fields or attributes between the source and target systems and defining how the data will be transformed and transferred during data integration or data migration processes. Data mapping focuses on the technical aspects of data transformation and alignment between different systems or data formats. It is a vital process that facilitates smooth data integration and migration, allowing data to flow seamlessly between different systems and assisting organizations in making well-informed decisions and deriving meaningful insights from their data.
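
As a simplified illustration, a Python sketch that maps source field names to the fields expected by a target system; the field names themselves are hypothetical:

    # Map field names in a source extract to the fields expected by the target system
    field_map = {"pt_name": "patient_name", "dob": "date_of_birth"}

    source_record = {"pt_name": "Jane Doe", "dob": "1985-04-12"}
    target_record = {field_map[key]: value for key, value in source_record.items()}
    print(target_record)   # {'patient_name': 'Jane Doe', 'date_of_birth': '1985-04-12'}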

Data modeling is the process of creating a conceptual, logical, or physical representation of data and its relationships within a specific domain or system. It involves designing a blueprint or schema that identifies entities, attributes, relationships, and constraints of and within the data. Data modeling provides a framework for organizing and documenting the data, enabling effective data management, integration, analysis, and visualization.

Data science is an interdisciplinary field that combines scientific methods, processes, algorithms, and systems to extract insights, knowledge, and actionable information from data. It involves collecting, organizing, analyzing, interpreting, and visualizing large and complex datasets to uncover patterns, trends, and correlations that can drive informed decision-making and solve problems across various domains. Data science incorporates techniques from statistics, mathematics, computer science, and domain expertise, and it spans the entire data lifecycle, including data acquisition, cleaning, integration, modeling, and visualization. Data scientists leverage programming languages like SQL and R and tools like machine learning to develop and deploy predictive models, statistical analyses, and data-driven solutions, building models and algorithms that make predictions, forecasts, and classifications based on historical data and patterns. By turning hidden patterns and relationships into evidence-based support for decision-making, data science guides strategic and operational choices and is crucial in developing data-driven solutions such as recommendation systems, fraud detection, risk assessment, personalized medicine, and process optimization.

Data scrubbing, also known as data cleansing or data cleaning, refers to the process of identifying and rectifying or removing errors, inconsistencies, and inaccuracies from a dataset. It involves detecting and correcting data anomalies, such as missing values, duplicates, formatting errors, and outliers, to improve data quality and integrity. For example, if two identical records exist for the same patient, data scrubbing would detect and remove the duplicate, leaving only one accurate entry.

Data validation is the process of verifying that data accuracy, integrity, and consistency meet specific quality standards. It involves checking for errors, anomalies, and discrepancies by applying predefined rules, business logic, and cross-field validations. This ensures data completeness, adherence to defined criteria, and consistency across multiple data elements. Data validation is crucial in maintaining reliable and trustworthy data, enabling informed decision-making and preventing errors caused by inaccurate or inconsistent data. By identifying and resolving issues, organizations can improve data accuracy, enhance data integrity, and ensure data quality and reliability throughout the data lifecycle.
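
For illustration, a minimal Python sketch using pandas that applies two hypothetical validation rules (MRN present, age within a plausible range):

    import pandas as pd

    records = pd.DataFrame({
        "mrn": ["A100", "A101", None],
        "age": [42, -5, 67],
    })

    # Predefined rules: MRN must be present, age must fall in a plausible range
    valid = records["mrn"].notna() & records["age"].between(0, 120)

    print(records[~valid])   # rows that fail validation and need review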

Data visualization is the representation of data through visual elements like graphs and charts. It transforms complex data into easily understandable visuals to reveal patterns and insights. Effective visualization involves choosing appropriate formats and designing visuals that highlight key information. It simplifies complex concepts and aids in informed decision-making across various fields. Ultimately, data visualization enhances data exploration, trend identification, and communication of findings.
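
A minimal Python sketch using matplotlib illustrates the idea; the monthly visit counts are made-up values:

    import matplotlib.pyplot as plt

    months = ["Jan", "Feb", "Mar", "Apr"]
    visits = [310, 295, 340, 372]

    plt.bar(months, visits)   # a simple bar chart of monthly visit counts
    plt.xlabel("Month")
    plt.ylabel("Clinic visits")
    plt.title("Visits per month")
    plt.show()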

A data warehouse is a specialized database that consolidates and stores data from multiple sources in a unified and structured manner. It integrates data from different systems, transforms it into a consistent format, and stores the data to provide a platform for data analysis and decision-making. A data warehouse often employs a dimensional modeling approach, which structures data around key business dimensions (e.g., time, location, product) and measures. It enables complex queries, data analysis, and reporting for business intelligence purposes, allowing users to slice and dice data along different dimensions.

A database schema refers to the logical structure or blueprint that defines the organization, relationships, and attributes of a database. It defines the tables, fields, data types, constraints, and other elements that determine how data is stored and organized within the database. The database schema provides a framework for creating, managing, and accessing data in a consistent and structured manner, ensuring data integrity and facilitating efficient data operations.

A database is a structured collection of data that is organized and stored in a systematic manner. It serves as a central repository for storing and managing information, making it easy to retrieve, update, and analyze data efficiently. Databases are commonly used in various applications and systems to store and organize data for efficient data management and retrieval.

A dataset is a structured collection of data that is organized and stored in a coherent manner. It typically consists of a group of related variables or observations. Datasets can take various forms, such as spreadsheets, databases, or files, and they can contain different types of data, including numerical, categorical, and textual information. A dataset can represent information about a specific topic, research study, business process, or any other subject of interest. It serves as the foundation for data analysis and allows researchers, analysts, and practitioners to explore, manipulate, and draw insights from the data. Datasets vary in size, from small and focused collections to large and complex ones, and they play a crucial role in informing decisions, uncovering patterns, and generating knowledge in various domains.

A delimiter is a character or sequence of characters used to separate or define boundaries between data elements within a text or data file. Delimiters are commonly used when working with structured data, such as CSV files or database records, to indicate where one field or value ends and the next one begins. For example, in a CSV file, a comma (“,”) is commonly used as the delimiter, hence the term “comma-separated values.” Other popular delimiters include tabs (“\t”), semicolons (“;”), colons (“:”), and pipe symbols (“|”). By identifying the delimiter, software can split the text or data into individual elements and assign them to their respective fields or variables. Delimiters enable the structured representation and organization of data, making it easier to process, manipulate, and extract meaningful information from the data.
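
A short Python sketch shows how a delimiter lets software split a record into fields; the pipe-delimited record is a hypothetical example:

    line = "A100|Jane Doe|1985-04-12"

    # Split the record on the pipe delimiter to recover the individual fields
    mrn, name, birth_date = line.split("|")
    print(mrn, name, birth_date)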

Dummy data, also known as synthetic data or simulated data, refers to artificially generated data that mimics real-world data but does not contain actual or sensitive information. It is used for various purposes, such as software testing, data modeling, and algorithm development. Dummy data is typically created to simulate realistic scenarios or patterns in data, allowing developers and analysts to work with representative datasets without compromising privacy or security. It may include randomly generated values, predefined patterns, or simulated distributions based on the characteristics of real data. While dummy data may not fully reflect the complexities or nuances of actual data, it serves as a useful substitute for scenarios where real-world data cannot be accessed or used.
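
For illustration, a short Python sketch that generates a few dummy patient records with the standard random module; the fields and value ranges are arbitrary:

    import random

    random.seed(42)   # make the synthetic data reproducible

    # Generate five dummy patient records containing no real or sensitive information
    dummy_patients = [
        {"patient_id": i, "age": random.randint(18, 90), "is_current": random.choice([0, 1])}
        for i in range(1, 6)
    ]
    print(dummy_patients)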

E

An extract refers to a subset of data taken from a larger dataset or source. This extracted subset typically contains specific rows, columns, or variables relevant to a particular analysis or task. Creating an extract involves selecting and copying the desired data while leaving out the rest. Extracts are commonly used to improve the efficiency of data processing and analysis. By working with a smaller subset of data, computational tasks such as querying, visualizing, and modeling can be performed more quickly, requiring fewer computing resources than working with the entire dataset. Extracts are particularly beneficial when dealing with large datasets that may be too extensive to handle efficiently in their entirety.

I

In data analysis, iterating refers to the process of repeating the steps of data exploration, manipulation, modeling, or interpretation to refine insights, improve accuracy, or develop a deeper understanding of the data. Iteration involves revisiting and adjusting analytical methods, models, or visualizations based on the outcomes obtained from previous iterations.
Iterative data analysis commonly involves the initial analysis, modeling, and identification of relationships within the data, validation of the models, refinement of methods and models, and repetition of the process to verify improvements. Documentation of changes, outcomes, and insights from each iteration is essential for reference and transparency. Overall, iterative data analysis acknowledges that refining analytical methods and models is often an ongoing process. As more insights are garnered and better understanding is cultivated, adjustments are made to enhance the accuracy and relevance of the analysis. This process helps ensure that data-driven conclusions are as reliable and informative as possible.

K

Key Performance Indicators (KPIs) are measurable metrics or parameters used to assess the performance and progress of organizations, activities, or processes. KPIs are selected based on specific goals, objectives, or priorities and are designed to clearly indicate success or achievement. KPIs are typically displayed and monitored on dashboards or performance scorecards, enabling stakeholders to easily track and evaluate progress over time.

L

Data loading is the process of moving, parsing, and storing data in a manner that is accessible to and compatible with the data storage platform or target destination. This may involve transforming the data and variables into a specific format, validating the data’s integrity, applying data quality checks, and mapping the data to the appropriate structure within the database. The data loading process can vary and may involve using specialized data integration tools.
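
As a simplified illustration, a Python sketch that loads a CSV file into a SQLite table using pandas; the file and table names (visits.csv, warehouse.db, visits) are hypothetical:

    import sqlite3
    import pandas as pd

    # Read the hypothetical staging file
    df = pd.read_csv("visits.csv")

    # Load the rows into the target table, appending to any existing data
    conn = sqlite3.connect("warehouse.db")
    df.to_sql("visits", conn, if_exists="append", index=False)
    conn.close()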

M

In data analysis and business intelligence, a measure refers to a quantitative value or metric used to evaluate or assess a specific aspect of data. Measures can be derived from raw data or calculated using formulas or aggregations based on the specific requirements of the analysis. They represent quantifiable data points that can be compared, aggregated, and analyzed to derive meaningful conclusions. Measures are often associated with key performance indicators (KPIs) and are used to track and analyze the performance, progress, or characteristics of a business, process, or system.

In data analytics, merging refers to the process of combining two or more datasets based on common attributes or variables. This operation is also known as “joining” in some database systems. Merging datasets involves aligning rows with matching values in specified columns and creating a unified dataset that incorporates information from all the sources. Different types of merges can be performed, such as inner joins (matching only common values), outer joins (including unmatched values), left joins (including all values from the left dataset and matching values from the right), and right joins (similar to left joins but reversing the datasets). The choice of merge type depends on the analytical goals and the relationships between the datasets being merged.
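
A short Python sketch using pandas illustrates an inner join and a left join; the patient and visit tables are hypothetical:

    import pandas as pd

    patients = pd.DataFrame({"patient_id": [1, 2, 3], "name": ["Ann", "Ben", "Cal"]})
    visits = pd.DataFrame({"patient_id": [2, 3, 4],
                           "visit_date": ["2024-01-05", "2024-02-10", "2024-03-01"]})

    # Inner join: only patient IDs present in both tables
    inner = pd.merge(patients, visits, on="patient_id", how="inner")

    # Left join: all patients, with visit details where a match exists
    left = pd.merge(patients, visits, on="patient_id", how="left")

    print(inner)
    print(left)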

Metadata refers to descriptive information that provides context and details about data. Essentially, it is data about data and serves as a means to understand, organize, and manage the underlying data. Metadata describes various characteristics of data, such as its structure, format, source, quality, and relationships.

Metrics are quantifiable measures that provide numerical insights into performance, progress, or characteristics. They are designed to capture specific information about a process, behavior, outcome, or performance indicator tracked over time, helping organizations understand status and identify areas for improvement. Effective metrics provide actionable insights that drive decision-making and guide improvements.

O

In data analytics, an observation refers to a single unit of data or a specific instance within a dataset. It represents a unique piece of information collected during a study, experiment, or data-gathering process. Each observation contains values or measurements for various variables, which are attributes or characteristics being studied. Observations collectively form the dataset analysts work with to derive insights, identify patterns, and draw conclusions through various analytical methods.

P

Parsing refers to the process of breaking down or analyzing a piece of data, often in a structured format, to extract specific information or attributes from it. Parsing involves interpreting data according to a set of rules or patterns to derive meaning and useful insights. It commonly involves identifying the structure or format of the data, dividing the data into distinct sections or components based on predefined patterns or delimiters, and extracting specific information from each segment. Parsing is crucial when dealing with unstructured or semi-structured data, such as text documents, log files, or HTML code.
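
For illustration, a minimal Python sketch that parses a hypothetical log line with a regular expression:

    import re

    log_line = "2024-03-01 14:22:05 WARNING Device battery low"

    # Break the line into date, time, level, and message components
    match = re.match(r"(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (\w+) (.+)", log_line)
    if match:
        date, time, level, message = match.groups()
        print(level, "-", message)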

Pivoting refers to the process of reorganizing and restructuring data to view it from a different perspective. It involves changing the arrangement of rows and columns in a dataset to create summaries, comparisons, or cross-tabulations that provide new insights or perspectives on the data. The term “pivot” is often used in spreadsheet software and data manipulation tools, such as Excel, which has Pivot Tables. The process typically involves selecting the variables to be used in the new arrangement, performing necessary calculations on the values from the dataset to be placed at the intersections of rows and columns, and generating a new table or visualization that presents the data in the desired pivot format.
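
A brief Python sketch using pandas illustrates a pivot; the clinic, month, and count columns are hypothetical:

    import pandas as pd

    visits = pd.DataFrame({
        "clinic": ["East", "East", "West", "West"],
        "month": ["Jan", "Feb", "Jan", "Feb"],
        "count": [120, 130, 90, 110],
    })

    # Pivot so clinics become rows, months become columns, and counts fill the cells
    pivot = visits.pivot_table(index="clinic", columns="month", values="count", aggfunc="sum")
    print(pivot)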

Q

Qualitative data, unlike quantitative data, refers to non-numeric information that describes qualities, characteristics, opinions, or attributes. This type of data provides a deeper understanding of experiences, behaviors, and perceptions. Examples of qualitative data include patient feedback, interview responses, and product reviews.

Quantitative data refers to information that is expressed in numerical values or quantities. It involves measurable and objective data that can be analyzed statistically. Examples of quantitative data include sales figures, revenue, age, and the number of patients.

A query is a request or command used to retrieve specific information from a database. It often lists criteria, conditions, and requirements for selecting and extracting data within a table or database. The query returns a subset of the data that matches those specified conditions, often presented in the form of a table or a result set. Queries allow users to interact with the vast amount of data stored in the database, enabling data exploration, reporting, analysis, and other operations necessary to garner information and insights.

R

Raw data refers to unprocessed, unstructured, or unformatted data that has been collected or obtained from various sources without any modifications or transformations. It is the initial, untouched form of data, typically in its most basic and unorganized state. It can come in different formats, such as text files, spreadsheets, sensor readings, or database exports. As the primary output of data collection activities, it may contain errors, inconsistencies, or missing values. It requires further processing and preparation to make it suitable for analysis and decision-making.

Reports are structured documents that present summarized, organized, and visualized data insights and findings. These documents are created to communicate the results of data analysis to various stakeholders, such as decision-makers, managers, or colleagues.
Reports typically include concise descriptions of the dataset, visual elements that show trends and comparisons found in the data, key insights and conclusions from the analysis, actionable recommendations, methodology, and limitations of the data or analysis, all tailored to the audience.

S

A sample refers to a subset of data selected from a larger population for analysis. Sampling is often used when gathering and analyzing data from an entire population is impractical or too time-consuming. A sample is chosen to represent the characteristics of the larger population while being smaller and more manageable. The sampling process involves selecting a representative portion of the data that ideally mirrors the diversity and distribution of the entire dataset. The goal is to ensure that any insights or conclusions drawn from the sample can be reasonably generalized to the whole population. Various sampling techniques, such as random sampling, stratified sampling, and cluster sampling, are employed to reduce bias and ensure the validity of analytical results.
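
As an illustration, a minimal Python sketch using pandas to draw a simple random sample; the population size and sample size are arbitrary:

    import pandas as pd

    population = pd.DataFrame({"patient_id": range(1, 1001)})

    # Draw a simple random sample of 50 records from the full population
    sample = population.sample(n=50, random_state=7)
    print(len(sample))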

SK stands for surrogate key. A surrogate key is a unique identifier that is used to identify a record in a database. It is often used when the natural key, or primary key, is not available for use. Examples of natural keys include a patient’s medical record number (MRN), name, or Social Security Number (SSN), which are not always made accessible in datasets due to privacy and confidentiality concerns. Surrogate keys are generated by the database system and have no inherent meaning but are assigned meaning by the data users. For example, a randomly generated number can be assigned to each patient in place of an MRN or name to allow users to analyze unique patient data without having access to sensitive information. Some healthcare organizations use an enterprise master patient index (EMPI) in datasets to allow analysis while providing the required privacy.
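
For illustration, a minimal Python sketch that assigns generated surrogate keys; in practice a DBMS typically generates these automatically, and the names here are made up:

    import uuid

    patients = ["Jane Doe", "John Roe"]

    # Assign each patient a generated surrogate key instead of exposing an MRN or SSN
    keyed = {str(uuid.uuid4()): name for name in patients}
    print(keyed)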

SQL, or structured query language, is a standardized programming language designed for managing and manipulating relational databases. It provides a set of commands and syntax rules that allow users to interact with databases, define and manipulate data, and perform various operations such as querying, updating, inserting, and deleting records. SQL is used to communicate with a database management system (DBMS) to store, retrieve, and manipulate data efficiently. It allows users to create and modify database schemas, define tables and relationships, and specify the structure and constraints of the data. Additionally, SQL provides powerful querying capabilities through its SELECT statement, allowing users to retrieve specific data from one or more tables based on specified conditions.
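
A short Python sketch using the built-in sqlite3 module illustrates a few SQL statements against a throwaway in-memory database; the table and values are hypothetical:

    import sqlite3

    conn = sqlite3.connect(":memory:")   # a temporary in-memory database
    cur = conn.cursor()

    # Define the table structure, insert records, then query them
    cur.execute("CREATE TABLE patients (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
    cur.executemany("INSERT INTO patients (name, age) VALUES (?, ?)",
                    [("Ann", 34), ("Ben", 67)])

    # A SELECT statement retrieves rows matching the specified condition
    for row in cur.execute("SELECT name, age FROM patients WHERE age > 50"):
        print(row)

    conn.close()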

A string variable is a type of variable used in programming that is specifically designed to store and manipulate textual data. It is used to represent sequences of characters, such as words, sentences, or even entire paragraphs.

Syntax refers to the rules that define the structure and format of a programming language or command. It determines how instructions are written and arranged in a valid and meaningful way, encompassing rules regarding the arrangement of keywords, punctuation, operators, variables, data types, and other elements within the programming language. Syntax rules serve as a guide to ensure that instructions are written in a consistent and unambiguous manner, making the code readable and understandable to both humans and machines. It forms the foundation for expressing logic and algorithms correctly and accurately in a programming language.

T

A table refers to a structured arrangement of data organized in rows and columns. It is a fundamental way of presenting and storing data in a relational database or a spreadsheet. Each row in a table represents a single record or observation, while each column corresponds to a specific attribute or variable.

Tabular data refers to structured data presented in rows and columns, resembling a spreadsheet. Each row corresponds to a specific record or observation, and each column represents a particular attribute or variable. The first row often contains headers or column names that describe the content of each column, making it easier to understand and interpret the data. Tabular data is commonly used to represent structured information in various domains, including databases, spreadsheets, and data files. It allows for easy organization, sorting, filtering, and analysis of data. Representing data in tabular form makes it more manageable and facilitates data processing, visualization, and analysis using various software tools and techniques.

Data transfer refers to the process of moving or transmitting data from one location, device, or system to another. In the digital world, data transfers happen all the time. Companies use data transfer to back up their files to remote servers, share large datasets with researchers around the world, and replicate data across multiple data centers to ensure access and availability. Various methods facilitate data transfer, including File Transfer Protocol (FTP) or Secure FTP for transferring files over the internet, network transfers within an organization’s internal systems, data replication for maintaining synchronized copies of data, and cloud-based data transfer services that enable seamless sharing and collaboration. It ensures that data is available where and when it is needed, facilitating collaboration, data sharing, and efficient data management.

V

A variable is a symbolic name or identifier that represents a value or a piece of data. It is a container that holds data and allows it to be manipulated and referenced within a program or software. Variables can store various types of data, such as numbers, text, Boolean values, or complex objects, and their values can change during the execution of the program. By using variables, programmers can store and manipulate data dynamically, enabling flexibility and adaptability in their code. In programming, there are various types of variables, depending on the kind of data they can hold. For example, a variable can hold whole numbers like “5” or decimal numbers like “3.14,” as well as text like someone’s name or a message.
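
A brief Python sketch illustrates variables of different types; the names and values are arbitrary:

    visit_count = 5            # an integer (whole number) variable
    average_wait = 3.14        # a floating-point (decimal) variable
    patient_name = "Jane Doe"  # a string (text) variable
    is_current = True          # a Boolean variable

    # A variable's value can change while the program runs
    visit_count = visit_count + 1
    print(patient_name, "has", visit_count, "visits")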