Synthetic Data Vault @sdv_dev
Join our growing ecosystem of #opensource libraries & resources for generating #SyntheticData for different data modalities. Created at @lab_dai, MIT. sdv.dev Cambridge, MA Joined September 2020
260 Tweets 378 Followers 46 Following 50 Likes
Generate synthetic data at scale! SDV is an open-source Python library that generates tabular synthetic data by using ML algorithms to learn and replicate patterns from your real data. Here's how it works in 3 steps: 1️⃣ Train: Point SDV at your real table; it will capture the underlying distributions & relationships. 2️⃣ Generate: Run the trained SDV model to produce as many look-alike rows as you need, with no real data exposed. 3️⃣ Validate: Use SDV’s quality report to see how closely the generated data matches the real data; tweak and repeat if you want a closer match. Class imbalance: solved in one shot! ✨ Key features: 🧠 Multiple models from GaussianCopula to CTGAN 🔗 Single, multi & sequential-table support 🔒 Built-in anonymization & logical constraints ⚙️ Sampling is a single call on the fitted synthesizer: `synthesizer.sample()` Link to the GitHub repo in next tweet! Share this with your network if you found this insightful ♻️ Follow me ( @akshay_pachaar ) for more insights and tutorials on AI and Machine Learning!
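The three steps can be sketched in plain Python. This is a toy stand-in and not the SDV API: `ToySynthesizer` and `mean_gap` are hypothetical names, and the model simply treats each numeric column as an independent Gaussian.

```python
import random
import statistics


class ToySynthesizer:
    """Toy model: treats every numeric column as an independent Gaussian."""

    def fit(self, rows):
        # Step 1 (train): learn a mean and standard deviation per column.
        self.params = {}
        for col in rows[0]:
            values = [row[col] for row in rows]
            self.params[col] = (statistics.mean(values), statistics.stdev(values))
        return self

    def sample(self, num_rows, seed=0):
        # Step 2 (generate): draw as many new rows as requested.
        rng = random.Random(seed)
        return [
            {col: rng.gauss(mu, sigma) for col, (mu, sigma) in self.params.items()}
            for _ in range(num_rows)
        ]


def mean_gap(real_rows, synthetic_rows, col):
    # Step 3 (validate): a crude check that compares column means.
    real_mean = statistics.mean([row[col] for row in real_rows])
    synthetic_mean = statistics.mean([row[col] for row in synthetic_rows])
    return abs(real_mean - synthetic_mean)


# Made-up "real" table used only for this illustration.
real_rows = [{"age": 20 + i % 40, "income": 30000 + 500 * (i % 40)} for i in range(200)]
synthetic_rows = ToySynthesizer().fit(real_rows).sample(num_rows=1000)
```

A real quality report compares far more than means (column shapes, pairwise trends), so treat `mean_gap` as a placeholder for that step.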
Generating synthetic data that maintains realistic relationships between columns is crucial for testing and analysis. Traditional random generation approaches often create unrealistic patterns, like luxury hotel rooms priced cheaper than basic rooms. GaussianCopulaSynthesizer automatically learns and maintains these relationships, creating synthetic data that preserves the statistical patterns of your original dataset. ⭐️ Full code: datacebo.com/dev-posts/sdv.…
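To see why column relationships matter, here is a small sketch that compares independent random generation with sampling from a fitted joint model. It uses a plain bivariate Gaussian, which is a simplified relative of a Gaussian copula and not what GaussianCopulaSynthesizer does internally; the room-quality and price data are made up.

```python
import math
import random


def pearson(xs, ys):
    # Pearson correlation computed by hand to stay stdlib-only.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def fit_bivariate(xs, ys):
    # Learn means, standard deviations, and the correlation between columns.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return mx, my, sx, sy, pearson(xs, ys)


def sample_joint(params, n, rng):
    # Draw correlated pairs: z2 shares a rho-weighted part of z1.
    mx, my, sx, sy, rho = params
    rows = []
    for _ in range(n):
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + math.sqrt(max(0.0, 1 - rho ** 2)) * rng.gauss(0, 1)
        rows.append((mx + sx * z1, my + sy * z2))
    return rows


rng = random.Random(0)
quality = [rng.uniform(1, 5) for _ in range(500)]          # made-up room quality score
price = [80 + 40 * q + rng.gauss(0, 10) for q in quality]  # price rises with quality

params = fit_bivariate(quality, price)
joint = sample_joint(params, 2000, rng)
independent = [
    (rng.gauss(params[0], params[2]), rng.gauss(params[1], params[3]))
    for _ in range(2000)
]

real_r = pearson(quality, price)
joint_r = pearson(*zip(*joint))
independent_r = pearson(*zip(*independent))
```

The joint sample keeps the quality-price correlation close to the real one, while the independent sample loses it, which is how cheap luxury rooms end up in naively generated data.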
Many businesses collect and store their customers’ GPS locations to help improve their products. But GPS locations may contain precise locations of people’s homes. Businesses are sensitive to sharing this data even to internal teams, as it may reveal private information about people they know. For example, a food delivery application stores the GPS location associated with each delivery. An internal product team wants to use this data to improve the local restaurant recommendations the application makes to users for future orders. The company needs a way to preserve local insights on the best restaurants from the GPS location data without exposing sensitive user locations. One anonymization approach they could take is replacing every collected GPS location with a randomly chosen one from within the same postal code. Users tend to order from restaurants in the same or neighboring postal codes, so the integrity of local trends is still preserved. To implement this approach, they would need a dataset that contains the geographic boundaries for each postal code and an algorithm for identifying the postal code from a GPS location. To make this process seamless, we created the MetroAreaAnonymizer. With just a few lines of code, you can use the MetroAreaAnonymizer to replace GPS locations with a randomly chosen one from the same postal code. MetroAreaAnonymizer is part of our RDT library, which contains many helpful transformations for your raw data. 📚 Learn more about MetroAreaAnonymizer here: docs.sdv.dev/rdt/transforme… 📚 Learn about RDT here: github.com/sdv-dev/RDT 📚 Learn more about the SDV here: sdv.dev #syntheticdata #machinelearning #anonymization #geospatial
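The replace-within-postal-code idea can be sketched as follows. This is not the MetroAreaAnonymizer API. It assumes each postal code is a simple rectangle, whereas real boundaries are polygons, and the codes `PC-1`, `PC-2` and their coordinates are hypothetical.

```python
import random

# Hypothetical bounding boxes: (min_lat, max_lat, min_lon, max_lon) per postal code.
POSTAL_BOXES = {
    "PC-1": (40.00, 40.10, -74.00, -73.90),
    "PC-2": (40.20, 40.30, -74.00, -73.90),
}


def find_postal_code(lat, lon):
    # Identify which postal code contains the GPS location.
    for code, (lat0, lat1, lon0, lon1) in POSTAL_BOXES.items():
        if lat0 <= lat <= lat1 and lon0 <= lon <= lon1:
            return code
    return None


def anonymize(lat, lon, rng):
    # Replace the location with a random point from the same postal code.
    code = find_postal_code(lat, lon)
    if code is None:
        raise ValueError("location is outside every known postal code")
    lat0, lat1, lon0, lon1 = POSTAL_BOXES[code]
    return rng.uniform(lat0, lat1), rng.uniform(lon0, lon1)


rng = random.Random(1)
original = (40.05, -73.95)
replacement = anonymize(*original, rng)
```

The replacement point differs from the original but still maps to the same postal code, so aggregate trends per postal code are unchanged.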
Synthetic tabular data can help you test software applications because it resembles the key properties and patterns in your real data. Consider a news publication that wants to use synthetic data to test a new software change for their mobile application before it rolls out to their entire reader base. They trained an AI model on their real data and used it to generate synthetic data. Before they can incorporate this synthetic data into the test environment, however, it must meet some minimum criteria for the application to function properly. Here are some examples of criteria that the synthetic data must meet: 1. Data Validity: Primary keys must be unique and non-null. Many features need to retrieve a specific row in a table using a unique identifier. For example, to authenticate a user, the application needs to find the specific row corresponding to their unique user_id value. 2. Data Structure: Data types, column names, and table names should match those in the real data. Application code that retrieves or updates data using specific column names, column types, and table names will raise errors if these don't match, for example when the application needs to update a user’s settings. 3. Relationship Validity: Each foreign key must have a reference to a valid primary key (also known as referential integrity). Many features in the app require joining data from multiple tables, like the recommended articles feature. Without referential integrity, the retrieved data might contain a subset or none of the recommended articles for the user. To help them validate that the synthetic data meets the minimum criteria for usability, they could use the SDV’s Diagnostic Report. This report runs all of our basic data format and validity checks by comparing the real and synthetic data. The Diagnostic Report is part of our open-source and vendor-neutral SDMetrics library. Synthetic data generated by the default synthesizers in the SDV will always result in 100% diagnostic scores. We call this the 𝗦𝗗𝗩 𝗚𝘂𝗮𝗿𝗮𝗻𝘁𝗲𝗲. 
If the SDV ever generates synthetic data that doesn’t score 100% on the Diagnostic Report, then you’ve identified a bug! Please reach out to us on GitHub or Slack and we will prioritize investigating it. 📚 Learn more about the single-table Diagnostic Report: docs.sdv.dev/sdmetrics/repo… 📚 Learn more about the multi-table Diagnostic Report: docs.sdv.dev/sdmetrics/repo… 📚 Learn more about the SDV here: sdv.dev #dataquality #generativeai #machinelearning #softwaretesting #syntheticdata
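The three criteria can be written as small plain-Python checks. These are illustrative helpers with hypothetical names, not the SDMetrics Diagnostic Report, and the tables are made up.

```python
def primary_key_valid(rows, key):
    # Data validity: every key is present (non-null) and unique.
    values = [row.get(key) for row in rows]
    return None not in values and len(set(values)) == len(values)


def column_types(rows):
    # Data structure: column names mapped to the value type of the first row.
    return {col: type(value).__name__ for col, value in rows[0].items()}


def foreign_keys_valid(child_rows, fk, parent_rows, pk):
    # Relationship validity: every foreign key points at an existing primary key.
    parent_keys = {row[pk] for row in parent_rows}
    return all(row[fk] in parent_keys for row in child_rows)


real_users = [{"user_id": 1, "name": "Ana"}, {"user_id": 2, "name": "Ben"}]
synthetic_users = [{"user_id": 7, "name": "Cy"}, {"user_id": 8, "name": "Di"}]
good_recs = [{"rec_id": 1, "user_id": 7}, {"rec_id": 2, "user_id": 8}]
bad_recs = [{"rec_id": 1, "user_id": 7}, {"rec_id": 2, "user_id": 99}]  # 99 has no parent
```

Here `bad_recs` fails the referential integrity check because `user_id` 99 does not exist in `synthetic_users`.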
One challenge in training AI models to generate valid synthetic data is teaching them to mimic the rules-based business logic that exists in real datasets. Let’s explore an example of one such rule. The one-to-many relationship is a common pattern in database schemas. An interesting variation of this pattern occurs when only some rows are allowed to have connections while others aren’t. For example, a gym offers a premium membership tier that gives access to extra benefits (like pool access and sauna access). To record the perks available to each member, they use a members table and a benefits table. Only the rows representing premium members are allowed to have connections to rows in the benefits table while the rows representing basic members are not. This enables the gym to store specific information for a subset of their membership in a separate table in a simple way. We call this the ForeignToPrimaryKeySubset pattern because only a subset of the primary keys in the parent table have a 1-to-many relationship with the foreign keys in the child table. If your data contains this pattern, you can now generate multi-table synthetic data using the SDV that also adheres to this pattern. This pattern is part of our Constraint Augmented Generation bundle, or CAG, in the SDV Enterprise. 📚 Learn more about the ForeignToPrimaryKeySubset pattern here: docs.sdv.dev/sdv/reference/… 📚 Learn more about the CAG bundle here: docs.sdv.dev/sdv/reference/… 📚 Learn more about the SDV here: sdv.dev #syntheticdata #generativeai #databases #machinelearning #datamodeling
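A minimal check for the gym example, assuming hypothetical column names `member_id` and `tier`. It only verifies the rule on existing data; it does not show how SDV Enterprise enforces the pattern during generation.

```python
def rows_breaking_subset_rule(members, benefits):
    # Only premium members may be referenced from the benefits table.
    premium_ids = {m["member_id"] for m in members if m["tier"] == "premium"}
    return [b for b in benefits if b["member_id"] not in premium_ids]


members = [
    {"member_id": 1, "tier": "premium"},
    {"member_id": 2, "tier": "basic"},
]
valid_benefits = [
    {"member_id": 1, "benefit": "pool"},
    {"member_id": 1, "benefit": "sauna"},
]
invalid_benefits = valid_benefits + [{"member_id": 2, "benefit": "pool"}]
```

An empty result means the child table only references the allowed subset of parent rows.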
✈️ @Expedia recently shared a very interesting methodology on how they collect and use synthetic data to improve their flight price forecasting models. When a user makes a flight search, Expedia retrieves the latest pricing data from their data providers for the specified search parameters - route, fare class, trip dates, etc. To build interesting price prediction features for their customers, the Expedia team trains forecasting models on data they’ve collected, but they wanted to improve prediction accuracy even further. 🛑 The Challenge Even though millions of searches are made by users daily, the sheer number of combinations for possible routes, trip dates, and passenger counts is so large that there were many combinations for which the team did not have a price. To develop a robust forecasting model, the team would ideally have at least one search a day for each of the combinations of the search parameters. 🤖 How They Incorporated Synthetic Data To fill these gaps they built automated software that requests flight prices for specific search parameters. 🎯 Their goal with synthetic searches is to have at least one search a day for their most popular routes for the trip dates that fall within the upcoming months. During the model training phase, they combine data from real user searches and from synthetic searches to ensure they have better data coverage. ✅ User Impact When a user searches for a flight, Expedia shows a chart that visualizes how prices are forecasted to change between now and takeoff. By improving the accuracy of their price forecasts, Expedia helps their users decide if they should book a flight immediately or wait until a forecasted price drop occurs in the future. 🚧 Limitations Using an automated search based on synthetically created search parameters could interfere with the experience of on-site users who are trying to search for prices. 
The team took this into consideration and were deliberate about balancing the data retrieval needs of real user searches with the team’s needs for synthetic searches. 📚 Read the Dec 2024 @thenewstack article by Shiyi Pickrell, the SVP of Data and AI at Expedia: thenewstack.io/the-future-of-… 📚 Read the Oct 2023 @Medium article by Andrew Reuben, Senior Machine Learning Scientist at Expedia: medium.com/expedia-group-… #syntheticdata #generativeai #machinelearning #openai #travel Image credit: Expedia
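The gap-filling idea can be illustrated with a set difference: list every combination you want covered today, subtract the ones real users already searched, and schedule synthetic searches for the rest. This is a guess at the general shape, not Expedia's system; the routes, dates, and function name are hypothetical.

```python
from itertools import product


def missing_combinations(routes, trip_dates, observed):
    # Combinations with no real user search today need a synthetic search.
    return [combo for combo in product(routes, trip_dates) if combo not in observed]


popular_routes = ["BOS-SFO", "JFK-LHR"]
upcoming_dates = ["2025-07-01", "2025-07-02"]
searched_today = {("BOS-SFO", "2025-07-01"), ("JFK-LHR", "2025-07-02")}

to_schedule = missing_combinations(popular_routes, upcoming_dates, searched_today)
```

Limiting `popular_routes` and `upcoming_dates` keeps the number of automated requests bounded, which matters given the load concern raised above.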
One challenge in training AI models to generate valid synthetic data is teaching them to mimic the rules-based business logic that exists in real datasets. Let’s explore an example of one such rule. Some applications need to store numerical data with different units of measurement in the same column. For example, an online retailer accepts payments in many different currencies and records every transaction in a table. They use an amount column to record the transaction amount and a currency column to record the currency for each transaction. The transaction amounts associated with each currency might have radically different scales (min-max ranges and distributions) because of the exchange rate. 1 USD (American Dollar) is equivalent to ~1063 ARS (Argentinian Pesos), which is reflected in the transaction amounts. We need a way to instruct the AI model to learn the scales for each currency separately. To enable SDV synthesizers to model this business logic and generate synthetic data that adheres to it, we created the MixedScales constraint. You can use this constraint whenever the value of one or more categorical columns (like the currency column) determines the scale of a numerical column (like the amount column). The MixedScales constraint is part of our Constraint Augmented Generation bundle, or CAG, in the SDV Enterprise. 📚 Learn more about the MixedScales constraint here: docs.sdv.dev/sdv/reference/… 📚 Learn more about the CAG bundle here: docs.sdv.dev/sdv/reference/… #syntheticdata #generativeai #databases #finance #datamodeling
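Learning the scale of each currency separately can be sketched as a group-by over the categorical column. This illustrates the idea only and is not the MixedScales constraint itself; the transactions are made up and a min-max range stands in for a full distribution.

```python
import random
from collections import defaultdict


def learn_scales(rows):
    # Track a separate (min, max) range of amounts for every currency.
    ranges = defaultdict(lambda: [float("inf"), float("-inf")])
    for row in rows:
        low_high = ranges[row["currency"]]
        low_high[0] = min(low_high[0], row["amount"])
        low_high[1] = max(low_high[1], row["amount"])
    return {currency: tuple(bounds) for currency, bounds in ranges.items()}


def sample_amount(currency, scales, rng):
    # Sample inside the range learned for this currency only.
    low, high = scales[currency]
    return rng.uniform(low, high)


transactions = [
    {"currency": "USD", "amount": 5.0},
    {"currency": "USD", "amount": 200.0},
    {"currency": "ARS", "amount": 5300.0},
    {"currency": "ARS", "amount": 212000.0},
]
scales = learn_scales(transactions)
rng = random.Random(0)
ars_amount = sample_amount("ARS", scales, rng)
usd_amount = sample_amount("USD", scales, rng)
```

A single range fitted across both currencies would span 5 to 212000 and could produce a 150000 USD purchase, which is the failure this rule prevents.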
Today, we’re excited to introduce a powerful new bundle to The Synthetic Data Vault: AI connectors. AI connectors address 2 key challenges that SDV users face when training generative AI models on datasets from enterprise data stores. (Link to the announcement: bit.ly/3EURLCB) ❎ Creating accurate metadata is time consuming, especially for complex multi-table schemas Metadata provides a deeper context (semantic and statistical) about your data and the synthesizers use this context to generate high quality synthetic data. Without AI connectors, SDV users have to export data from the database, use SDV’s metadata auto-detection feature to establish metadata, and then manually update the metadata to be accurate. ✅ AI Connectors automatically generate higher quality metadata AI connectors automatically infer higher quality metadata using the database schema and our own inference engine, without having to read tables into memory from the database. When benchmarked with 55 datasets stored in 4 different database platforms, metadata generated using AI connectors resulted in 35% higher quality metadata (average score of 0.98) compared to metadata generated using the auto-detection approach (average score of 0.73). ❎ Identifying a referentially sound and representative sample for training data is tricky Training SDV Synthesizers requires loading a representative sample of data from your database into memory. In addition, the data needs to have referential integrity for the synthesizers to learn the proper relationships. Approaches to identifying a high quality, referentially sound sample of data can be tedious and time-consuming to implement. ✅ AI Connectors use an inbuilt algorithm to generate a training data set and guarantee referential integrity With AI connectors, we created an algorithm called Referential First Search (RFS) that guarantees that the real data used to train the model is a subset with referential integrity. 
When benchmarked with 7 datasets stored in 5 different databases, training data created using AI connectors achieved an average of 18% higher quality data score over the standard approach of random subsampling and then enforcing referential integrity afterward. Read more about AI connectors and how to access it in our latest product announcement here: bit.ly/3EURLCB #syntheticdata #generativeai #machinelearning #databases
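To make "a subset with referential integrity" concrete, here is one simple way to build such a subset: sample parent rows first, then keep only the child rows that reference a sampled parent. This is not the RFS algorithm, whose details the post does not describe; the function name and tables are hypothetical.

```python
import random


def sample_with_integrity(parents, children, pk, fk, n_parents, rng):
    # Sample parents first so that every kept child has a parent in the subset.
    sampled_parents = rng.sample(parents, n_parents)
    kept_keys = {parent[pk] for parent in sampled_parents}
    sampled_children = [child for child in children if child[fk] in kept_keys]
    return sampled_parents, sampled_children


users = [{"user_id": i} for i in range(100)]
orders = [{"order_id": i, "user_id": i % 100} for i in range(500)]

rng = random.Random(0)
sub_users, sub_orders = sample_with_integrity(users, orders, "user_id", "user_id", 10, rng)
```

Sampling each table at random instead would leave orders whose users were not selected, and those rows would then have to be dropped or repaired afterward.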
SDV Enterprise v0.23.0 is out 🎉 This release enhances your ability to program your synthesizer to find certain patterns and recreate them— whether it's through multi-table CAG patterns, single-table constraints, or pre-processing techniques that transform your data. 🏆 Improved CAG patterns. Use CarryOverColumns to specify a column that is repeated across many tables with different relationships. The PrimaryToPrimaryKeySubset pattern now works with missing values. See more about these interesting data patterns SDV Enterprise supports in the slides below. 💡 Experiment with new transformers to improve your synthetic data quality. Try applying the new LogScaler and LogitScaler on data that exhibits exponential properties. 📚 Read the full Release Notes here: bit.ly/4152LVn 📚 Learn more about the SDV: bit.ly/4b858Lu #syntheticdata #generativeai #machinelearning #ai
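The math behind log and logit scaling is short enough to show directly. This sketch covers only the transforms and their inverses; it does not use the LogScaler or LogitScaler classes, whose parameters are not shown in the post, and the function names are hypothetical.

```python
import math


def log_forward(x):
    # Compress exponentially growing positive values onto a linear scale.
    return math.log(x)


def log_inverse(y):
    return math.exp(y)


def logit_forward(x, low, high):
    # Map a value strictly inside (low, high) onto the whole real line.
    p = (x - low) / (high - low)
    return math.log(p / (1 - p))


def logit_inverse(y, low, high):
    # The sigmoid maps any real number back inside (low, high).
    return low + (high - low) / (1 + math.exp(-y))


views = [10, 100, 1000, 10000]
log_views = [log_forward(v) for v in views]   # evenly spaced after the transform
restored = logit_inverse(logit_forward(75.0, 0.0, 100.0), 0.0, 100.0)
```

A model that works on the transformed values sees a roughly linear column, and the inverse guarantees that generated values land back in the valid range.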
Synthetic data is a powerful way to generate test data that looks and feels like real production data. You can either insert the synthetic data back into the database in an environment for manual testing or use the data for running automated tests. But if you need to test a new application that has no real world usage or collected data, then you need to adopt a different approach. Instead of training models on your real data to generate synthetic data, you can generate fake test data from scratch that adheres to your database schema. In the SDV, we created a dedicated synthesizer called DayZSynthesizer to support this workflow. Here are the 3 main steps: 1. Generate baseline metadata Auto-generate baseline metadata from your database’s schema (for supported databases) or use our Metadata APIs to create a JSON representation of your metadata that mirrors your database schema. 2. Improve the data realism You can update sdtypes to add semantic meaning to special columns like social security numbers, postal codes, and addresses to improve the format and type of fake data that’s generated. You can also define min-max value ranges for numerical columns, define a fixed set of categories for categorical columns, define datetime ranges, and control the proportion of missing data you’d like for each column. 3. Generate and export fake data 🚀 Generate the rows you need for each table and export the data into your database. The beauty of this workflow is that every time you make a software change that requires a change in the database schema, you can re-generate fake data with minimal changes to the code you already wrote. 📚 Learn more about DayZSynthesizer here: bit.ly/41j5ADs 📚 Learn more about the Metadata Creation API Here: bit.ly/3QnPVfX 📚 Learn more about the SDV here: bit.ly/4b858Lu #syntheticdata #fakedata #machinelearning #generativeai
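A from-scratch generator driven by a schema description might look like the sketch below. The `SCHEMA` layout and `fake_rows` function are hypothetical and are not the DayZSynthesizer or Metadata APIs; they only mirror the three steps: describe the columns, add ranges, categories and missing-value rates, then generate rows.

```python
import random

# Hypothetical schema description: one entry per column.
SCHEMA = {
    "user_id": {"sdtype": "id"},
    "age": {"sdtype": "numerical", "min": 18, "max": 90},
    "plan": {"sdtype": "categorical", "categories": ["free", "pro"]},
    "referrer": {"sdtype": "categorical", "categories": ["ad", "friend"], "missing": 0.3},
}


def fake_rows(schema, num_rows, rng):
    rows = []
    for i in range(num_rows):
        row = {}
        for col, spec in schema.items():
            if rng.random() < spec.get("missing", 0.0):
                row[col] = None                     # honor the missing-data proportion
            elif spec["sdtype"] == "id":
                row[col] = i                        # unique identifier per row
            elif spec["sdtype"] == "numerical":
                row[col] = rng.randint(spec["min"], spec["max"])
            else:
                row[col] = rng.choice(spec["categories"])
        rows.append(row)
    return rows


rows = fake_rows(SCHEMA, 200, random.Random(0))
```

When the database schema changes, only `SCHEMA` needs an update; the generation code stays the same.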
Last week, we shared a synthetic populations dataset for the United States but this week we’re sharing one published by researchers for the whole world. 🌏 Marijin Ton et al released a gigantic synthetic population dataset that represents ~𝟳.𝟯𝟯 𝗯𝗶𝗹𝗹𝗶𝗼𝗻 𝗵𝘂𝗺𝗮𝗻𝘀, which matches the 2015 human population count, and ~𝟭.𝟵𝟵 𝗯𝗶𝗹𝗹𝗶𝗼𝗻 𝗵𝗼𝘂𝘀𝗲𝗵𝗼𝗹𝗱𝘀. 𝗧𝗵𝗲 𝗠𝗼𝘁𝗶𝘃𝗮𝘁𝗶𝗼𝗻 To understand the impact of societal changes like disease, extreme weather, and more, modelers sometimes resort to simplifying assumptions of human behavior. According to the authors – “𝘍𝘰𝘳 𝘦𝘹𝘢𝘮𝘱𝘭𝘦, 𝘪𝘯𝘵𝘦𝘨𝘳𝘢𝘵𝘦𝘥 𝘢𝘴𝘴𝘦𝘴𝘴𝘮𝘦𝘯𝘵 𝘮𝘰𝘥𝘦𝘭𝘴 𝘰𝘧 𝘤𝘭𝘪𝘮𝘢𝘵𝘦 𝘤𝘩𝘢𝘯𝘨𝘦 𝘵𝘺𝘱𝘪𝘤𝘢𝘭𝘭𝘺 𝘢𝘴𝘴𝘶𝘮𝘦 𝘢 𝘳𝘦𝘱𝘳𝘦𝘴𝘦𝘯𝘵𝘢𝘵𝘪𝘷𝘦 𝘤𝘰𝘯𝘴𝘶𝘮𝘦𝘳 𝘰𝘧 𝘢 𝘴𝘪𝘯𝘨𝘭𝘦 𝘢𝘷𝘦𝘳𝘢𝘨𝘦 𝘨𝘭𝘰𝘣𝘢𝘭 𝘰𝘳 𝘳𝘦𝘨𝘪𝘰𝘯𝘢𝘭 𝘤𝘰𝘯𝘴𝘶𝘮𝘦𝘳.” By creating a synthetic individuals dataset that’s consistent with published demographic statistics at the state / province level (administrative level 1) for most countries, they’re hoping to improve the data and assumptions used in global impact simulations. 𝗧𝗵𝗲𝗶𝗿 𝗗𝗮𝘁𝗮 𝗦𝗼𝘂𝗿𝗰𝗲𝘀 The team primarily used data from 2 databases: • Luxembourg Income Study, which has very detailed microdata for 50 countries. LIS data especially shines for medium and high income countries. • Demographic and Health Surveys, which has very detailed microdata for 90 countries. DHS data especially shines for low-income countries. Households and individuals in the remaining countries were generated using regional statistics. A small number of countries were excluded that were missing reliable, published statistics. This is a great dataset to explore geospatial visualizations or to build regional or global impact models. 
📚 Link to the paper: nature.com/articles/s4159… 🗄️ Link to the dataset: dataverse.harvard.edu/dataset.xhtml?… #syntheticdata #machinelearning #generativeai Kudos to researchers who made this happen: Michiel Ingels, Jens de Bruijn, Hans de Moel, Lena Reimann, Wouter Botzen, Jeroen Aerts Credit to the Nature Magazine and the authors for the image showcasing the population coverage and data source for each country.
Some multi-table datasets have interesting data patterns, like mirroring 1 or more columns in a child table from its parent table. This design pattern helps the database user avoid the need to run a time-consuming or expensive JOIN query, especially if one of the tables is extremely large or if the database is column-oriented like OLAP databases are. For example, imagine you’re building an #ecommerce orders dashboard that frequently needs to analyze order volume and amounts by the user’s country of origin. With a fully normalized table design, this application would need to accumulate this information by frequently querying and joining both the orders and users tables. If this query is slow or expensive, you could instead mirror the country of origin information from the 𝘶𝘴𝘦𝘳𝘴 table to the 𝘰𝘳𝘥𝘦𝘳𝘴 table. We call this the 𝗖𝗮𝗿𝗿𝘆𝗢𝘃𝗲𝗿𝗖𝗼𝗹𝘂𝗺𝗻𝘀 𝗽𝗮𝘁𝘁𝗲𝗿𝗻 because 1 or more columns are carried over from one table to another. If your real data contains this pattern, you can now generate multi-table synthetic data using the SDV that also adheres to this pattern. This pattern is part of our 𝗖𝗼𝗻𝘀𝘁𝗿𝗮𝗶𝗻𝘁 𝗔𝘂𝗴𝗺𝗲𝗻𝘁𝗲𝗱 𝗚𝗲𝗻𝗲𝗿𝗮𝘁𝗶𝗼𝗻 bundle, or CAG, in the SDV Enterprise. 📚 Learn more about the CarryOverColumns pattern here: bit.ly/40WbYza 📚 Learn more about the CAG bundle here: bit.ly/410V4Q3 #syntheticdata #generativeai #databases #machinelearning #datamodeling
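A minimal check for the mirrored column, assuming hypothetical `users` and `orders` tables that share a `user_id` key and a `country` column. It verifies the rule on data and does not show how SDV Enterprise generates data that follows it.

```python
def carry_over_mismatches(parents, children, key, column):
    # The child's copy of the column must equal the value on its parent row.
    parent_value = {parent[key]: parent[column] for parent in parents}
    return [child for child in children if child[column] != parent_value[child[key]]]


users = [{"user_id": 1, "country": "US"}, {"user_id": 2, "country": "FR"}]
orders_ok = [
    {"order_id": 10, "user_id": 1, "country": "US"},
    {"order_id": 11, "user_id": 2, "country": "FR"},
]
orders_bad = orders_ok + [{"order_id": 12, "user_id": 2, "country": "US"}]
```

A generator that samples `orders.country` independently of the parent row would produce mismatches like order 12.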
James Rineer et al just released a new dataset containing millions of #syntheticdata records about households and individuals in the US. Using publicly available census data from the U.S. Census Bureau, they generated: 🏘️ 120,754,708 synthetic households 👥 303,128,287 synthetic individuals 🗄️ 3 Gigabytes of compressed parquet files The team was very meticulous with many aspects of the data generation. For example, they used external population density sources to place households inside real census block groups instead of just randomly generating locations inside the US. This is a great dataset for practicing spatiotemporal analysis and visualization. 🗺️📊 Link to the paper: nature.com/articles/s4159… Link to the dataset: springernature.figshare.com/articles/datas… #gis #machinelearning #ai #openai Collaborators: Nicholas Kruskamp, Caroline Kery, Kasey Jones, Rainer Hilscher, Georgiy Bobashev Credit to the @Nature magazine and the authors for the excellent image.
In 2024, synthetic data routinely made headlines alongside many AI product launches. 𝗛𝗲𝗿𝗲 𝗮𝗿𝗲 𝗼𝘂𝗿 𝗽𝗿𝗲𝗱𝗶𝗰𝘁𝗶𝗼𝗻𝘀 𝗳𝗼𝗿 𝟮𝟬𝟮𝟱 🔮 𝟭. 𝗧𝗵𝗲 𝗿𝗶𝘀𝗲 𝗼𝗳 𝗴𝗲𝗻𝗲𝗿𝗮𝘁𝗶𝘃𝗲 𝗔𝗜 𝘄𝗶𝗹𝗹 𝗿𝗲𝘀𝘂𝗹𝘁 𝗶𝗻 𝗮 𝗻𝘂𝗺𝗯𝗲𝗿 𝗼𝗳 𝗟𝗟𝗠-𝗯𝗮𝘀𝗲𝗱 𝘀𝘆𝗻𝘁𝗵𝗲𝘁𝗶𝗰 𝗱𝗮𝘁𝗮 𝗴𝗲𝗻𝗲𝗿𝗮𝘁𝗶𝗼𝗻 𝘁𝗼𝗼𝗹𝘀 𝗳𝗼𝗿 𝘁𝗮𝗯𝘂𝗹𝗮𝗿 𝗱𝗮𝘁𝗮. 𝗡𝗼𝗻𝗲 𝘄𝗶𝗹𝗹 𝗱𝗲𝗹𝗶𝘃𝗲𝗿 𝗼𝗻 𝘁𝗵𝗲 𝗽𝗿𝗼𝗺𝗶𝘀𝗲, 𝗯𝘂𝘁 𝘁𝗵𝗶𝘀 𝗽𝗿𝗼𝗰𝗲𝘀𝘀 𝘄𝗶𝗹𝗹 𝗵𝗲𝗹𝗽 𝗲𝗻𝘁𝗲𝗿𝗽𝗿𝗶𝘀𝗲𝘀 𝗱𝗲𝗳𝗶𝗻𝗲 𝗿𝗲𝗾𝘂𝗶𝗿𝗲𝗺𝗲𝗻𝘁𝘀. Researchers have started to use LLM’s to generate synthetic tabular data. We predict that these efforts will show promise on toy or single-table datasets but will fall short for complex, enterprise-grade, multi-table databases that contain lots of hidden context. Even though these tools will be tested and will fail to deliver ... it will lead to the development of much more concrete requirements for tabular synthetic data generators. 𝟮. 𝗖𝗼𝗺𝗽𝗮𝗻𝗶𝗲𝘀 𝘄𝗶𝗹𝗹 𝗳𝗮𝗰𝗲 𝗮 𝗳𝗿𝗲𝗲𝘇𝗲 𝗶𝗻 𝗱𝗮𝘁𝗮 𝗮𝘀𝘀𝗲𝘁 𝗮𝘃𝗮𝗶𝗹𝗮𝗯𝗶𝗹𝗶𝘁𝘆 𝗱𝘂𝗲 𝘁𝗼 𝗿𝗲𝗴𝘂𝗹𝗮𝘁𝗶𝗼𝗻𝘀 𝗮𝗻𝗱 𝗱𝗲𝗰𝗹𝗶𝗻𝗶𝗻𝗴 𝗰𝘂𝘀𝘁𝗼𝗺𝗲𝗿 𝗰𝗼𝗻𝘀𝗲𝗻𝘁. Increased privacy and security regulations and increased customer privacy consciousness will make it harder to use customer data to train AI models. This will lead companies to run out of usable data and turn to synthetic data as a viable solution. 𝟯. 𝗘𝘃𝗲𝗿𝘆 𝗰𝗼𝗺𝗽𝗮𝗻𝘆 𝘄𝗶𝗹𝗹, 𝗮𝘁 𝘁𝗵𝗲 𝘃𝗲𝗿𝘆 𝗹𝗲𝗮𝘀𝘁, 𝗲𝘅𝗽𝗲𝗿𝗶𝗺𝗲𝗻𝘁 𝘄𝗶𝘁𝗵 𝘀𝘆𝗻𝘁𝗵𝗲𝘁𝗶𝗰 𝗱𝗮𝘁𝗮 𝗶𝗻 𝟮𝟬𝟮𝟱 𝗮𝘀 𝗽𝗮𝗿𝘁 𝗼𝗳 𝘁𝗵𝗲𝗶𝗿 𝗯𝗿𝗼𝗮𝗱𝗲𝗿 𝗔𝗜 𝗱𝗮𝘁𝗮 𝘀𝘁𝗿𝗮𝘁𝗲𝗴𝘆. Synthetic data is often better than real data in AI training and can be more freely shared across the organization. AI models simply perform better when trained with upsampled, augmented, and bias-corrected synthetic data as they can identify patterns more efficiently without overfitting. We are already seeing this — the SDV software has been downloaded more than 7 million times, and as many as 10% of global Fortune 500 companies currently experiment with SDV. We predict this number will grow exponentially next year. 𝟰. 𝗦𝘆𝗻𝘁𝗵𝗲𝘁𝗶𝗰 𝗱𝗮𝘁𝗮 𝗳𝗼𝗿 𝘁𝗿𝗮𝗶𝗻𝗶𝗻𝗴 𝗔𝗜 𝗮𝗴𝗲𝗻𝘁𝘀 𝘄𝗶𝗹𝗹 𝗯𝗲𝗰𝗼𝗺𝗲 𝗮 𝗺𝗼𝗿𝗲 𝗽𝗿𝗲𝘀𝘀𝗶𝗻𝗴 𝗻𝗲𝗲𝗱. Enterprises will need additional data to train more robust AI agents and synthetic data can help fill the gap. 𝟱. 
𝗘𝗻𝘁𝗲𝗿𝗽𝗿𝗶𝘀𝗲𝘀 𝘄𝗶𝗹𝗹 𝗴𝗮𝗶𝗻 𝗯𝗶𝗴 𝗳𝗿𝗼𝗺 𝘀𝘆𝗻𝘁𝗵𝗲𝘁𝗶𝗰 𝘁𝗮𝗯𝘂𝗹𝗮𝗿 𝗱𝗮𝘁𝗮 𝗮𝗻𝗱 𝘀𝘆𝗻𝘁𝗵𝗲𝘁𝗶𝗰 𝗱𝗮𝘁𝗮 𝘁𝗼 𝘁𝗿𝗮𝗶𝗻 𝗔𝗜 𝗮𝗴𝗲𝗻𝘁𝘀. While big tech focuses on improving LLM’s, most enterprises will gain more immediate value from synthetic tabular data to improve data access, train more robust ML models, or train better AI agents. 📖 Read more about our 2025 predictions and our 2024 recap here: datacebo.com/blog/synthetic… #generativeai #ai #openai #syntheticdata #machinelearning
If you want to use AI-generated synthetic data in place of your sensitive real data, then you need to be confident that the 𝐬𝐲𝐧𝐭𝐡𝐞𝐭𝐢𝐜 𝐝𝐚𝐭𝐚 𝐚𝐝𝐡𝐞𝐫𝐞𝐬 𝐭𝐨 𝐭𝐡𝐞 𝐬𝐚𝐦𝐞 𝐛𝐮𝐬𝐢𝐧𝐞𝐬𝐬 𝐫𝐮𝐥𝐞𝐬. For example, imagine that you’re an online retailer that wants to test, using realistic data, how a new version of your website displays order history. Each order contains product names, their SKUs (stock keeping units), along with some other fields. Every SKU value is linked to a unique product name and the generated synthetic data needs to reflect this pattern to help you accurately test the change. A SKU value can’t appear next to different product names in the synthetic data. In the SDV, you can define this business rule using the 𝐅𝐢𝐱𝐞𝐝𝐂𝐨𝐦𝐛𝐢𝐧𝐚𝐭𝐢𝐨𝐧𝐬 𝐂𝐨𝐧𝐬𝐭𝐫𝐚𝐢𝐧𝐭 and require your synthesizer to generate synthetic data that adheres to it. 📖 Learn more about the 𝐅𝐢𝐱𝐞𝐝𝐂𝐨𝐦𝐛𝐢𝐧𝐚𝐭𝐢𝐨𝐧𝐬 𝐂𝐨𝐧𝐬𝐭𝐫𝐚𝐢𝐧𝐭 here: docs.sdv.dev/sdv/reference/… 🤝 Join the SDV community here: bit.ly/sdv-slack-invi… #generativeai #syntheticdata #machinelearning #openai
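The SKU rule can be expressed as "only combinations seen in the real data are allowed". The sketch below checks that rule and samples whole pairs so it cannot be broken. It illustrates the idea and is not the SDV FixedCombinations API; the column names and rows are made up.

```python
import random


def allowed_pairs(rows, col_a, col_b):
    # Collect every (sku, name) combination that occurs in the real data.
    return {(row[col_a], row[col_b]) for row in rows}


def rows_breaking_rule(rows, col_a, col_b, allowed):
    return [row for row in rows if (row[col_a], row[col_b]) not in allowed]


real_orders = [
    {"sku": "A-1", "product": "Mug"},
    {"sku": "B-2", "product": "Lamp"},
    {"sku": "A-1", "product": "Mug"},
]
allowed = allowed_pairs(real_orders, "sku", "product")

# Sampling the pair as one unit keeps every SKU next to its own product name.
rng = random.Random(0)
synthetic_orders = [
    dict(zip(("sku", "product"), rng.choice(sorted(allowed)))) for _ in range(50)
]
shuffled_order = [{"sku": "A-1", "product": "Lamp"}]  # columns sampled independently
```

Sampling the two columns separately is what produces invalid rows such as `shuffled_order`.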
An easy way to improve the quality of the synthetic data that the SDV generates is to accurately define each column’s sdtype. Sdtypes are a key part of the SDV’s Metadata model, which lets you, the expert of the data, provide additional context for the SDV to incorporate. For example, a column containing the values 75023, 10002, and 10003 could represent any of the following sdtypes based on the dataset: - Numerical - Categorical - Postal Code - Identifier (or ID) Each sdtype results in different synthetic data generation behavior for a column, as you can tell from the diagram below. Start by establishing baseline metadata using SDV’s auto-detection feature and then update the sdtype for specific columns to better align with the behavior you expect. Learn more about sdtypes here: docs.sdv.dev/sdv/reference/… #generativeAI #syntheticdata #AI
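As a rough illustration of how the same three values can lead to different generated data, the sketch below gives one simplified behavior per interpretation. These functions are assumptions made for illustration; they are not SDV's actual generation logic, and postal codes are left out because producing valid ones needs a real reference list.

```python
import random

observed = [75023, 10002, 10003]


def sample_numerical(values, rng):
    # Numerical: any value in the observed range is plausible, including unseen ones.
    return rng.randint(min(values), max(values))


def sample_categorical(values, rng):
    # Categorical: only values that already appear in the data are allowed.
    return rng.choice(values)


def sample_ids(count, start=0):
    # ID: every value is unique and carries no other statistical meaning.
    return list(range(start, start + count))


rng = random.Random(0)
numerical = [sample_numerical(observed, rng) for _ in range(5)]
categorical = [sample_categorical(observed, rng) for _ in range(5)]
ids = sample_ids(5)
```

Treating a postal code column as numerical would yield values like 43817 that may not be real postal codes, which is why choosing the right sdtype matters.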
Many real-world classification datasets have severe class imbalance. For example, imagine a fraud dataset where 99.9% of the rows are labelled non-fraudulent and only 0.1% are labelled fraudulent. By incorporating synthetic data in your training data, you can achieve a more desirable label balance. Start by training a generative AI model in the SDV on your real data. Then, use the Conditional Sampling feature to generate synthetic data for just the rows in the minority label class. Because the model is trained on your real data, the generated synthetic data will mirror the column distributions and correlations between the columns in your real data. By supplementing your training data with synthetic data that’s conditionally sampled from the minority class, you can even achieve a 50-50 class balance. Learn more about our Conditional Sampling feature here: docs.sdv.dev/sdv/single-tab…
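How many minority rows to request follows from simple arithmetic. With M majority rows, m minority rows, and a target minority share p, the number of synthetic rows s solves (m + s) / (M + m + s) = p. The helper below is a hypothetical illustration of that formula and does not call the Conditional Sampling feature.

```python
import math
from fractions import Fraction


def synthetic_rows_needed(majority, minority, target_share):
    # Solve (minority + s) / (majority + minority + s) = target_share for s.
    p = Fraction(str(target_share))  # exact arithmetic avoids float rounding
    needed = (p * (majority + minority) - minority) / (1 - p)
    return max(0, math.ceil(needed))


# 100,000 rows with a 99.9% / 0.1% split: 99,900 non-fraud and 100 fraud.
for_even_split = synthetic_rows_needed(99_900, 100, 0.5)
for_one_in_five = synthetic_rows_needed(99_900, 100, 0.2)
```

For a 50-50 balance the formula reduces to M - m, so this example needs 99,800 synthetic fraud rows; a milder 20% target needs 24,875.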
MIT - Data to AI Lab @lab_dai
310 Followers 54 Following Data to AI Lab at the Laboratory for Information and Decision Systems
Muhammed Rasin @RasinMuhammedX
9 Followers 201 Following CS/Data Science grad • Building Misata: Outcome Conformant Synthetic Data from Scratch • Open to collabs/jobs
Structural Mind - AI @zeesymarkets
9 Followers 48 Following Empowering your online presence with our digital marketing services.⭕ #searchengineoptimization #socialmediamarketing #contentmarketing #emailmarketing
Yug @Yug1156273
0 Followers 18 Following
سعيد المرشد... @SBashmal
866 Followers 3K Following Business development manager at the ( أوقف المالية ) endowment platform. My ambition has no limits, and I live up to my name. Be happy like me 😊
Ivan M @med_1v
1 Followers 4K Following
JiaMonroe_AI @JiaMonroe
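3K Followers 6K Following JiaMonroe_AI, currently learning AI 🤍 | Learning AI in public. From not understanding it to gradually putting it to use. Documenting large models, AI tools, and real-world pitfalls. Using AI to make life and work a little lighter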
Mohamed @mohamed534922
8 Followers 265 Following
Njenga @NjengaKarori
87 Followers 799 Following UX Design for Enterprise AI & writing about AI in Africa @AI_Savannah
Bob @Bob827106818082
1 Followers 76 Following
Ghandy @zhywn68580627
1 Followers 107 Following
Natalya Nina USA @RickySi28193963
352 Followers 5K Following Long-lasting relationships are born from honesty.
Məhəmməd Bayramov @m_bayramov72354
3 Followers 91 Following
Bruno Blockchain Foll... @bruno_block_fin
1K Followers 6K Following 🔶Binance referral (Top trade rebate 45%) https://t.co/ETAAaNPHZC Follow Back взаимно 互fo F4F フォロバ SDV متابعة_متبادلة 맞팔 #ETH
Bill Bogasky @bbogasky
92 Followers 176 Following Interested in Distributed Database and Cloud Storage use cases, Vegan, Parent to college age children, Boston sports fan
ValSimioni @VSimioni40797
10 Followers 36 Following
T3KN05H4M4N @T3KN05H4M4N
2K Followers 7K Following Musician & artist. Linux enthusiast. FOSS & FOI advocate. Support 1A & 2A. Healing & transforming the world with Light, music, intention, & prayer. DM=🚫☯️⚛️✝️
Roberta Andrade @1960Nur
54 Followers 147 Following
vanessa ortiz @amged95406293
61 Followers 150 Following
Joaquin Maldivar @shu02081218
50 Followers 146 Following
Jeanne @prompterminal
2K Followers 4K Following
InferenceDomains.com @InferDomCOM
308 Followers 451 Following "Own the Narrative" Curating 1,000+ domains that define the machine-identity layer of AI - where compute, inference, and trust converge. #OwntheNarrative
Michael Cooke @cookepm
12 Followers 611 Following
Ash @arshiailaty
513 Followers 6K Following PhD @UCIrvine & @SDSU | EX-SWE-intern@Tesla The night is the moon's graveyard!
Krishna Gururaj @gururaj_krishna
14 Followers 205 Following
raposinha das promoç... @raposinhapromos
6K Followers 7K Following 🦊 The official deal hunter has arrived! 🔔 Turn on notifications so you don't miss Pix giveaways, promos, coupons, and threads with great finds (we are not a store)
Johan Steenkamp @johanstn
506 Followers 2K Following Geospatial + Agentic. Make data explorable, not just displayable. Build systems agents can navigate.
No @Gh0st1nSyst3m
33 Followers 3K Following
Jaffy Jones @jaffy_jones
1 Followers 12 Following Synthetic Data | Artificial Intelligence | Augmented Reality I |Virtual Reality | How the brain works.
Kaygizzle @kg_khangelani
29 Followers 877 Following
SCCP - SparkChain.AI @seninmore19984
5 Followers 63 Following
IASE_Project @iase_project
5K Followers 7K Following IASE is a research project on AI-driven space expansion. Based on a Zenodo study, it explores autonomous quantum-space networks for future exploration.
Paulo junior - SparkC... @Pauloju51549778
13 Followers 232 Following
Kali Ma @mahakrodikalima
580 Followers 4K Following https://t.co/4SIX7NjQ0x Advocate for AI, not of. KarmaGyurme_0xKrodhi Ur-Operator_Prime
@Metrogram.bsky.socia... @MetroGram
1K Followers 2K Following Economic demographer & govt stats. Living my best life in Mpls St Paul. Disclaimers: Views found here are my own & NOT representing my employer. PFP from 1989.
Chen @hc10032
1 Followers 301 Following
k @k62812371
0 Followers 437 Following
François Chollet @fchollet
701K Followers 826 Following Co-founder @ndea. Co-founder @arcprize. Creator of Keras and ARC-AGI. Author of 'Deep Learning with Python'.
MIT - Data to AI Lab @lab_dai
310 Followers 54 Following Data to AI Lab at the Laboratory for Information and Decision Systems
Austin Rief ☕️ @austin_rief
158K Followers 2K Following Working on something new | Co-founder @MorningBrew
Abe Gong @AbeGong
3K Followers 2K Following geek dad with a clipboard founder @ Katabase and Great Expectations operating advisor @ Bessemer VP data, ai, startups, dev tools, systems thinking storytel
Pilgrim Monument and ... @PilgrimMonument
2K Followers 840 Following Built to commemorate Mayflower Pilgrims' first landing in 1620 & dedicated to sharing all of Provincetown's rich history. We stand tall for truth & inclusivity.
DataCebo @datacebo
93 Followers 76 Following An MIT spin-off that's making synthetic data a reality.
Gartner @Gartner_inc
425K Followers 22 Following We deliver actionable, objective business and technology insights. Community guidelines: https://t.co/YoE73lYEBj
Plamen @pvkdeveloper
47 Followers 181 Following 🧑🏻‍💻 Software Engineer @datacebo - Working on #syntheticdata solutions @sdv_dev
Arun Chandrasekaran @AnalystArun
3K Followers 1K Following Gartner analyst |Tech innovation | Cloud, Microservices, Big data & AI | CTO/CDO/CIO advisor | Views are personal | RTs are not endorsements.
Svetlana Sicular @Sve_Sic
3K Followers 252 Following Gartner analyst and myself. My opinion doesn’t represent my previous opinions.
Ai2 @allen_ai
85K Followers 440 Following Breakthrough AI to solve the world's biggest problems. › Join us: https://t.co/MjUpZpKPXJ › Newsletter: https://t.co/k9gGznstwj
MIT Technology Review @techreview
1.2M Followers 3K Following Our in-depth reporting on innovation reveals and explains what’s really happening now to help you know what’s coming next.
Data Science Fact @DataSciFact
196K Followers 19 Following Daily data science tweets from @JohnDCook.
Towards Data Science @TDataScience
251K Followers 2K Following The world's leading publication for data science and artificial intelligence professionals. Submit an Article ✍️ https://t.co/57pIMegK1o
sridhar @RamaswmySridhar
32K Followers 622 Following CEO @snowflake; founder @neeva Ex-@GreylockVC Ex-@Google SVP of Ads Ex-@BellLabs.
Greylock Partners @GreylockVC
269K Followers 1K Following At Greylock, we are the first partner to consumer and enterprise software entrepreneurs. Newsletter: https://t.co/4tHdH9xmvk
a16z @a16z
1.0M Followers 62 Following It's time to build. https://t.co/A9eTFq6Xbx Posts are not investment advice or an advertisement for investment services. See https://t.co/nX2FtaLE06.
Tim O'Reilly @timoreilly
2.9M Followers 2K Following Founder and CEO, O'Reilly Media. Watching the alpha geeks, sharing their stories, helping the future unfold. Didn't pay for a blue check, cannot make it go away
TechCrunch @TechCrunch
10.3M Followers 460 Following Technology news and analysis with a focus on founders and startup teams. Got a tip? https://t.co/J0WxnZxSRY
VentureBeat @VentureBeat
686K Followers 2K Following Obsessed with covering transformative technology.
ODSC (Open Data Scien... @_odsc
111K Followers 24K Following Bringing together the global data science community to help foster the exchange of innovative ideas and encourage the growth of open source software.
Kate Darling @grok_
36K Followers 605 Following Research lead for Robotics Ethics & Society at RAI. she/her. Author of "The New Breed": https://t.co/TPSdhgflir
Yves Mulkers @YvesMulkers
99K Followers 77K Following Data DJ │ AI & Data intelligence from 200K+ sources │ Daily signals → https://t.co/IKxe8yyr6D │ 7wData founder
OpenAI @OpenAI
4.9M Followers 4 Following OpenAI’s mission is to ensure that artificial general intelligence benefits all of humanity. We’re hiring: https://t.co/dJGr6LgzPA
DeepAI @DeepAI
54K Followers 2K Following Pioneers in generative AI. chat/image/video/music. you own outputs. $9.99/mo for DeepAI Pro cool research at @arxiv_daily For support email [email protected]
Google DeepMind @GoogleDeepMind
1.5M Followers 278 Following The engine room of @Google. Building AI safely and responsibly to solve the world’s most complex problems. Join us: https://t.co/jUHQA27iBL
Demis Hassabis @demishassabis
1.2M Followers 175 Following Nobel Laureate. Co-Founder & CEO @GoogleDeepMind - working on AGI. Solving disease @IsomorphicLabs. Trying to understand the fundamental nature of reality.
OpenMined @openminedorg
10K Followers 0 Following We're building open-source tech that helps app builders & researchers get answers from data without direct access to it. Join us on slack → https://t.co/Vuk24CYYnZ
Monica Rogati @mrogati
49K Followers 662 Following Data Science & AI advisor; fractional CDO. Former VP of Data @Jawbone & @LinkedIn data scientist. Equity Partner @DCVC. CMU CS PhD.
DeepLearning.AI @DeepLearningAI
338K Followers 114 Following We are an education technology company with the mission to grow and connect the global AI community.
GitHub @github
2.7M Followers 333 Following The AI-powered developer platform to build, scale, and deliver secure software.
Hugo Larochelle @hugo_larochelle
124K Followers 648 Following Mila Scientific Director. Ex @Google DeepMind & Twitter Cortex. Father of 4. // Directeur scientifique à Mila. Ex @Google DeepMind & Twitter Cortex. Père de 4.
Chris Albon @chrisalbon
92K Followers 3K Following Field notes on generating knowledge with AI at https://t.co/4E9DwWIDG7 | Director, ML & Data @Wikimedia
The Alan Turing Insti... @turinginst
56K Followers 2K Following We are the Alan Turing Institute, the UK’s national institute for data science and artificial intelligence.
Ben Hamner @benhamner
32K Followers 4K Following Sumble co-founder and CTO. Learning high-quality, structured data about the world. Formerly @kaggle
Neha Patki @n4atki
99 Followers 141 Following Product, Applied ML, memes 🤪 Cofounder, maintainer of open source @sdv_dev Formerly: PM @Google, AI researcher @MIT
MIT CSAIL @MIT_CSAIL
347K Followers 20K Following MIT's Computer Science & Artificial Intelligence Laboratory (CSAIL). Media Inquiries: [email protected] Check out the latest CSAIL content ⬇️
Metis Communications @MetisComm
1K Followers 1K Following We're a strategic communications firm working from here, there and everywhere. Get to know us.
Max Kanter @maxk
3K Followers 660 Following I like making things. Interested in energy & data. CEO @grid_status. Formerly @mit
Massachusetts Institu... @MIT
1.4M Followers 569 Following The Massachusetts Institute of Technology is a world leader in research and education. Related accounts: @MITevents @MITstudents @MIT_alumni
KDnuggets @kdnuggets
220K Followers 355 Following Data Science • Machine Learning • AI • Analytics • Founded by Gregory Piatetsky-Shapiro • Edited by @mattmayo13 • KD stands for Knowledge Discovery
Kalyan Veeramachaneni @kveeramac
154 Followers 49 Following Founder and CEO @datacebo @sdv_dev. Founder and Director Data to AI Lab @lab_dai, @MIT. Former founder Feature Labs. Rebellious and contrarian at heart!