4
$\begingroup$

I am working with vehicle registration data from website . The website provides counts for various combinations of vehicle attributes such as Maker, RTO, Fuel, Category, SubCategory, and Emission.

Since directly scraping all combinations (Cartesian product) for each state would result in ~200K combinations per state and may get my IP blocked, I opted to download aggregated pairwise counts manually for each state. I stored them in 15 Excel sheets:

Sheet Name Columns Fuel_Emission Fuel, Emission, Count Maker_RTO Maker, RTO, Count Maker_Fuel Maker, Fuel, Count Maker_Emission Maker, Emission, Count Maker_Category Maker, Category, Count Maker_SubCategory Maker, SubCategory, Count RTO_Emission RTO, Emission, Count RTO_Fuel RTO, Fuel, Count RTO_Category RTO, Category, Count RTO_SubCategory RTO, SubCategory, Count Category_Emission Category, Emission, Count Fuel_Category Fuel, Category, Count Fuel_SubCategory Fuel, SubCategory, Count Category_SubCategory Category, SubCategory, Count SubCategory_Emission SubCategory, Emission, Count

I imported all 15 sheets into SQL Server as separate tables and merged them step by step.

For each combination of Maker, RTO, Fuel, Category, SubCategory, Emission, I mapped all relevant counts from the 15 tables. This gave me a dataset with columns like:

Maker, RTO, Fuel, Category, SubCategory, Emission, MR_Count, MF_Count, ME_Count, MC_Count, MS_Count, RE_Count, RF_Count, RC_Count, RS_Count, FE_Count, FC_Count, FS_Count, CE_Count, CS_Count, SE_Count

Column Name Description MR_Count Count of vehicles for the combination of Maker + RTO MF_Count Count of vehicles for the combination of Maker + Fuel ME_Count Count of vehicles for the combination of Maker + Emission MC_Count Count of vehicles for the combination of Maker + Category MS_Count Count of vehicles for the combination of Maker + SubCategory RE_Count Count of vehicles for the combination of RTO + Emission RF_Count Count of vehicles for the combination of RTO + Fuel RC_Count Count of vehicles for the combination of RTO + Category RS_Count Count of vehicles for the combination of RTO + SubCategory FE_Count Count of vehicles for the combination of Fuel + Emission FC_Count Count of vehicles for the combination of Fuel + Category FS_Count Count of vehicles for the combination of Fuel + SubCategory CE_Count Count of vehicles for the combination of Category + Emission CS_Count Count of vehicles for the combination of Category + SubCategory SE_Count Count of vehicles for the combination of SubCategory + Emission

I now want to calculate a single Final_Count per combination that is consistent with all 15 pairwise counts. Accuracy is very important (≥95%) and memory efficiency is required, since I want the result to match the website closely.

What is the best data science algorithm to compute Final_Count from these 15 pairwise counts?

I’ve heard Iterative Proportional Fitting (IPF) can estimate multi-dimensional contingency tables. Would this be appropriate? Any advice on optimizing for accuracy and convergence?

Here I am attaching sample Input for your reference..

enter image description here

$\endgroup$
2
  • $\begingroup$ Could you please anyone help with this requirement. It's urgent requirement from client side. Please suggest any models with data science approach? $\endgroup$ Commented Oct 11 at 13:21
  • $\begingroup$ A few clarifications would help provide an answer: (1) Should the Final_Count satisfy all marginal constraints exactly when summed? (2) Are all pairwise tables complete, or might some pairs be missing? (3) Is there a hierarchical relationship between Category and SubCategory? (4) Roughly how many unique values exist for each dimension (Maker, RTO, Fuel, etc.)? (5) Do you have any ground truth full combinations to validate accuracy? Understanding these will help recommend the optimal reconstruction algorithm. $\endgroup$ Commented Oct 12 at 19:25

0

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.