Forum

Notifications
Clear all

Trusted Data

23 Posts
2 Users
0 Reactions
290 Views
(@marsil)
Posts: 72
Estimable Member
Topic starter
 

Hello,

I have request and I don't know if Macros can do it or not. please I need your help.

I extracted data from many website, please check below example.

Company HQ_Z HQ_Craft HQ_Hoovers HQ_BlombergPublic HQ_[World Of Manufacturers] HQ_TheWallstreet
Autoliv, Inc. Sweden USA China Sweden USA USA

 As you see in table many countries for same company, and I want to get Trusted country via compare all extracted countries and determine which one come from many sources.

In above example, we have six values as below

Three sources say USA

Two sources say Sweden

One Source say China

so in this case, the Trusted value is "USA", as it is have the largest count of sources.

Please check attached file to get other samples.

The input will be company name and other columns contain many values and total of sources.

Out put: Three columns as below.

1- The big Matching Count   >>> the count for big sources which have same values.

2- Matching Percentage  >> the ratio between the big count for sources which have same values / total of sources have values.

3- Trusted Data >>> value which come from the big count of sources.

Thank you very much; your support is greatly appreciated.

 

Thanks;

Marsil

 
Posted : 20/04/2021 7:04 pm
(@catalinb)
Posts: 1937
Member Admin
 

Macros can do almost anything, but why just Macro?
Power Query can do it much easier than coding, see an example attached.

 
Posted : 22/04/2021 12:15 am
(@marsil)
Posts: 72
Estimable Member
Topic starter
 

Hi Catalin,

You are amazing! 

Wow! This is awesome and it works like a charm. Thank you so much!!!

Yes it is working correctly without any issue. I just have below question, please can you check.

Some sources show different Country for same company so I extract data from DB using SQL by below structure

ZID Company HQ_Z HQ_Craft HQ_Hoovers HQ_BlombergPublic HQ_[World Of Manufacturers] HQ_TheWallstreet HQ_Finviz HQ_Alacrastore HQ_OldCrunchbase HQ_Investing Total of sources
1000451 Ashland Inc. United States Covington, KY, US India United States   United States USA Netherlands   United States 8
1000451 Ashland Inc. United States Covington, KY, US India United States   United States USA United States   United States 8
1000451 Ashland Inc. United States Covington, KY, US United Kingdom United States   United States USA Netherlands   United States 8
1000451 Ashland Inc. United States Covington, KY, US United Kingdom United States   United States USA United States   United States 8

The result using Power Query as below.

1000451 Ashland Inc. 8 18 United States

So I changed the IDs to make each row has it is own ID and I got good result as I want the Trusted Data per row then I will get  DISTINCT for "Company & Country" then I will take Country which have the big count of sources as below.  Do you think is there any issue will happen if I do this?

1 Ashland Inc. 8 4 United States
2 Ashland Inc. 8 5 United States
3 Ashland Inc. 8 4 United States
4 Ashland Inc. 8 5 United States

How kind you are to help me.

Thanks;

Marsil

 
Posted : 22/04/2021 8:05 pm
(@catalinb)
Posts: 1937
Member Admin
 

Indeed, if the ID's are not unique there will be an issue, there is a grouping step that will group all rows with the same ID.
If you make the ID's unique, there will be no problem

 
Posted : 22/04/2021 11:38 pm
(@marsil)
Posts: 72
Estimable Member
Topic starter
 

Hi Catalin,

Great so I will make the ID's unique.

Last question please.

I will use this file for other types of data like I will compare Found Year, Phone, ....etc.

So I face as issue to determine the count of sources which I past it in "M column" as I use this function =COUNTIFS(R3:AD3,"*",R3:AD3,"<> ")

as it is count the characters not numbers, so can you help me to find other function work with characters and numbers please.

Or If there option on Power Query to count the sources and fill M Column ?

Thanks for helping me accomplish my goal.

Thanks;

Marsil

 
Posted : 23/04/2021 3:39 am
(@catalinb)
Posts: 1937
Member Admin
 

Between Added Custom 1 and Added Custom2 steps, add a new column with this formula:

=List.Sum([Custom][Count])

The full step should look like:
= Table.AddColumn(#"Added Custom1", "Total Count", each List.Sum([Custom][Count]))

 
Posted : 23/04/2021 4:28 am
(@marsil)
Posts: 72
Estimable Member
Topic starter
 

That's a huge help - thanks!

I added above formula and it is running without any issue.

I did other thing and I want to know your comment please.

I removed Changed Type step due sometimes I will want to remove column from sources or change it's names as the sources not fixed as may be I got from other sources like Amazon, yahoo finance, ...etc.

Also may be I will add more columns for sources.

I tested the result and it's okay, but I want your help to review this update in attached file to make sure there is no issue on any step in code as it is first time to make edit in Power Query.

Thanks is not enough, but really don't know what else say to you. Anyway have to say thanks again.

Thanks;

Marsil

 
Posted : 23/04/2021 6:23 am
(@catalinb)
Posts: 1937
Member Admin
 

You're welcome.

If works, there is nothing to review.

Code is already using Unpivot Other Columns in order to handle any new possible columns, if the source has ZID and Company.

The only issue can come from source data, if it contains columns other than HQ columns, those will be counted as HQ.

 
Posted : 23/04/2021 6:39 am
(@marsil)
Posts: 72
Estimable Member
Topic starter
 

I think above issue won't happen as I removed Changed Type step to make the file don't depend on columns names.

Example I changed the columns names as below and it is working without any issue.

ZID Company A B C D E F
1000512 Autohome Inc. Netherlands Netherlands Netherlands Netherlands Netherlands Netherlands

Are u intent this or other thing? as I will depend on this file on urgent project.

Thanks;

Marsil

 
Posted : 23/04/2021 7:25 am
(@catalinb)
Posts: 1937
Member Admin
 

I didn't say that you will have different HQ column names, what I said is that you should not have other column than HQ columns, regardless of the HQ columns names.

For example, if besides HQ column you have another new column like "Company Manager", or "Count of Floating Ducks", the query will not return the correct counts.

 
Posted : 23/04/2021 9:03 am
(@marsil)
Posts: 72
Estimable Member
Topic starter
 

Ofcourse, when I work on this, I will work per feature foe example after I finish HQ and get the final result, I will start with another features like Phone.

I won't work in different values on the same time.

below table will cause issue.

ZID Company Phone_Google HQ_Yahoo email_Craft
1000512 Autohome Inc. 2938373 China [email protected]

But below one is correct

ZID Company email_Google email_Yahoo email_Craft
1000512 Autohome Inc. [email protected] [email protected] [email protected]

 

Thanks again. Couldn't have done it without you.

 

Tahnsk;

Marsil

 
Posted : 23/04/2021 12:36 pm
(@catalinb)
Posts: 1937
Member Admin
 

Ok, what needs to be done with the tables you mentioned? Will have the same Trusted Data process with same counts?

Or there is a different process for emails?

 
Posted : 23/04/2021 12:51 pm
(@marsil)
Posts: 72
Estimable Member
Topic starter
 

First,  I wanted to send my thanks for all your assistance.

Here all details about what I will do, please check.

I build profile for many companies contain many features like Company Headquarter, Phone, Address, Founded Year, Stock Exchange Name, Email, EmployeesCount, Industry, FiscalYear, Logo and MarketCap

And I do this via extract these data from many sources like The Wall street, Blomberg, …etc. and will verify the accuracy by comparing the data which I extracted it.

So the file which you sent me will help on comparison to get the Trusted Data, I think all of features can use the same Trusted Data process.

But I think there are two feature won’t be done with Trusted Data process, please check below.

  • As not all sources write the address with same format.

Ex: CompanyName: News Corp

The Wall street_Address: 1211 Avenue of the Americas New York New York 10036 United States

Alacrastore_Addres: 1211 Avenue of the Americas New York, NY 10036 United States

Marketscreener_Address: 1211 Avenue of the Americas New York, NY 10036

Owler_Address: 1211 Avenue of the Americas New York, New York10036

  • Logo, I got the image URLs and I downloaded it.

Please can you help how I can do this as I try to get any idea.

Thank you very much; your support is greatly appreciated.

Thanks;

Marsil

 
Posted : 24/04/2021 9:55 am
(@catalinb)
Posts: 1937
Member Admin
 

With so unpredictable address variations, it's hard to count them properly with a decent precision unfortunately.

 
Posted : 25/04/2021 12:11 am
(@catalinb)
Posts: 1937
Member Admin
 

You might be able to do something by using fuzzy matching, but it's not very reliable.

In attached example, I had to reduce the similarity threshold to 60%, that can bring wrong results.

 
Posted : 25/04/2021 12:41 am
Page 1 / 2
Share: