
Resource Usage and Performance - .xls/.xlsx versus .csv/.txt

16 Posts
3 Users
0 Reactions
154 Views
(@chadkukorola)
Posts: 10
Active Member
Topic starter
 

Hello,

This is a technical question, as I have seen the differences, but wanted to see if someone knew the "why" behind the reality.

First, there are a couple of things I can't change: I'm on 32-bit Excel, and I have no direct connections to a DB or server. As such, I am inevitably stuck downloading .xls or .csv files from a Data Warehouse, ERP, or other such web-based system. In testing, using a combination of Power Query and loading into Power Pivot, the data model "failed" due to a lack of system resources when loading from .xlsx.

Specifically:

.xlsx - ~800K records and 46 fields - Ran out of system resources and was not able to load the dataset

.txt - 7.5M records and 46 fields - Same data, duplicated for testing. I stopped at 7.5M records, but could have kept going because I wasn't near the 2 GB limit for 32-bit Excel

This means it's certainly worth the time to convert to a text-type file, but why is there such a dramatic difference?

Thanks!

Chad

 
Posted : 26/02/2017 4:28 pm
(@mynda)
Posts: 4761
Member Admin
 

Hi Chad,

When you imported the Excel data did you open the Excel file containing the data and then load to Power Query and Power Pivot all from within the same workbook?

Mynda

 
Posted : 26/02/2017 8:30 pm
(@chadkukorola)
Posts: 10
Active Member
Topic starter
 

Mynda,

In my original test it was raw data from .xlsx into Power Pivot, and then raw data from .txt into Power Pivot (2013). I stated incorrectly above, as I didn't have Power Query available. In the end, when loading files (even into .accdb), the .csv/.txt are always handled more quickly by the software.

So, I just never understood why there's a difference. Now that I have Power Query in my toolset at work, it leads me to believe the flat files are still a better option than .xls (maybe not), but I just don't understand the technical reasons.

Chad

 
Posted : 27/02/2017 11:36 am
(@mynda)
Posts: 4761
Member Admin
 

Hi Chad,

Can you please clarify a couple of points:

1. When you load the data into the Power Pivot model are you loading data from workbook A into the Power Pivot model in Workbook B, or are you loading data in Workbook A into the Power Pivot model in Workbook A?

2. Is the raw data in Excel already formatted in an Excel Table, or not?

Mynda

 
Posted : 27/02/2017 10:45 pm
(@chadkukorola)
Posts: 10
Active Member
Topic starter
 

Mynda,

1) Raw data is in Workbook A, then pulled into a data model in Workbook B.

2) In this case, the data in Workbook A was not in a table.

Workbook A was a download from an ERP (~50K records). For the test, I copied and pasted until the .xlsx was at its 1.04M record limit. Same process for the .txt file. As noted before, the .xlsx loaded approximately 80% into Workbook B before Power Pivot failed. The .txt file with 7.5M records loaded into Workbook B without issue.

Chad

 
Posted : 28/02/2017 7:33 pm
(@mynda)
Posts: 4761
Member Admin
 

Hi Chad,

Thanks for clarifying. I think it will be down to the 32-bit limitation on your version of Excel.

Is the Excel file containing the 1.04M records open when you load into Power Pivot? If so, have you tried loading it with the workbook closed to see if performance is any better?

Mynda

 
Posted : 28/02/2017 8:26 pm
(@chadkukorola)
Posts: 10
Active Member
Topic starter
 

Mynda,

No, the raw data file was closed during loading. My only thought is that pulling data from .xls* carries some .xls* "baggage", formatting or otherwise, which is compounded when scaling. However, I've not been able to find the technical answer.

Also, to avoid confusing folks, I have no performance issues after the data is loaded, simply on loading into the data model. It doesn't look like Power Query would overcome the bad data source issue either, so flat files it is!

Chad 

 
Posted : 28/02/2017 9:07 pm
(@mynda)
Posts: 4761
Member Admin
 

Thanks, Chad. I'll ask Microsoft for an explanation.

Mynda

 
Posted : 01/03/2017 12:11 am
(@mynda)
Posts: 4761
Member Admin
 

Hi Chad,

No official word from Microsoft yet, but a fellow MVP offered this explanation:

"If I were to take a wild guess, I would say that 32-bit Excel is running out of memory.

If you examine the internals of a .xlsx, it’s a complete ball of string: text mixed with code mixed with formulas.

Excel has to parse it all, lay it out in memory as a tabular structure, update it all, then extract the values.

It’s limited in the amount of RAM it can attract to do all this in: my guess is it sometimes can’t get enough?"
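The parsing-cost part of that guess is easy to demonstrate with a stdlib-only toy. The sketch below mimics neither ACE nor the real SpreadsheetML schema; it only contrasts splitting delimited text against building and walking an XML tree for the same data:

```python
import csv
import io
import time
import xml.etree.ElementTree as ET

# The same 100,000-row, 3-column dataset in two shapes: a flat CSV string,
# and an XML string loosely mimicking the sheet XML inside a .xlsx
# (not the real SpreadsheetML schema -- just enough to show the parse cost).
rows = [(i, f"name{i}", i * 2) for i in range(100_000)]

csv_buf = io.StringIO()
csv.writer(csv_buf).writerows(rows)
csv_text = csv_buf.getvalue()

xml_text = "<sheet>" + "".join(
    f"<row><c>{a}</c><c>{b}</c><c>{c}</c></row>" for a, b, c in rows
) + "</sheet>"

t0 = time.perf_counter()
csv_rows = list(csv.reader(io.StringIO(csv_text)))
t_csv = time.perf_counter() - t0

t0 = time.perf_counter()
xml_rows = [[c.text for c in row] for row in ET.fromstring(xml_text)]
t_xml = time.perf_counter() - t0

print(f"csv: {t_csv:.3f}s  xml: {t_xml:.3f}s  ({len(csv_rows):,} rows each)")
```

On most machines the XML path is noticeably slower, and it also holds the whole element tree in memory before a single value comes out, which is the pattern the quote describes.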

This makes sense to me since the .xlsx file format is actually a kind of zip file or container of items that make up the file. Bed time reading on .xlsx file format here:

https://msdn.microsoft.com/en-us/library/aa338205%28v=office.12%29.aspx?f=255&MSPPError=-2147217396
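You can confirm the container structure with Python's zipfile module. A small sketch (no real workbook to hand here, so it builds a minimal stand-in zip with the typical part layout; on a real .xlsx you would pass the file path instead):

```python
import io
import zipfile

def list_parts(xlsx_file):
    """List the internal parts of an .xlsx -- it is just a zip archive."""
    with zipfile.ZipFile(xlsx_file) as zf:
        return zf.namelist()

# Minimal stand-in mimicking a real workbook's layout; on a real file,
# list_parts("report.xlsx") (hypothetical path) shows the same kinds of parts.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("[Content_Types].xml", "<Types/>")
    zf.writestr("xl/workbook.xml", "<workbook/>")
    zf.writestr("xl/sharedStrings.xml", "<sst/>")
    zf.writestr("xl/worksheets/sheet1.xml", "<worksheet/>")

print(list_parts(buf))
```

Every one of those parts is XML that has to be unzipped and parsed before a single cell value is available, whereas a .csv is already the values.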

Another point to note is it's best practice to bring your data into Power Pivot via Power Query. A word of advice from another MVP:

"I will NEVER EVER, EVER import a text file directly into Power Pivot again.  Period.  Been there, done that, been burned.

When you set up your initial connection it bases the data on a preview of the first 200 rows, iirc.  And if it gets it wrong, and you build your BI on it, my recollection is that you get to rebuild the BI when you realize it.  (You can’t add a schema file afterwards to fix it.)"
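Power Pivot's sampling is internal to the engine, but the failure mode in that warning is easy to sketch: any inference that only sees a preview can guess a type the rest of the file contradicts. A hypothetical ID column, stdlib only:

```python
import csv
import io

def infer_type(sample):
    """Naive preview-based inference: int if every sampled value parses as int."""
    try:
        for v in sample:
            int(v)
        return int
    except ValueError:
        return str

# Hypothetical export: an ID column that looks numeric for 200 rows,
# then an alphanumeric ID appears further down the file.
csv_text = "\n".join(str(i) for i in range(200)) + "\n00123A"
values = [row[0] for row in csv.reader(io.StringIO(csv_text))]

guessed = infer_type(values[:200])  # the "preview" sees only numeric rows
conversion_failed = False
try:
    converted = [guessed(v) for v in values]
except ValueError:
    conversion_failed = True        # row 201 breaks the guessed schema

print(guessed, "inferred; full-column conversion failed:", conversion_failed)
```

Going through Power Query instead lets you declare the column types explicitly, so a late-arriving odd value surfaces as a fixable step rather than a baked-in wrong schema.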

Mynda

 
Posted : 01/03/2017 6:34 am
(@chadkukorola)
Posts: 10
Active Member
Topic starter
 

Hi Mynda,

Thanks for the info, makes sense to me. I have experienced some of the heartache from direct data connections to Power Pivot. I'd love to hear from MS if they do get back to you.

Bottom line, I'm loving Power Query!!

Chad

 
Posted : 01/03/2017 1:26 pm
(@mynda)
Posts: 4761
Member Admin
 

Hi Chad,

Microsoft say:

"Power Pivot uses ACE for reading both types of files and provide an abstraction. It might have to do with that provider’s handling of such files. I wonder if you can use the provider to read data into Access for example, Mynda, to see what the performance difference is outside Power Pivot."

So, if you have time, can you try importing the same Excel and CSV files into Access to see if you get similar results to when you load into Power Pivot? This will help identify whether or not the cause is ACE.

Mynda

P.S. I'm so pleased you're loving Power Query 🙂

 
Posted : 01/03/2017 8:41 pm
(@catalinb)
Posts: 1937
Member Admin
 

Hi Chad,

If you're using a 64-bit operating system and Office is the 32-bit version, and you get the Out of Memory error or "Excel cannot complete this task with available resources.", you can increase the 2 GB memory limit by using a patch:

http://ntcore.com/4gb_patch.php

There is a Microsoft Update release that does a similar thing: update-for-excel-2013-kb3115162

More details can be found in the first link.
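For background, the ntcore patch works by setting the LARGEADDRESSAWARE bit in the executable's PE header, which lets a 32-bit process address up to 4 GB on a 64-bit OS instead of the default 2 GB. A sketch that reads that bit, exercised here on a synthetic header since there's no Windows executable to hand:

```python
import struct

IMAGE_FILE_LARGE_ADDRESS_AWARE = 0x0020

def is_large_address_aware(header_bytes):
    """Return True if a PE executable's COFF header has the LAA bit set.

    This is the bit the 4GB patch flips. Pass the first few hundred bytes
    of the file, e.g. open(path, "rb").read(512) on EXCEL.EXE.
    """
    if header_bytes[:2] != b"MZ":
        raise ValueError("not a PE executable")
    pe_offset = struct.unpack_from("<I", header_bytes, 0x3C)[0]  # e_lfanew
    if header_bytes[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("PE signature not found")
    # COFF header after the 4-byte signature: Machine(2), NumberOfSections(2),
    # TimeDateStamp(4), PointerToSymbolTable(4), NumberOfSymbols(4),
    # SizeOfOptionalHeader(2), Characteristics(2) -> offset +22.
    characteristics = struct.unpack_from("<H", header_bytes, pe_offset + 22)[0]
    return bool(characteristics & IMAGE_FILE_LARGE_ADDRESS_AWARE)

# Synthetic header standing in for a patched executable.
fake = bytearray(0x60)
fake[0:2] = b"MZ"
struct.pack_into("<I", fake, 0x3C, 0x40)          # e_lfanew -> PE header at 0x40
fake[0x40:0x44] = b"PE\x00\x00"
struct.pack_into("<H", fake, 0x40 + 22, 0x0120)   # EXECUTABLE_IMAGE | LAA
print(is_large_address_aware(bytes(fake)))        # True for the patched case
```

The KB3115162 update Catalin mentions achieves a similar effect through Microsoft's own channel rather than patching the binary by hand.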

 
Posted : 02/03/2017 3:39 am
(@mynda)
Posts: 4761
Member Admin
 

Hi Chad,

Please test Access before using the patch Catalin suggests so that we can compare like for like.

Thanks,

Mynda

 
Posted : 02/03/2017 6:14 am
(@chadkukorola)
Posts: 10
Active Member
Topic starter
 

Mynda,

I attached a Word document with the results of my testing. Hopefully I did the upload correctly. Of note, it appears the Power Query engine handles .xlsx well compared to VertiPaq. The test data set is obviously an extreme case, but the fact that this is the format of my data sources (a constraint I can't change) is very real.

Let me know what you think!

Chad

 
Posted : 02/03/2017 12:05 pm
(@chadkukorola)
Posts: 10
Active Member
Topic starter
 

Mynda,

I should note:

I tested an .xlsx file with 46 fields and 1.04M records, but with very low cardinality. Power Pivot imported that file with little issue. Impressively, compression took that file from 177MB (the raw data file) to about 700KB once it was loaded into Power Pivot and the Excel file was saved.
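That shrinkage is consistent with the engine's columnar storage: VertiPaq dictionary-encodes each column, so a column with only a handful of distinct values costs almost nothing. A rough stdlib analogy with zlib (not VertiPaq's actual encoding, just the cardinality effect):

```python
import random
import zlib

random.seed(42)
N = 100_000

# Low cardinality: 100k values drawn from just 5 distinct labels,
# like a Region or Status column in an ERP export.
low_card = "".join(random.choice(["North", "South", "East", "West", "HQ"])
                   for _ in range(N)).encode()

# High cardinality: 100k effectively unique values, like a GUID column.
high_card = "".join(f"{random.getrandbits(64):016x}" for _ in range(N)).encode()

low_size = len(zlib.compress(low_card))
high_size = len(zlib.compress(high_card))
print(f"low-cardinality: {low_size:,} bytes, high-cardinality: {high_size:,} bytes")
```

The low-cardinality column compresses far smaller, which is why your 1.04M-row, 46-field file with few distinct values could land at ~700KB while a high-cardinality file of the same shape would not.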

Chad

 
Posted : 02/03/2017 12:33 pm