Problem Scraping Paginated Table- Resets Order After Each Page

(@mikemikemike)
Posts: 1
New Member
Topic starter
 

I found your video "Scrape Data from Multiple Web Pages with Power Query" and found it very helpful. After following the steps, I realized the resulting table contained duplicates. I assumed I had done something wrong and tried it several ways, then realized that the website returns its listings in a random order each time, and that going from page 1 to 2, 3, etc. resets that random order. If it were just duplicates, that would be easy to fix, but each duplicate row in the query takes the place of a record that the random ordering left out.

URL (the page number at the end is the PageStart variable):

Example 1: https://www.nachi.org/certified-inspectors/browse/us?page=1

Example 2: https://www.nachi.org/certified-inspectors/browse/us/florida?page=1

The attached file has Example 1, which is all listings for the United States across 100+ pages. I started scraping Example 2 because searching for the whole country doesn't show which state they're in. I included Example 1 in the file because it's less complicated and has the same duplicate problem. I tried looking at the source code of the page for help figuring out what might be causing the problem. I can see that it might be using a cookie to change "no-js" or "has-js." I don't know much about coding but that might be the trigger that resets the random order. That's my best guess, any help would be greatly appreciated!

 

EDIT: After uploading I noticed I mistyped the source URL in the function. I originally had "https://www.nachi.org/certified-inspectors/browse/us?page="&PageStart&"1" but should have removed the 1 on the end so that it reads &PageStart&"". The duplicate problem is the same either way; the typo just meant the query went to pages 11, 21, 31, etc. instead of 1, 2, 3...


Attachments: 

Capture.PNG- screenshot of website source code

Capture2.PNG- Duplicates on the table coming from different pages

Web Scraping test.pbix- sample file/data

 
Posted : 11/02/2022 1:19 pm
Philip Treacy
(@philipt)
Posts: 1629
Member Admin
 

Hi Mike,

That website isn't returning its results correctly. It shouldn't be duplicating results like that. There's not much we can do to prevent this, as all the code does is query the website.

All you need to do is use Remove Duplicates on the Member URL column.
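
As a rough sketch of how that could look in M, assuming the listing is the first table that Web.Page detects on each page and that PageStart is passed as text (the step names, the GetPage function name and the page range of 1 to 3 are placeholders for illustration, not taken from the attached file):

```powerquery
let
    // Hypothetical function: fetches one page of results.
    // PageStart is text so it can be appended to the URL with &.
    GetPage = (PageStart as text) as table =>
        let
            Source = Web.Page(
                Web.Contents(
                    "https://www.nachi.org/certified-inspectors/browse/us?page=" & PageStart
                )
            ),
            // Assumption: the listing is the first table found on the page
            Data = Source{0}[Data]
        in
            Data,

    // Placeholder range; change 3 to the real number of pages
    PageNumbers = List.Transform({1..3}, each Number.ToText(_)),

    // Call the function once per page and stack the results
    Pages = List.Transform(PageNumbers, GetPage),
    Combined = Table.Combine(Pages),

    // Keep one row per Member URL (what Remove Duplicates generates)
    Deduplicated = Table.Distinct(Combined, {"Member URL"})
in
    Deduplicated
```

Note that this only drops the repeated rows. Any listing the site skipped because of the reshuffled order will still be missing from the result.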

Regards

Phil

 
Posted : 15/02/2022 12:01 am