Google-sheets – how to clean up table data in Google Sheet IMPORTHTML

google sheetsimporthtml

I want to import a table of nuclear power output from https://ru.wikipedia.org/wiki/Атомная_энергетика_по_странам.

For some reason, the second col has the number duplicated, eg

<td><span style="display: none; speak: none;">000935</span>
935</td>

Is there a way to affect the work of IMPORTHTML() to get only the first child of <td>?

Similarly, in the second table, last column, Google Sheet extracts this:

British Nuclear Fuels (англ.)русск.

I want to remove the texts (англ.) and русск.

The text I need is again in the first child of <td>:

<td><a href="//en.wikipedia.org/wiki/British_Nuclear_Fuels" class="extiw" title="en:British Nuclear Fuels">British Nuclear Fuels</a>
<span style="font-size:95%; position: relative; top: .4em">&nbsp;<span class="ref-info" title="на английском языке" style="font-size:85%; cursor:help; color:#888;">(англ.)</span>
</span><span class="link-ru" style="font-size:80%; margin-left:-1.7em; position: relative; top: -.4em;"><a href="/w/index.php?title=British_Nuclear_Fuels&amp;action=edit&amp;redlink=1" class="new" title="British Nuclear Fuels (страница отсутствует)">русск.</a></span></td>

Also, I'd like to clean up bracketed references from col headings, eg [7]

Best Answer

The reason the numbers are duplicated is because they use the ~ wiki template for the purpose of sorting | {{~|001627}}1 627
the new better way to sort is: https://meta.wikimedia.org/wiki/Help:Sorting#Specifying_a_sort_key
I was able to clean with formulas, eg source cleaned halved check 101240102 709 101240102709 102709 FALSE

Using these formulas:

cleaned: =REGEXREPLACE(TEXT(source,"0"),"^0+| ","")
halved: =replace(cleaned,1,len(cleaned)/2,"")
check: =cleaned=concat(halved,halved)

A few values don't check out, but the difference between the two halves is small, so that's ok

Related Solutions

Google Sheets – Modifying Data Imported with ImportHTML

The solution was to select all the cells (I can't find a Select All menu item, but on OS X Command-A worked), then choose Copy, then Paste special/Paste values only. Bingo: all the cells whose values depend on the ImportHTML() in A1 now have their values pasted.

Google-sheets – Get the sum of every row in a table imported with Importhtml

I don't see why you have arrayformula and index there; they don't do much (other than index messing up formatting of dates and currency). I don't see what "0" is supposed to do in the importhtml command, either.

In order to get the sum of particular columns, you can use query.

=query(importhtml("http:...", "table"), "select Col1, Col3+Col4+Col5+Col6 label Col3+Col4+Col5+Col6 'Total'")

(The part label Col3+Col4+Col5+Col6 'Total' is optional, it just makes the header look nicer.)

If you want to also have the quarter data, then include the quarters column too; you may also take the opportunity to rearrange them chronologically

=query(importhtml("http:...", "table"), "select Col1, Col3, Col6, Col5, Col4, Col3+Col4+Col5+Col6 label Col3+Col4+Col5+Col6 'Total'")

Finally, notice that the ending dates of each quarter is not the sort of thing we want to add up. One way to avoid adding them is to use the third parameter of query, the number of headers:

=query(..., "...", 2)

This tells the query that the second row is also a header, so its contents will be added to the header. Another solution is to add a where clause to the query string to exclude this row:

select ... where Col1 != 'Quarter Ending:' label ...

Best Answer

Related Solutions

Google Sheets – Modifying Data Imported with ImportHTML

Google-sheets – Get the sum of every row in a table imported with Importhtml

Related Topic