Google-sheets – how to clean up table data in Google Sheet IMPORTHTML

google sheetsimporthtml

I want to import a table of nuclear power output from https://ru.wikipedia.org/wiki/Атомная_энергетика_по_странам.

For some reason, the second col has the number duplicated, eg

<td><span style="display: none; speak: none;">000935</span>
935</td>

Is there a way to affect the work of IMPORTHTML() to get only the first child of <td>?

Similarly, in the second table, last column, Google Sheet extracts this:

British Nuclear Fuels (англ.)русск.

I want to remove the texts (англ.) and русск.

The text I need is again in the first child of <td>:

<td><a href="//en.wikipedia.org/wiki/British_Nuclear_Fuels" class="extiw" title="en:British Nuclear Fuels">British Nuclear Fuels</a>
<span style="font-size:95%; position: relative; top: .4em">&nbsp;<span class="ref-info" title="на английском языке" style="font-size:85%; cursor:help; color:#888;">(англ.)</span>
</span><span class="link-ru" style="font-size:80%; margin-left:-1.7em; position: relative; top: -.4em;"><a href="/w/index.php?title=British_Nuclear_Fuels&amp;action=edit&amp;redlink=1" class="new" title="British Nuclear Fuels (страница отсутствует)">русск.</a></span></td>

Also, I'd like to clean up bracketed references from col headings, eg [7]

Best Answer

  • The reason the numbers are duplicated is because they use the ~ wiki template for the purpose of sorting | {{~|001627}}1 627
  • the new better way to sort is: https://meta.wikimedia.org/wiki/Help:Sorting#Specifying_a_sort_key
  • I was able to clean with formulas, eg source cleaned halved check 101240102 709 101240102709 102709 FALSE

Using these formulas:

  • cleaned: =REGEXREPLACE(TEXT(source,"0"),"^0+| ","")
  • halved: =replace(cleaned,1,len(cleaned)/2,"")
  • check: =cleaned=concat(halved,halved)

A few values don't check out, but the difference between the two halves is small, so that's ok