Question

I have a table with about 50,000 records. It looks like this:

Animal | Name   | Color | Legs
Cat    | George | Black | 4
Cat    | Bob    | Brown | 4
Cat    | Dil    | Brown | 4
Bird   | Irv    | Green | 2
Bird   | Van    | Red   | 2

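etc.

I want to insert Cat only once, Bird only once, and so on. The name, color, legs, etc. should be the first value it finds.

The table has 10 columns and 50k rows.

I tried insert into MyNewTable Select Distinct * From MyAnimalTable, but that didn't work. I also tried group by, but that didn't work either.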

Answer 1

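You can GROUP BY the animal alone and select the remaining columns with MAX() to get a single value per column for each animal. Note that MAX() returns the largest value in each column rather than the first one found, so the values in a result row may come from different source rows.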

insert into MyNewTable 
Select MAT.Animal,max(MAT.Name),max(MAT.Color),max(MAT.Legs)
From MyAnimalTable MAT GROUP BY MAT.Animal
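
To see what this query returns on the sample rows from the question, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is only an assumption for the demo (the question doesn't name a DBMS), and the column types are guessed from the sample data. Because MAX() is evaluated per column, the Cat row comes back as George/Brown, a combination that doesn't exist in the source table.

```python
import sqlite3

# In-memory SQLite database; the DBMS and column types are assumptions for this demo.
conn_max = sqlite3.connect(":memory:")
conn_max.execute(
    "CREATE TABLE MyAnimalTable (Animal TEXT, Name TEXT, Color TEXT, Legs INTEGER)"
)
conn_max.executemany(
    "INSERT INTO MyAnimalTable VALUES (?, ?, ?, ?)",
    [
        ("Cat", "George", "Black", 4),
        ("Cat", "Bob", "Brown", 4),
        ("Cat", "Dil", "Brown", 4),
        ("Bird", "Irv", "Green", 2),
        ("Bird", "Van", "Red", 2),
    ],
)
conn_max.execute(
    "CREATE TABLE MyNewTable (Animal TEXT, Name TEXT, Color TEXT, Legs INTEGER)"
)

# The query from this answer: one row per animal, MAX() applied to each other column.
conn_max.execute(
    """
    INSERT INTO MyNewTable
    SELECT MAT.Animal, MAX(MAT.Name), MAX(MAT.Color), MAX(MAT.Legs)
    FROM MyAnimalTable MAT
    GROUP BY MAT.Animal
    """
)

max_rows = conn_max.execute("SELECT * FROM MyNewTable ORDER BY Animal").fetchall()
for row in max_rows:
    print(row)
# ('Bird', 'Van', 'Red', 2)
# ('Cat', 'George', 'Brown', 4)   <- George is Black in the source table
```

If any value per column is acceptable, this is fine; if each output row has to match one real source row, a per-row technique is needed instead.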

Answer 2

Use ROW_NUMBER to number the rows per animal and keep the rows numbered 1.

insert into mynewtable (animal, name, color, legs)
select animal, name, color, legs
from
(
  select 
    animal, name, color, legs,
    row_number() over (partition by animal order by animal) as rn
  from myanimaltable a
) numbered
where rn = 1;

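(This numbers the records per animal arbitrarily, so you get the first record per animal "that the DBMS finds". If you want a particular order, you must specify it in the ORDER BY clause after PARTITION BY.)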
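
As a sketch of the "particular order" case, the example below runs the same ROW_NUMBER query through Python's built-in sqlite3 module, ordering each partition by a hypothetical `id` column that records insertion order. That column is not in the original table; it is assumed here only to make "first" well defined. SQLite is also an assumption, and window functions need SQLite 3.25 or newer.

```python
import sqlite3

# In-memory SQLite database; `id` is a hypothetical ordering column added for this demo.
conn_rn = sqlite3.connect(":memory:")
conn_rn.execute(
    "CREATE TABLE myanimaltable ("
    "id INTEGER PRIMARY KEY, animal TEXT, name TEXT, color TEXT, legs INTEGER)"
)
conn_rn.executemany(
    "INSERT INTO myanimaltable (animal, name, color, legs) VALUES (?, ?, ?, ?)",
    [
        ("Cat", "George", "Black", 4),
        ("Cat", "Bob", "Brown", 4),
        ("Cat", "Dil", "Brown", 4),
        ("Bird", "Irv", "Green", 2),
        ("Bird", "Van", "Red", 2),
    ],
)
conn_rn.execute(
    "CREATE TABLE mynewtable (animal TEXT, name TEXT, color TEXT, legs INTEGER)"
)

# Number the rows within each animal by id and keep only row 1 of each group.
conn_rn.execute(
    """
    insert into mynewtable (animal, name, color, legs)
    select animal, name, color, legs
    from
    (
      select
        animal, name, color, legs,
        row_number() over (partition by animal order by id) as rn
      from myanimaltable
    ) numbered
    where rn = 1
    """
)

first_rows = conn_rn.execute(
    "select animal, name, color, legs from mynewtable order by animal"
).fetchall()
for row in first_rows:
    print(row)
# ('Bird', 'Irv', 'Green', 2)
# ('Cat', 'George', 'Black', 4)
```

Each output row here is a complete row from the source table, and the choice of row is repeatable because the ordering column is unique.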

Answer 3

Try this:

SELECT A.Animal
    ,B.NAME
    ,C.color
    ,A.Legs
FROM (
    SELECT DISTINCT Animal
        ,Legs
    FROM tablename
    ) A
CROSS JOIN (
    SELECT DISTINCT NAME
    FROM tablename
    ) B
CROSS JOIN (
    SELECT DISTINCT Color
    FROM tablename
    ) C

Answer 4

You can try the following:

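# Assumes an existing SparkContext `sc` and SQLContext `sqlContext` (e.g. the PySpark shell)
from pyspark.sql import Row

# Create a simple DataFrame, stored into a partition directory
df1 = sqlContext.createDataFrame(sc.parallelize(range(1, 6))
                                   .map(lambda i: Row(single=i, double=i * 2)))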
df1.save("data/test_table/key=1", "parquet")

# Create another DataFrame in a new partition directory,
# adding a new column and dropping an existing column
df2 = sqlContext.createDataFrame(sc.parallelize(range(6, 11))
                                   .map(lambda i: Row(single=i, triple=i * 3)))
df2.save("data/test_table/key=2", "parquet")

# Read the partitioned table
df3 = sqlContext.parquetFile("data/test_table")
df3.printSchema()

Answer 5

The best simple solution for the query:

import requests

splunkhost = 'MY_IP'
splunkUrl = 'https://%s:8090/services/collector/event' % splunkhost
splunkData = {'index':'client-myclient','event':'Event Python Test'}
splunkResponse = requests.get(splunkUrl, headers={'Authorization': 'Splunk AUTH_CODE'}, data = splunkData, verify=False)
print splunkResponse.text

/Library/Python/2.7/site-packages/requests/packages/urllib3/connectionpool.py:789: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.org/en/latest/security.html InsecureRequestWarning)
{"text":"The requested URL was not found on this server.","code":404}

How do I insert values as distinct columns?
